PHP Kata: Bank OCR Numbers part


This article continues the PHP Kata series on Bank OCR. In it, I demonstrate how to handle number parsing - our next step toward finishing the first user story.

Foreword

This is the second article in the series about solving the PHP Kata Bank OCR. If you haven't read the previous one - please do it right now: PHP Kata: Bank OCR Introduction.

As in the previous article, I focus on architecture design, and again I try to answer as many "why" questions as I can.

The next puzzle

I love to work with small and straightforward goals, mostly because I like the feeling when I achieve them. So for this article, let's set one. Following our journey from the bottom to the top of our breakdown structure, the next goal can be: parse a single file record into an account number.

If you read the previous article carefully, you noticed that I already mentioned this goal. Back then, it was too big to manage. Now I find it perfectly sized for us. And here we have our first why. A single record in a statement file is nothing more than 4 lines of text. Three of them are the encoded version of our number, and the last one is a record separator. Each encoded line holds 9 digits, and every digit is 3 characters wide. We already have something that can parse a single digit - so parsing a record is as simple as running it 9 times.

Number design

As before, we start with a UML design for our solution. To do it well, we need answers to the usual questions: what does a number have, and what does a number do? I can think of the same things as for a digit: two methods - the first one for getting a parsed number, and the second one for getting its raw version.

The next question is about the relationship between a number and a digit. I would say that there is no number without digits. In other words, digits are the building blocks of numbers. The bond between them is strong, so the relationship they have is composition.

I think we now have enough knowledge to create an appropriate design. We can even reuse part of our previous design. Are you also thinking about DigitInterface? And I did it again: I forgot about quite an important thing - input and output types. Take a look at the DigitInterface::get method. It returns an integer or null. Why might this be a problem? Because of other requirements - and to be precise, this single one:

If some characters are illegible, they are replaced by a ?.

With an integer as the output of a number, it would be impossible to return ?. This is why we need to introduce a separate interface for numbers. And I think that NumberInterface::get should return a string.
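To make the type argument concrete, here is a minimal sketch. The joinDigits function is a hypothetical helper, not part of the kata code, and it assumes that null stands for a digit that could not be recognized. It shows two things an integer return type would lose: leading zeros and the ? placeholder.

```php
<?php

/**
 * Hypothetical helper: joins parsed digits into an account number string.
 * Assumption: null marks a digit that could not be recognized.
 *
 * @param array<int, int|null> $digits
 */
function joinDigits(array $digits): string
{
    $number = '';

    foreach ($digits as $digit) {
        // A string can hold the "?" placeholder; an integer cannot.
        $number .= $digit === null ? '?' : (string) $digit;
    }

    return $number;
}

// Casting an account number to an integer drops its leading zeros.
$asInteger = (int) '000000051';

// The string form keeps both the zeros and the placeholder.
$withIllegible = joinDigits([0, 0, 0, 0, 0, 0, 0, null, 1]);
```

Here $asInteger ends up as 51, while $withIllegible keeps all nine positions.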

We can sum it up with this UML diagram:

Digit UML diagram

It is time to develop some code.

Code a number

Before you start coding - make sure you understand our design. If you need time - take it. Think it through. Without a good understanding of the design, it will be hard for you to code it well. Some time ago I even had an idea to prepare T-shirts with "think twice, code once". If you are interested in buying such a T-shirt - send me an email. If at least a few of you are interested - I will make it happen.

OK, let's leave the T-shirt topic and start with tests. As always, we begin with the simplest one we can imagine:

<?php

declare(strict_types=1);

namespace WeBee\School\BankOcrKata\Spec;

use WeBee\School\BankOcrKata\Number\Number;

describe(
    'Number',
    function () {
        it(
            'can be instantiated',
            function () {
                $n = new Number();

                expect($n)->toBeAnInstanceOf('WeBee\School\BankOcrKata\Number\Number');
                expect($n)->toBeAnInstanceOf('WeBee\School\BankOcrKata\Number\NumberInterface');
            }
        );
    }
);

And code that makes this test pass. Interface first:

<?php

declare(strict_types=1);

namespace WeBee\School\BankOcrKata\Number;

interface NumberInterface
{
}

and then class:

<?php

declare(strict_types=1);

namespace WeBee\School\BankOcrKata\Number;

class Number implements NumberInterface
{
}

Strict typing

You may wonder why I use strict types in the interface. The answer is straightforward: I have a code snippet defined for every new PHP file, and I'm too lazy to remove the declaration from an interface. It is also harmless. The strict types declaration affects only function calls made within the file where it is declared. Methods declared in an interface have no bodies, so nothing in that file can make a function call. That is why I can safely leave the declaration in place.

You don't need to take my word for it - see the quote below from the PHP documentation about type declarations:

Strict typing applies to function calls made from within the file with strict typing enabled, not to the functions declared within that file. If a file without strict typing enabled makes a call to a function that was defined in a file with strict typing, the caller's preference (coercive typing) will be respected, and the value will be coerced.
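If you want to see the rule in action, here is a minimal single-file sketch. The twice and describeCall functions are made up for this illustration. Because the call to twice is made from a file with strict typing enabled, a string argument is rejected instead of being coerced.

```php
<?php

declare(strict_types=1);

// Hypothetical function with a scalar type declaration.
function twice(int $value): int
{
    return $value * 2;
}

// Calls twice() from within this strict file and reports what happened.
function describeCall(mixed $argument): string
{
    try {
        return (string) twice($argument);
    } catch (TypeError $error) {
        // In strict mode the string '21' is not coerced to the integer 21.
        return 'TypeError';
    }
}

$fromInt = describeCall(21);
$fromString = describeCall('21');
```

An interface file contains no calls like the one inside describeCall, so the declaration has nothing to act on there.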

With this explained, we can go further with our code. As with the Digit, we start with setting and getting a raw number. Again, the unit test goes first, then the code. The last step is to fix the previous test. Why? Because we added a parameter to the constructor. For better readability, I show all of it in one code snippet. The test specification looks like this:

describe(
    'Number',
    function () {
        it(
            'can be instantiated',
            function () {
                $n = new Number('');

                expect($n)->toBeAnInstanceOf('WeBee\School\BankOcrKata\Number\Number');
                expect($n)->toBeAnInstanceOf('WeBee\School\BankOcrKata\Number\NumberInterface');
            }
        );

        it(
            'can receive and return unparsed number',
            function () {
                $n = new Number('aaa');

                expect($n->getRaw())->toBeA('string');
                expect($n->getRaw())->toBe('aaa');
            }
        );
    }
);

The interface and the class can be as follows:

interface NumberInterface
{
    public function getRaw(): string;
}

class Number implements NumberInterface
{
    public function __construct(private string $rawNumber)
    {
    }

    public function getRaw(): string
    {
        return $this->rawNumber;
    }
}

Parsing a number

Finally, it is time to parse a number. I like simplicity. How do you feel about such a number: "000000000". Nothing more than nine zeros. To get the parsed result, we need to implement a get method from our UML diagram. Let's do it!

Also, the test is as simple as usual. But this time, test data needs to mimic actual data. How to do it? The answer is in requirements:

Your first task is to write a program that can take this file and parse it into actual account numbers.

Yes, you got it right. We need to have a file with our test number. Please create a simplified version of a file with account numbers to parse. Put nine zeros into it. It should look like this:

 _  _  _  _  _  _  _  _  _
| || || || || || || || || |
|_||_||_||_||_||_||_||_||_|

Now, save it in the spec/test_files/Numbers/000000000.txt file. And we are good to go with the test code:

        given(
            'testNumbers',
            function () {
                return [
                    '000000000' => file_get_contents(__DIR__.'/test_files/Numbers/000000000.txt'),
                ];
            }
        );

        it(
            'can correctly parse numbers',
            function () {
                foreach ($this->testNumbers as $expect => $given) {
                    $n = new Number($given);

                    expect($n->get())->toBeA('string');
                    expect($n->get())->toBe((string) $expect);
                }
            }
        );

What is the benefit of such a test construct? Each time we prepare a new record for testing purposes, we only need to add it as the next element of the testNumbers array.
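One detail of that test deserves a note: the (string) cast on the array key. PHP converts array keys that are valid decimal integers into integers, so a key such as '123456789' arrives in the loop as an int, while '000000000' stays a string because of its leading zeros. A minimal sketch with dummy values instead of file contents:

```php
<?php

// Same shape as the testNumbers array, with dummy values.
$testNumbers = [
    '000000000' => 'first record',
    '123456789' => 'second record',
];

$keyTypes = [];

foreach ($testNumbers as $expect => $given) {
    // Record the type PHP actually stored for each key.
    $keyTypes[] = gettype($expect);
}
```

After the loop, $keyTypes holds 'string' for the first key and 'integer' for the second, which is why the expectation compares against (string) $expect.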

Now we develop code that can pass this test. We utilize the same concept as for digits:

interface NumberInterface
{
    public function getRaw(): string;

    public function get(): string;
}

class Number implements NumberInterface
{
    private string $parsed;

    public function __construct(private string $rawNumber)
    {
        $this->parse();
    }

    public function getRaw(): string
    {
        return $this->rawNumber;
    }

    public function get(): string
    {
        return $this->parsed;
    }

    private function parse(): void
    {
        $this->parsed = '000000000';
    }
}

The above code works - but unfortunately, only for the "000000000" account number. We must fix it.

Number parsing algorithm

One image is worth more than a thousand words. Look at the one below:

000000000 number graphical representation

We already have code that can parse a single digit. That is why our task now is to extract each digit from a number. And this is nothing more than splitting each line into parts three characters long and combining the corresponding parts of the three lines into a single raw digit. For now, we must focus only on making it work. We refactor it later to make it also pretty.
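Before the full method, the splitting step can be tried in isolation. This is a minimal sketch using only str_pad and str_split; the variable names are chosen for this illustration. It also shows why padding matters: a line whose trailing space was trimmed is one character short and has to be padded back to 27 characters before it is split.

```php
<?php

// Middle line of the "123456789" record: nine digits, three characters each.
$line = '  | _| _||_||_ |_   ||_||_|';
$chunks = str_split($line, 3);

// Top line of the "000000000" record with its trailing space trimmed,
// as an editor might save it: 26 characters instead of 27.
$trimmed = rtrim(str_repeat(' _ ', 9));

// Pad back to the full width, then cut into three-character blocks.
$paddedChunks = str_split(str_pad($trimmed, 27, ' '), 3);
```

Both $chunks and $paddedChunks contain nine blocks, one per digit position.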

    private function parse(): void
    {
        $lines = explode("\n", $this->rawNumber);
        // raw number has 4 lines - but we do not need the last one
        $lines = array_slice($lines, 0, 3);
        $digits = [];

        foreach ($lines as $line) {
            // we must be sure that our line is exactly 27 characters long
            $line = str_pad($line, 27, ' ');
            $lineBlocks = str_split($line, 3);
            $index = 0;
            foreach ($lineBlocks as $lineBlock) {
                if (!isset($digits[$index])) {
                    $digits[$index] = '';
                }
                $digits[$index++] .= "$lineBlock\n";
            }
        }

        $this->parsed = '';

        foreach ($digits as $rawDigit) {
            $digit = new Digit($rawDigit);
            $this->parsed .= $digit->get();
        }
    }

The above is dirty, but it does the trick, and this is our short-term goal. Try it by executing our tests. Before we start refactoring, we must add a few more test cases to be sure that everything is working correctly. I like it when a single test covers as many possibilities as possible. So what do you think about such a number: "123456789"? Create another test file:

    _  _     _  _  _  _  _
  | _| _||_||_ |_   ||_||_|
  ||_  _|  | _||_|  ||_| _|

Save it to spec/test_files/Numbers/123456789.txt and add it to the test:

        given(
            'testNumbers',
            function () {
                return [
                    '000000000' => file_get_contents(__DIR__.'/test_files/Numbers/000000000.txt'),
                    '123456789' => file_get_contents(__DIR__.'/test_files/Numbers/123456789.txt'),
                ];
            }
        );

Now we are ready for refactoring.

Number parser refactoring

The biggest problem with our parse method is that it violates the Single Responsibility Principle. We need to try to split it. I propose to extract two separate methods: one for processing lines and one for processing digits. The most exciting part is that we can experiment with our code safely, as we have tests to prove that our changes didn't break a thing.

First extraction

So how do you feel about this code as step one:

    private const EXPECTED_LINE_LENGTH = 27;

    private function parse(): void
    {
        $lines = $this->parseLines();
        $digits = [];

        foreach ($lines as $lineBlocks) {
            $index = 0;
            foreach ($lineBlocks as $lineBlock) {
                if (!isset($digits[$index])) {
                    $digits[$index] = '';
                }
                $digits[$index++] .= "$lineBlock\n";
            }
        }

        $this->parsed = '';

        foreach ($digits as $rawDigit) {
            $digit = new Digit($rawDigit);
            $this->parsed .= $digit->get();
        }
    }

    private function parseLines(): array
    {
        $parsedLines = array_slice(explode("\n", $this->rawNumber), 0, 3);

        foreach ($parsedLines as &$line) {
            $line = str_split(str_pad($line, self::EXPECTED_LINE_LENGTH, ' '), 3);
        }
        // break the reference so a later write to $line cannot alter the array
        unset($line);

        return $parsedLines;
    }

We have a separate method responsible only for transforming lines into line blocks and nothing else. Also, our code is less mysterious, thanks to replacing 27 with a constant that exactly describes what 27 is.

Second extraction

Time to extract digits parsing. Our code can look like this:

    private function parse(): void
    {
        $this->parseDigits($this->parseLines());
    }

    private function parseDigits(array $lines): void
    {
        $this->parsed = '';

        foreach ($lines[0] as $index => $block) {
            $rawDigit = implode(
                "\n",
                [$block ?? '', $lines[1][$index] ?? '', $lines[2][$index] ?? '']
            );
            $digit = new Digit($rawDigit);
            $this->parsed .= $digit->get();
        }
    }

Now we have a method that is responsible only for parsing digits and nothing more. We also simplified the parse method - now it is one line long. And as a cherry on top, we reduced the number of loops.
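If you like to experiment further, the column grouping can also be written as a transposition. This is only a sketch of an alternative, not the code used in this kata: groupBlocksByDigit is a hypothetical function, and it assumes it always receives exactly three rows of blocks. It relies on array_map with a null callback, which zips several arrays together.

```php
<?php

/**
 * Hypothetical alternative to the column loop: transposes three rows of
 * three-character blocks into one raw digit per column.
 *
 * @param array<int, array<int, string>> $lines exactly three rows of blocks
 *
 * @return array<int, string> raw digits, three lines joined by "\n"
 */
function groupBlocksByDigit(array $lines): array
{
    // With a null callback and several arrays, array_map() zips them:
    // element N of the result holds element N of every input array.
    $columns = array_map(null, ...$lines);

    return array_map(
        static fn (array $column): string => implode("\n", $column),
        $columns
    );
}

// The "123456789" record, padded to 27 characters and cut into blocks.
$lines = [
    str_split(str_pad('    _  _     _  _  _  _  _', 27, ' '), 3),
    str_split(str_pad('  | _| _||_||_ |_   ||_||_|', 27, ' '), 3),
    str_split(str_pad('  ||_  _|  | _||_|  ||_| _|', 27, ' '), 3),
];

$rawDigits = groupBlocksByDigit($lines);
```

Each element of $rawDigits is a three-line string that could be handed to a digit parser. The trade-off is that the transposition is shorter but less forgiving: unlike the ?? fallbacks above, it expects all three rows to be present.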

We need to do one last thing - apply the coding standards we agreed to. It is as complicated as executing a single command:

./vendor/bin/php-cs-fixer fix --diff

And that's it!

To fully implement the first requirement, we need something more - but that is a story for the next article.

Keep calm and code with WeBee!

Full code listing

You can find the full code for this Bank OCR Kata in PHP in my GitHub repository.

WeBee.Online Adam Wojciechowski
ul. Władysława Łokietka 5/2
70-256 Szczecin, Poland