Implementing programmatic file transformations in PHP

Saturday, July 24, 2021

Recently, I released Scribe version 3. It introduced a few changes to the config file, and it came with a little utility: run php artisan scribe:upgrade to automatically migrate your config file from v2 to v3. Here's how that works:

The problem

Major releases are tricky. You can break backwards compatibility, but if you make too many BC breaks, you make it harder (and less likely) for users to upgrade. Nobody likes having to go through their codebase and manually change usages of X to Y. Which is why I love it when tools provide some way to automate this process (for example, recent versions of PHPUnit come with a command to migrate your config file).

A second reason I wanted automated upgrades was discoverability. When you introduce a new feature that's turned on in the config file, most users don't know about it. Manual migration guides only list required changes, so users might upgrade manually without finding out about the new feature. With automation, we can add the new items to their config and turn them on (as long as that doesn't break anything), making the feature easier to discover.

So let's build this.

Assumptions

I planned to make a tool I could publish for others, but first, I wanted to satisfy my own use case well, so I made some assumptions about the config file:

  • It returns a key-value array, which can contain nested arrays
  • Config keys shouldn't have dots in their names, since we use dot notation to refer to them (a.b means item b in a).
  • Config keys should be simple strings, not dynamic (like a function call)
  • There are three kinds of changes: moved (or renamed) items, added items, and removed items.
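
To make the dot-notation assumption concrete, here's a minimal sketch of a lookup helper. The name getByDotKey() is hypothetical (it isn't part of Scribe or the upgrader); it only illustrates why a key that itself contains a dot can't be addressed this way.

```php
<?php
/**
 * Look up a nested config item by its dot-notation key.
 * Hypothetical helper for illustration; returns $default if any segment is missing.
 */
function getByDotKey(array $config, string $dotKey, $default = null)
{
    $current = $config;
    foreach (explode('.', $dotKey) as $segment) {
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return $default;
        }
        $current = $current[$segment];
    }
    return $current;
}

$config = [
    'try_it_out' => ['enabled' => true, 'base_url' => null],
    'a.b' => 'unreachable', // A key with a dot in its name
];

var_dump(getByDotKey($config, 'try_it_out.enabled')); // bool(true)
var_dump(getByDotKey($config, 'a.b'));                // NULL: read as item b in a
```

The second lookup fails because a.b is split into a and b before the search starts, so the literal key 'a.b' is never considered.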

Defining the changes

First thing: define the changes. In my case, there were only two breaking changes for most end-users, plus some added features. Here's a truncated sample of a user's v2 config file:

use Knuckles\Scribe\Extracting\Strategies;
use App\Docs\Strategies\AddPaginationParameters;

return [
    'routes' => [ 
        // Nested array containing user-specified data
    ],
    'interactive' => false,
    'example_languages' => ['bash', 'php', 'javascript'],
    'continue_without_database_transactions' => ['pgsl'],
    'strategies' => [
        'urlParameters' => [
            Strategies\UrlParameters\GetFromLaravelAPI::class,
        ],
        'queryParameters' => [
            AddPaginationParameters::class, // This is a custom user value; the rest are defaults
            Strategies\QueryParameters\GetFromFormRequest::class,
            Strategies\QueryParameters\GetFromQueryParamTag::class,
        ],
        'bodyParameters' => [
            Strategies\BodyParameters\GetFromFormRequest::class,
            Strategies\BodyParameters\GetFromBodyParamTag::class,
        ],
    ],
    // other items ...
];

And here's a sample of the default v3 config file:

use Knuckles\Scribe\Extracting\Strategies;

return [
    'routes' => [ 
        // Sample values
    ],
    'try_it_out' => [
      'enabled' => true,
      'base_url' => null,
    ],
    'example_languages' => ['bash', 'javascript'],
    'database_connections_to_transact' => [config('database.default')],
    'strategies' => [
        'urlParameters' => [
            Strategies\UrlParameters\GetFromLaravelAPI::class,
        ],
        'queryParameters' => [
            Strategies\QueryParameters\GetFromFormRequest::class,
            Strategies\QueryParameters\GetFromInlineValidator::class, // New
            Strategies\QueryParameters\GetFromQueryParamTag::class,
        ],
        'bodyParameters' => [
            Strategies\BodyParameters\GetFromFormRequest::class,
            Strategies\BodyParameters\GetFromInlineValidator::class, // New
            Strategies\BodyParameters\GetFromBodyParamTag::class,
        ],
    ],
    // other items unchanged...
];

So, when we upgrade the user's config, it should become this in v3:

use Knuckles\Scribe\Extracting\Strategies;
use App\Docs\Strategies\AddPaginationParameters;

return [
    'routes' => [ 
        // Keep the user's values
    ],
    'try_it_out' => [
      'enabled' => false, // Whatever was in `interactive` goes here
      'base_url' => null,
    ],
    'example_languages' => ['bash', 'php', 'javascript'], // Keep the user's values
    'database_connections_to_transact' => [config('database.default')],
    'strategies' => [
        'urlParameters' => [
            Strategies\UrlParameters\GetFromLaravelAPI::class,
        ],
        'queryParameters' => [
            AddPaginationParameters::class, // Keep the user's values
            Strategies\QueryParameters\GetFromFormRequest::class,
            Strategies\QueryParameters\GetFromInlineValidator::class, // New in v3
            Strategies\QueryParameters\GetFromQueryParamTag::class,
        ],
        'bodyParameters' => [
            Strategies\BodyParameters\GetFromFormRequest::class,
            Strategies\BodyParameters\GetFromInlineValidator::class, // New in v3
            Strategies\BodyParameters\GetFromBodyParamTag::class,
        ],
    ],
    // other items unchanged...
];

The main changes:

  • Some new items were added in strategies. For instance, GetFromInlineValidator was added under queryParameters.
  • interactive was replaced with try_it_out. Whatever value was in interactive should go in try_it_out.enabled.
  • continue_without_database_transactions was removed and replaced with database_connections_to_transact. The two items mean different things though, so it's not a direct replacement.

All of these changes could be automated fully, except for the last, since the logic of the new item is different from the old. For that item, my plan was to replace the old item with the new one set to the default ([config('database.default')], which evaluates to the user's default database connection). But we had been issuing deprecation warnings for the old item in earlier v2 releases, so hopefully most users had already made the change manually.

Designing the API

Next, I spent some time mulling over the API for the upgrade tool. I wanted two major things:

  • It should take a sample of the new config file and figure out which items were added and which were removed.
  • It should have a simple, fluent API for telling it which items were moved.

It took a lot of iteration, but I ended up with something like this:

$upgrader = Upgrader::ofConfigFile('config/scribe.php', __DIR__ . '/../../config/scribe.php')
  ->dontTouch('routes', 'example_languages')
  ->move('interactive', 'try_it_out.enabled');

if ($this->option('dry-run')) {
   $changes = $upgrader->dryRun();
   // Print changes
   return;
 }

$upgrader->upgrade();

The upgrade command makes use of an Upgrader class that handles the logic behind upgrades.

  • The first argument to Upgrader::ofConfigFile() tells the upgrader where to look for the v2 config file in the user's project. The second argument is the path to our sample v3 config file.
  • The ->move() call tells the upgrader that the value of interactive should be moved to try_it_out.enabled.
  • The ->dontTouch() call tells the upgrader to ignore the routes and example_languages items because they are user-defined. More on that later.

Dry run

I decided to start by implementing a --dry-run option. A "dry run" is a way of showing a user what would happen if they did a certain action, without actually executing the action.

There are two major options to implement a dry run. In the first, the dry run and actual execution use the same logic for figuring out the changes needed. This means that when you run the upgrade command, it figures out the changes. Then, if it's a dry run, it prints the changes; otherwise, it applies them. The benefit here is that the logic is always in sync. This is the option I chose.

The other approach is to have dry run and actual execution use different change-calculation logic. One reason you might want this is if the logic for calculating user-facing changes is somewhat slow; you could then use a faster, different algorithm for actual execution. However, you must remember to update the dry run logic whenever you change the execution logic.
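
Here's a rough sketch of the first option, where the dry run and the real run share a single planning step. The RenamePlanner class, its methods, and the flat try_it_out_enabled key are all made up for this illustration; the actual Upgrader is built up in the rest of this post.

```php
<?php
// Hypothetical, stripped-down illustration: not Scribe's actual Upgrader.
final class RenamePlanner
{
    /** @var array<string, string> old key => new key */
    private array $renames;

    public function __construct(array $renames)
    {
        $this->renames = $renames;
    }

    /** Single source of truth: work out what would change, touching nothing. */
    public function plan(array $config): array
    {
        $changes = [];
        foreach ($this->renames as $old => $new) {
            if (array_key_exists($old, $config)) {
                $changes[] = ['key' => $old, 'new_key' => $new];
            }
        }
        return $changes;
    }

    /** Real execution: consume the exact same plan the dry run reports. */
    public function apply(array $config): array
    {
        foreach ($this->plan($config) as $change) {
            $config[$change['new_key']] = $config[$change['key']];
            unset($config[$change['key']]);
        }
        return $config;
    }
}

$planner = new RenamePlanner(['interactive' => 'try_it_out_enabled']);
$old = ['interactive' => false, 'title' => 'My API'];

$plan = $planner->plan($old); // Dry run: report only
$new = $planner->apply($old); // Execution: same plan, now applied

print_r($plan);
print_r($new);
```

Because apply() can only act on what plan() returns, the dry run output can't drift away from what the upgrade actually does.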

Okay, let's get to work. First, I'll add the basic code for the Upgrader class:

class Upgrader 
{
    protected array $configFiles = [];
    protected array $movedKeys = [];
    protected array $dontTouchKeys = [];

    public function __construct(string $userOldConfigRelativePath, string $sampleNewConfigAbsolutePath)
    {
        $this->configFiles['user_old'] = $userOldConfigRelativePath;
        $this->configFiles['sample_new'] = $sampleNewConfigAbsolutePath;
    }

    public static function ofConfigFile(string $userOldConfigRelativePath, string $sampleNewConfigAbsolutePath): self
    {
        return new self($userOldConfigRelativePath, $sampleNewConfigAbsolutePath);
    }

    public function move(string $oldKey, string $newKey): self
    {
        $this->movedKeys[$oldKey] = $newKey;
        return $this;
    }
    
    /**
     * "Don't touch" these config items. 
     * Useful if they contain arrays with keys specified by the user, 
     * or lists with values provided entirely by the user
     */
    public function dontTouch(string ...$keys): self
    {
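        // array_merge() appends; `+=` is a union by key and would drop keys on repeated calls
        $this->dontTouchKeys = array_merge($this->dontTouchKeys, $keys);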
        return $this;
    }

}
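
A PHP detail that matters when accumulating variadic arguments over several calls: on arrays, the + operator is a union by key, while array_merge() appends and renumbers list items. A quick sketch with made-up values:

```php
<?php
$first = ['routes', 'example_languages']; // Keys 0 and 1
$second = ['strategies'];                 // Key 0

// Union by key: key 0 is already taken, so 'strategies' is silently dropped
$union = $first + $second;

// Append: list keys are renumbered, so nothing is lost
$merged = array_merge($first, $second);

print_r($union);
print_r($merged);
```

So if dontTouch() might be called more than once on the same upgrader, appending is the safer way to collect the keys.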

After this, my first implementation of dry run was pretty straightforward:

class Upgrader 
{
    public const CHANGE_REMOVED = 'removed';
    public const CHANGE_MOVED = 'moved';
    public const CHANGE_ADDED = 'added';
    
    protected array $changes = [];
    // ...

    public function dryRun(): array
    {
        [$userCurrentConfig, $sampleNewConfig] = $this->loadConfigs();

        $this->fetchAddedItems($userCurrentConfig, $sampleNewConfig);
        $this->fetchRemovedAndMovedItems($userCurrentConfig, $sampleNewConfig);

        return $this->changes;
    }

    public function loadConfigs(): array
    {
        $userCurrentConfig = require $this->configFiles['user_old'];
        $incomingConfig = require $this->configFiles['sample_new'];

        return [$userCurrentConfig, $incomingConfig];
    }

    protected function fetchAddedItems(array $userCurrentConfig, array $incomingConfig)
    {
        // Loop over the new config
        foreach ($incomingConfig as $key => $value) {
            if ($this->shouldntTouch($key)) {
                continue;
            }

            // Key is in new, but not in old
            if (!array_key_exists($key, $userCurrentConfig)) {
                $this->changes[] = [
                    'type' => self::CHANGE_ADDED,
                    'key' => $key,
                    'description' => "- `{$key}` will be added.",
                    'value' => $value,
                ];
            }

        }
    }
    
    protected function fetchRemovedAndMovedItems(array $userCurrentConfig, $incomingConfig)
    {
        // Loop over the old config
        foreach ($userCurrentConfig as $key => $value) {
        
            // Key is in old, but was moved somewhere else in new
            if ($this->wasKeyMoved($key)) {
                $this->changes[] = [
                    'type' => self::CHANGE_MOVED,
                    'key' => $key,
                    'new_key' => $this->movedKeys[$key],
                    'description' => "- `$key` will be moved to `{$this->movedKeys[$key]}`.",
                ];
                continue;
            }

            // Key is in old, but not in new
            if (!array_key_exists($key, $incomingConfig)) {
                $this->changes[] = [
                    'type' => self::CHANGE_REMOVED,
                    'key' => $key,
                    'description' => "- `$key` will be removed.",
                ];
            }
        }
    }
    
    protected function wasKeyMoved(string $oldKey): bool
    {
        return array_key_exists($oldKey, $this->movedKeys);
    }
    
    protected function shouldntTouch(string $key): bool
    {
        return in_array($key, $this->dontTouchKeys);
    }
}

What's going on here? First, we load the old and new config files (using require $file). This gives us the old and new configs, which we then compare to find changes:

  • In fetchAddedItems(), we loop over the items in the new config. Any item which doesn't exist in the old config is marked as "added".
  • Similarly, any items in the old config that aren't in the new config are marked as "removed".
  • Finally, the items in movedKeys are marked as "moved" (but only if they were in the old config).
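
Stripped of the bookkeeping, the top-level added/removed comparison amounts to two key differences. This is a simplified sketch using PHP's built-in array_diff_key() on cut-down versions of the configs above (the values are placeholders); it deliberately ignores move() and dontTouch().

```php
<?php
$old = [
    'routes' => [],
    'interactive' => false,
    'example_languages' => ['bash', 'php', 'javascript'],
    'continue_without_database_transactions' => [],
];
$new = [
    'routes' => [],
    'try_it_out' => ['enabled' => true, 'base_url' => null],
    'example_languages' => ['bash', 'javascript'],
    'database_connections_to_transact' => [],
];

// Keys present in the new sample but not in the user's config
$added = array_keys(array_diff_key($new, $old));

// Keys present in the user's config but not in the new sample
$removed = array_keys(array_diff_key($old, $new));

print_r($added);
print_r($removed);
```

Without the move() information, interactive shows up as plain "removed", which is why the upgrader checks movedKeys before falling back to the removed case.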

This is pretty simple, and it works. When I run dryRun(), and print out the changes, I get:

- `try_it_out` will be added.
- `database_connections_to_transact` will be added.
- `interactive` will be moved to `try_it_out.enabled`.
- `continue_without_database_transactions` will be removed.

Handling nested items

The first attempt at dry run works, but it isn't done. Our current setup only loops over the top-level items in the config. If, in v4, I add a key called value inside try_it_out, the upgrader won't detect it as an added item.

This means we have to do a bit of recursion: if the config item is an array with string keys (i.e. not a list of items), we'll loop over its keys and check for changes in the same manner (added/removed). This means our methods to fetch changes now look like this:

    protected function fetchAddedItems(array $userCurrentConfig, array $incomingConfig, string $rootKey = '')
    {
        foreach ($incomingConfig as $key => $value) {
            $fullKey = $this->getFullKey($key, $rootKey);
            if ($this->shouldntTouch($fullKey)) {
                continue;
            }

            // Key is in new, but not in old
            if (!array_key_exists($key, $userCurrentConfig)) {
                $this->changes[] = [
                    'type' => self::CHANGE_ADDED,
                    'key' => $fullKey,
                    'description' => "- `{$fullKey}` will be added.",
                    'value' => $value,
                ];
            } else {
                if (is_array($value)) {
                    // Key is in both old and new; recurse into array and compare the inner items
                    $this->fetchAddedItems($userCurrentConfig[$key] ?? [], $value, $fullKey);
                }
            }

        }
    }

    protected function fetchRemovedAndMovedItems(array $userCurrentConfig, $incomingConfig, string $rootKey = '')
    {
        // Loop over the old config
        foreach ($userCurrentConfig as $key => $value) {
            $fullKey = $this->getFullKey($key, $rootKey);

            // Key is in old, but was moved somewhere else in new
            if ($this->wasKeyMoved($fullKey)) {
                $this->changes[] = [
                    'type' => self::CHANGE_MOVED,
                    'key' => $fullKey,
                    'new_key' => $this->movedKeys[$fullKey],
                    'description' => "- `$fullKey` will be moved to `{$this->movedKeys[$fullKey]}`.",
                ];
                continue;
            }

            // Key is in old, but not in new
            if (!array_key_exists($key, $incomingConfig)) {
                $this->changes[] = [
                    'type' => self::CHANGE_REMOVED,
                    'key' => $fullKey,
                    'description' => "- `$fullKey` will be removed.",
                ];
                continue;
            }

            if (!$this->shouldntTouch($fullKey) && is_array($value)) {
                // Key is in both old and new; recurse into array and compare the inner items
                // Important: we called `shouldntTouch` first to be sure it isn't user-specified values
                $this->fetchRemovedAndMovedItems($value, $incomingConfig[$key] ?? [], $fullKey);
            }
        }
    }

    private function getFullKey(string $key, string $rootKey = ''): string
    {
        if (empty($rootKey)) {
            return $key;
        }

        return "$rootKey.$key";
    }

The logic of the recursion here is simple: if an item is an array, then enter into it and loop over its keys, and do the usual old vs new checks. We've also introduced the concept of the $fullKey, which is the...well, full key to the config item, using dot notation. For instance, when we enter the try_it_out array, the full keys are try_it_out.enabled, and try_it_out.base_url, rather than just enabled and base_url. This allows us to easily check whether an item was marked as "moved".

Also, note the if (!$this->shouldntTouch($fullKey)) condition we added: the goal of this is to avoid mistakenly modifying user-specified data (which we specified in ->dontTouch() earlier). For instance, if there's a config item, custom_things, where the user can put an array with keys and values, we don't want to loop over those keys and compare with our sample, because they definitely won't be in our sample. We just need to preserve the user's entered values and carry on.

Lists vs maps

We've improved our tool, but now it also picks up false positives! If I run it now, I get:

- `try_it_out` will be added.
- `database_connections_to_transact` will be added.
- `interactive` will be moved to `try_it_out.enabled`.
- `continue_without_database_transactions` will be removed.
- `example_languages.2` will be removed.
- `strategies.bodyParameters.2` will be added.

You'll notice that the upgrade tool now recognises the extra item that was added to the strategies.bodyParameters array, but wrongly: it simply looped over the numeric keys in the sample array (0, 1, 2) and compared with the keys in the old one (0, 1), and realized "okay, they added a new key", when we actually added a new value to the list.

For the same reason, it doesn't catch the new item introduced in strategies.queryParameters: the user's old config has keys 0, 1, 2 (because the user added an extra item, AddPaginationParameters) and the new sample has the same keys. And similarly, because it sees keys 0, 1, 2 in the user's example_languages, but only sees 0 and 1 in the v3 sample, it assumes the third item has been removed. This is wrong.

This is a side effect of PHP arrays serving as both maps (associative arrays) and lists. But we need to treat them differently. Here's what we really want: for numeric lists, we want to know, "What values are being added to this list?", not "What keys are being added?" This means we must adjust our code to handle lists correctly. A basic way of doing this is to use array_diff to compare the two arrays. This works, but has a limitation we'll see later.

class Upgrader
{
    // We add a new type of change
    public const CHANGE_LIST_ITEM_ADDED = 'added_to_list';
    
    protected function fetchAddedItems(array $userCurrentConfig, array $incomingConfig, string $rootKey = '')
    {
        if (is_array($incomingConfig)) {
            $arrayKeys = array_keys($incomingConfig);
            if (($arrayKeys[0] ?? null) === 0) {
                // We're dealing with a list of items (numeric array)
                $diff = array_diff($incomingConfig, $userCurrentConfig);
                if (!empty($diff)) {
                    foreach ($diff as $item) {
                        $this->changes[] = [
                            'type' => self::CHANGE_LIST_ITEM_ADDED,
                            'key' => $rootKey,
                            'value' => $item,
                            'description' => "- '$item' will be added to `$rootKey`.",
                        ];
                    }
                }
                return;
            }
        }

        foreach ($incomingConfig as $key => $value) {
            // Loop over the array keys as normal, if it's not a list
        }
    }

    
    protected function fetchRemovedAndMovedItems(array $userCurrentConfig, $incomingConfig, string $rootKey = '')
    {
        if (is_array($incomingConfig)) {
            $arrayKeys = array_keys($incomingConfig);
            if (($arrayKeys[0] ?? null) === 0) {
                // A list of items (numeric array). We only add new items, not remove.
                return;
            }
        }

        foreach ($userCurrentConfig as $key => $value) {
            // Loop over the array keys as normal, if it's not a list
        }
    }

}

Note that I'm checking if the first key in the array is 0 as a way to guess that it's a list. PHP 8.1 introduces the array_is_list function for this instead.
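
If you're not on PHP 8.1 yet, a stricter check than "is the first key 0?" is easy to sketch. The isList() helper below is hypothetical (it's not in the upgrader) and is meant to agree with array_is_list(). The second half shows, with shortened class names, why lists need a value comparison rather than a key comparison.

```php
<?php
/**
 * True when the keys are exactly 0, 1, ..., n-1 in that order.
 * Intended to agree with PHP 8.1's array_is_list(); usable on older versions.
 */
function isList(array $array): bool
{
    return $array === [] || array_keys($array) === range(0, count($array) - 1);
}

// Shortened class names, for readability
$userList = ['AddPaginationParameters', 'GetFromFormRequest', 'GetFromQueryParamTag'];
$sampleList = ['GetFromFormRequest', 'GetFromInlineValidator', 'GetFromQueryParamTag'];
$notAList = [0 => 'bash', 'label' => 'Shell']; // First key is 0, but it's a map

var_dump(isList($sampleList)); // bool(true)
var_dump(isList($notAList));   // bool(false)

// Comparing keys finds nothing, since both arrays have keys 0, 1, 2
$addedKeys = array_diff_key($sampleList, $userList);

// Comparing values finds the new strategy
$addedValues = array_values(array_diff($sampleList, $userList));

print_r($addedKeys);
print_r($addedValues);
```

Note that the first-key heuristic would treat $notAList as a list, so the stricter check is worth having if your config mixes numeric and string keys.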

And now, when I run it:

- `try_it_out` will be added.
- 'Knuckles\Scribe\Extracting\Strategies\QueryParameters\GetFromInlineValidator' will be added to `strategies.queryParameters`.
- 'Knuckles\Scribe\Extracting\Strategies\BodyParameters\GetFromInlineValidator' will be added to `strategies.bodyParameters`.
- `database_connections_to_transact` will be added.
- `interactive` will be moved to `try_it_out.enabled`.
- `continue_without_database_transactions` will be removed.

Voila! All the changes shown are the actual changes.

Interlude: tests

Alright, we've got a good implementation of dry run. We've probably missed some things, but we'll come back later. Before we implement the actual upgrade logic, let's take a moment to add some tests.

Now, testing is a pretty opinionated subject. There are different philosophies, styles, and schools of thought that all swear they're the right one. So the tests I'll describe here may not be what you expect or approve of. Here's a quick overview of my testing decisions for this project:

  1. Feature tests over unit tests.
  2. I'll be testing that dryRun() can detect the different kinds of changes, and that its behaviour is correctly modified by dontTouch() and move().
  3. My tests will check that the return value of dryRun() matches what's expected, ignoring the description field, because that's a text string that I might change in the future.

I'm using PestPHP (which is built on PHPUnit). Let's add a first test that we can detect added/removed changes:

it("can detect items added to/removed from maps", function () {
    $userOldConfig = [
        'map' => ['removed' => 1],
        'nested' => ['map' => ['baa' => 'baa', 'black' => 'sheep', 'and' => 'more']]
    ];
    $sampleNewConfig = [
        'map' => ['added' => 1],
        'nested' => ['map' => ['baa' => 'baa', 'black' => 'sheep']]
    ];
    $mockUpgrader = mockUpgraderWithConfigs($userOldConfig, $sampleNewConfig);

    $changes = $mockUpgrader->dryRun();

    expect($changes)->toHaveCount(3);
    $this->assertArraySubset([
        [
            'type' => Upgrader::CHANGE_ADDED,
            'key' => 'map.added',
        ],
        [
            'type' => Upgrader::CHANGE_REMOVED,
            'key' => 'map.removed',
        ],
        [
            'type' => Upgrader::CHANGE_REMOVED,
            'key' => 'nested.map.and',
        ],
    ], $changes);
});

Here's the mockUpgraderWithConfigs() function. It returns an Upgrader instance that I've customised with Mockery to override the loadConfigs() function so it returns the configs I want to test. This way, I don't have to create and parse multiple config files.

/**
 * @return \Mockery\Mock&Upgrader
 */
function mockUpgraderWithConfigs(array $userOldConfig = [], array $sampleNewConfig = [])
{
    $mockUpgrader = Mockery::mock(Upgrader::class)->makePartial();
    $mockUpgrader->shouldReceive('loadConfigs')
        ->andReturn([
            $userOldConfig,
            $sampleNewConfig,
        ]);
    return $mockUpgrader;
}

Let's add another test; this one for detecting items added to lists:

it("can detect items added to lists", function () {
    $mockUpgrader = mockUpgraderWithConfigs(
        ['list' => ['baa', 'baa', 'black'], 'nested' => ['list' => []]],
        ['list' => ['baa', 'baa', 'black', 'sheep'], 'nested' => ['list' => ['item']]]
    );
    $changes = $mockUpgrader->dryRun();

    expect($changes)->toHaveCount(2);
    $this->assertArraySubset([
        [
            'type' => Upgrader::CHANGE_LIST_ITEM_ADDED,
            'key' => 'list',
            'value' => 'sheep',
        ],
        [
            'type' => Upgrader::CHANGE_LIST_ITEM_ADDED,
            'key' => 'nested.list',
            'value' => 'item',
        ],
    ], $changes);
});

Another test, this one to verify that dontTouch() works as expected:

it("ignores items marked as dontTouch()", function () {
    $userOldConfig = [
        'specified_by_user' => ['baa', 'baa', 'black'],
        'map' => ['user_specified' => 'some_value'],
    ];
    $sampleNewConfig = [
        'specified_by_user' => ['baa', 'baa', 'black', 'sheep'],
        'map' => ['just_a_sample' => 'a_value'],
    ];
    $changes = mockUpgraderWithConfigs($userOldConfig, $sampleNewConfig)
        ->dryRun();

    expect($changes)->toHaveCount(3);
    $this->assertArraySubset([
        [
            'type' => Upgrader::CHANGE_LIST_ITEM_ADDED,
            'key' => 'specified_by_user',
            'value' => 'sheep',
        ],
        [
            'type' => Upgrader::CHANGE_ADDED,
            'key' => 'map.just_a_sample',
            'value' => 'a_value',
        ],
        [
            'type' => Upgrader::CHANGE_REMOVED,
            'key' => 'map.user_specified',
        ],
    ], $changes);

    $changes = mockUpgraderWithConfigs($userOldConfig, $sampleNewConfig)
        ->dontTouch('specified_by_user', 'map')->dryRun();

    expect($changes)->toBeEmpty();
});

My final test for now verifies that move() works: if the old item was present in the user's config, the "moved" change should be reported; otherwise, it shouldn't.

it("reports items marked with move(), only if present in old config", function () {
    $userOldConfig = [
        'old' => [
            'item' => 'value',
            'other_item' => 'other_value'
        ]
    ];
    $sampleNewConfig = [
        'old' => [
            'other_item' => 'other_value'
        ],
        'new_item' => 'default'
    ];
    $changes = mockUpgraderWithConfigs($userOldConfig, $sampleNewConfig)
        ->move('old.item', 'new_item')->dryRun();
    expect($changes)->toHaveCount(2);
    $this->assertArraySubset([
        [
            'type' => Upgrader::CHANGE_ADDED,
            'key' => 'new_item',
        ],
        [
            'type' => Upgrader::CHANGE_MOVED,
            'key' => 'old.item',
            'new_key' => 'new_item',
        ],
    ], $changes);

    $userOldConfig = [
        'old' => [
            'other_item' => 'other_value'
        ]
    ];
    $changes = mockUpgraderWithConfigs($userOldConfig, $sampleNewConfig)
        ->move('old.item', 'new_item')->dryRun();
    $this->assertArraySubset([
        [
            'type' => Upgrader::CHANGE_ADDED,
            'key' => 'new_item',
        ],
    ], $changes);
});

The tests pass. Great! We've got some working tests that cover our dry run functionality.

Applying changes

Ah yes! We're finally here. We've figured out the changes we need, now let's apply them to the user's config file.

As usual, we'll start with a verrry basic version:

    protected function applyChanges()
    {
        $userConfig = $this->loadConfigs()[0];

        foreach ($this->changes as $change) {
            switch ($change['type']) {
                case self::CHANGE_ADDED:
                    data_set($userConfig, $change['key'], $change['value']);
                    break;
                case self::CHANGE_REMOVED:
                    $parts = explode('.', $change['key']);
                    $child = array_pop($parts);
                    // We're using references (&) so we operate on the actual array, not a copy
                    $parent = &$userConfig;
                    foreach ($parts as $part) {
                        $parent = &$parent[$part];
                    }
                    unset($parent[$child]);
                    break;
                case self::CHANGE_MOVED:
                    // Move old value into new key
                    data_set($userConfig, $change['new_key'], data_get($userConfig, $change['key']));
                    // Then delete old key
                    $parts = explode('.', $change['key']);
                    $child = array_pop($parts);
                    $parent = &$userConfig;
                    foreach ($parts as $part) {
                        $parent = &$parent[$part];
                    }
                    unset($parent[$child]);
                    break;
                case self::CHANGE_LIST_ITEM_ADDED:
                    // Get the old list and add the new item
                    $items = array_merge(data_get($userConfig, $change['key']), [$change['value']]);
                    // Then update the key with the updated list
                    data_set($userConfig, $change['key'], $items);
                    break;
            }
        }

        ray($userConfig);
    }

  • We're loading the user's old config, then looping over the changes we've computed and applying each one.
  • I'm using Laravel's data_get() and data_set() helpers (which you can get by installing illuminate/support) to make it easy to set/fetch items from arrays with dot notation. data_get($array, 'a.b') is the same as $array['a']['b'].
  • Finally, I dump the new config using Ray, so I can verify the results.
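
The reference-walking code that deletes a key appears twice above (for removed and for moved items), so it's a natural candidate for a small helper. Here's one way it might look; forgetByDotKey() is a hypothetical name, and unlike the inline version it bails out when part of the path is missing, because taking a reference to a missing array element creates that element as null.

```php
<?php
/**
 * Remove a nested item addressed by a dot-notation key.
 * Hypothetical helper; does nothing if any part of the path is missing.
 */
function forgetByDotKey(array &$config, string $dotKey): void
{
    $parts = explode('.', $dotKey);
    $child = array_pop($parts);

    // Walk down by reference so we modify the caller's array, not a copy
    $parent = &$config;
    foreach ($parts as $part) {
        if (!isset($parent[$part]) || !is_array($parent[$part])) {
            return; // Path doesn't exist; avoid creating it by accident
        }
        $parent = &$parent[$part];
    }
    unset($parent[$child]);
}

$config = [
    'old' => ['item' => 'value', 'other_item' => 'other_value'],
    'interactive' => false,
];

forgetByDotKey($config, 'old.item');    // Nested key
forgetByDotKey($config, 'interactive'); // Top-level key
forgetByDotKey($config, 'missing.key'); // No-op

print_r($config);
```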

So, I try this, and it works!

Well, sort of. It technically worked, but there's a big problem: we've evaluated everything. By loading the config files using require, we executed them and turned them into PHP values. This means that:

  1. Any comments that were in that file are gone.
  2. All the ::class constants were evaluated into the full class names. For instance, AddPaginationParameters::class now shows as the full, longer class name, App\Docs\Strategies\AddPaginationParameters.
  3. Function calls were executed. For instance, database_connections_to_transact shows ['sqlite'] instead of [config('database.default')].

These are bad, because:

  1. Comments in the config file serve as mini-docs for what a config option does. We should not discard those.
  2. ::class shortcuts and the full class names are technically the same, but full class names are longer and more annoying to scroll through.
  3. Function calls in the config should not be executed during the upgrade, but when the config is actually used. Our upgraded version has executed config('database.default') and hard-coded the result, sqlite, so if the user later changes their default database connection to mysql, the config will still point at sqlite. You want the function to run when the config is needed for the actual work, not at upgrade time.
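
You can reproduce all three losses in a standalone script. This is only a sketch: the config() function below is a stand-in for Laravel's helper so the example runs outside a Laravel app, and var_export() stands in for whatever you'd use to write the evaluated array back out.

```php
<?php
// Stand-in for Laravel's config() helper so this runs as a plain script (assumption)
function config(string $key): string
{
    return 'sqlite';
}

$source = <<<'CODE'
<?php
use App\Docs\Strategies\AddPaginationParameters;

return [
    // Strategies used to extract query parameters
    'queryParameters' => [AddPaginationParameters::class],
    'database_connections_to_transact' => [config('database.default')],
];
CODE;

// Write the sample config to a temporary file, then load it by executing it
$path = tempnam(sys_get_temp_dir(), 'cfg');
file_put_contents($path, $source);
$evaluated = require $path;
unlink($path);

// Turn the evaluated array back into PHP code
$rewritten = var_export($evaluated, true);
echo $rewritten, PHP_EOL;
```

The printed array contains the full class name and the string sqlite, with no trace of the comment, the ::class shortcut, or the config() call.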

How do we fix this? By going a little deeper. It's time to parse some code.

Working with ASTs

When transforming code files, you almost always want to parse the file, not load it. For instance, if ESLint executed your JavaScript files every time you ran a fix, you'd be in big trouble. Instead, the file is parsed into an abstract syntax tree (AST), which is then manipulated and written back to disk. This solves all three problems above: the parser captures the entire code, so the structure is preserved as you wrote it, including comments, syntax sugar, function calls, and maybe even whitespace.
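
You can get a feel for the difference between reading code and running it without installing anything. The sketch below uses PHP's built-in tokenizer, token_get_all(). To be clear, a token stream is not an AST and this is not what the upgrader uses; it only shows that when the source is read rather than executed, the comment survives and the function call stays an unevaluated name.

```php
<?php
$source = <<<'CODE'
<?php
return [
    // Which connections to wrap in a transaction
    'database_connections_to_transact' => [config('database.default')],
];
CODE;

$comments = [];
$identifiers = [];
foreach (token_get_all($source) as $token) {
    if (!is_array($token)) {
        continue; // Single-character tokens such as "[" or "," are plain strings
    }
    [$id, $text] = $token;
    if ($id === T_COMMENT) {
        $comments[] = trim($text);
    }
    if ($id === T_STRING) {
        $identifiers[] = $text; // Bare names, such as the function being called
    }
}

print_r($comments);
print_r($identifiers);
```

Nothing here calls config(), so no such function needs to exist for the script to run.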

If you're interested in seeing what code looks like as an AST, astexplorer.net is a useful way to easily do so. You can paste code in many different languages and see ASTs generated by different JS-based parsers. There's also phpast.com, which parses PHP using Nikita Popov's PHP-Parser (the library we'll use). Awesome!

Here's what the AST for the original v2 config file looks like on phpast.com:

You can see that the entire code structure is captured; from the use statements at the start to each array item.

Alright, now we need to rethink our approach. Wherever we were referencing a config item's evaluated value, we need to go back and switch to using the value of its AST node.

First, we'll modify our loadConfigs() method so it returns the parsed code instead (and we'll rename it to parseConfigFiles()).

class Upgrader
{
    /** @var PhpParser\Node\Stmt[] */
    protected ?array $userOldConfigFileAst = [];
    /** @var PhpParser\Node\Stmt[] */
    protected ?array $sampleNewConfigFileAst = [];
    
    public function parseConfigFiles(): array
    {
        $userCurrentConfig = $this->getUserOldConfigFileAsAst();
        $incomingConfig = $this->getSampleNewConfigFileAsAst();

        return [$userCurrentConfig, $incomingConfig];
    }
    
    protected function getUserOldConfigFileAsAst(): ?array
    {
        if (!empty($this->userOldConfigFileAst)) {
            return $this->userOldConfigFileAst;
        }

        $sourceCode = file_get_contents($this->configFiles['user_old']);
        $parser = (new ParserFactory)->create(ParserFactory::PREFER_PHP7);
        $this->userOldConfigFileAst = $parser->parse($sourceCode);
        return $this->userOldConfigFileAst;
    }

    protected function getSampleNewConfigFileAsAst(): ?array
    {
        if (!empty($this->sampleNewConfigFileAst)) {
            return $this->sampleNewConfigFileAst;
        }

        $sourceCode = file_get_contents($this->configFiles['sample_new']);
        $parser = (new ParserFactory)->create(ParserFactory::PREFER_PHP7);
        $this->sampleNewConfigFileAst = $parser->parse($sourceCode);
        return $this->sampleNewConfigFileAst;
    }
}

Now we have our two config files, this time represented as ASTs.

With the PHP-Parser library, each syntax tree is an array of Stmt objects. For instance, for the example config file at the start of this article, we will have three items in this array: the two "use" statements, and the return statement. The return statement in turn has a property called expr, which holds an Array_ object, the actual config array. This array node has an items property, which is a list of the items in the array (ArrayItem[]). Note that, even if the array is a key-value associative array, items is still a list; each ArrayItem object has key and value properties.

(Don't forget: if you get lost regarding the AST structure, you can copy the config file's contents and drop it into phpast.com.)
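
You can also explore the structure locally. The sketch below assumes nikic/php-parser 4.x is installed via Composer (the autoload path is an assumption, and the config string is a cut-down, made-up example). It walks the same path described above, from the statements to the return statement to the array items, and then prints a text dump of the tree with PHP-Parser's NodeDumper.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumed Composer autoloader location

use PhpParser\Node\Expr;
use PhpParser\Node\Stmt;
use PhpParser\NodeDumper;
use PhpParser\ParserFactory;

// A cut-down, made-up config file
$code = <<<'PHP'
<?php
use App\Docs\Strategies\AddPaginationParameters;

return [
    'interactive' => true,
    'strategies' => [
        'queryParameters' => [AddPaginationParameters::class],
    ],
];
PHP;

$ast = (new ParserFactory)->create(ParserFactory::PREFER_PHP7)->parse($code);

// Top level: one "use" statement and one return statement
foreach ($ast as $statement) {
    echo get_class($statement), "\n";
}

// Find the return statement and read the array items it holds
$return = null;
foreach ($ast as $statement) {
    if ($statement instanceof Stmt\Return_) {
        $return = $statement;
        break;
    }
}
$items = $return->expr->items;

foreach ($items as $item) {
    // $item->key is a string node; $item->value is whatever expression was written
    echo $item->key->value, ' => ', get_class($item->value), "\n";
}

// Text dump of the whole tree, similar to what phpast.com shows
echo (new NodeDumper)->dump($ast), "\n";
```

Note that the items of the inner queryParameters list have a null key, which is the detail the list check later in this article relies on.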

So we update our fetchChanges() method to locate the return statement and get the items in the array:

    protected function fetchChanges(): void
    {
        [$userCurrentConfigFile, $sampleNewConfigFile] = $this->parseConfigFiles();

        $userCurrentConfigArray = Arr::first(
            $userCurrentConfigFile, fn(Node $node) => $node instanceof Stmt\Return_
        )->expr->items;
        $sampleNewConfigArray = Arr::first(
            $sampleNewConfigFile, fn(Node $node) => $node instanceof Stmt\Return_
        )->expr->items;
        $this->fetchAddedItems($userCurrentConfigArray, $sampleNewConfigArray);
        $this->fetchRemovedAndMovedItems($userCurrentConfigArray, $sampleNewConfigArray);
    }

Right. So now, we've got arrays of AST nodes. Now we need to update our change calculation methods so they work with these.

Hold on to your butts; things are about to get hairy!

Operating on AST nodes

Let's start with fetchAddedItems().

    protected function fetchAddedItems(
        array $userCurrentConfig, array $incomingConfig, string $rootKey = ''
    )
    {
        if ($this->arrayIsList($incomingConfig)) {
            // We're dealing with a list of items (numeric array)
            $diff = $this->subtractOtherListFromList($incomingConfig, $userCurrentConfig);
            foreach ($diff as $item) {
                $this->changes[] = [
                    'type' => self::CHANGE_LIST_ITEM_ADDED,
                    'key' => $rootKey,
                    'value' => $item['ast']->value,
                    'description' => "- '{$item['text']}' will be added to `$rootKey`.",
                ];
            }
            return;
        }

        foreach ($incomingConfig as $arrayItem) {
            $key = $arrayItem->key->value;
            $value = $arrayItem->value;

            $fullKey = $this->getFullKey($key, $rootKey);
            if ($this->shouldntTouch($fullKey)) {
                continue;
            }

            // Key is in new, but not in old
            if (!$this->hasItem($userCurrentConfig, $key)) {
                $this->changes[] = [
                    'type' => self::CHANGE_ADDED,
                    'key' => $fullKey,
                    'description' => "- `{$fullKey}` will be added.",
                    'value' => $value,
                ];
            } else {
                if ($this->expressionNodeIsArray($value)) {
                    // Key is in both old and new; recurse into array and compare the inner items
                    // (fall back to an empty array if the user's value isn't an array)
                    $this->fetchAddedItems(
                        $this->getItem($userCurrentConfig, $key)->value->items ?? [], $value->items, $fullKey
                    );
                }
            }

        }
    }

The code is mostly the same as before, but we've made some replacements:

  • array_diff() -> $this->subtractOtherListFromList()
  • is_array() -> $this->expressionNodeIsArray()
  • array_key_exists() -> $this->hasItem()
  • $userCurrentConfig[$key] -> $this->getItem($userCurrentConfig, $key)
  • array_is_list() -> $this->arrayIsList()

Why? The old methods expected regular PHP arrays of keys and values. But now what we have is an array of AST nodes (Expr objects), so we have to reimplement the functionality to work for that. Luckily, it's not too difficult:

    // Replaces array_is_list()
    protected function arrayIsList(array $arrayItems): bool
    {
        // If the array is a list, like ['a', 'b', 'c'], the items will have `key`s as null
        return isset($arrayItems[0]) && $arrayItems[0]->key === null;
    }
    
    // Replaces $array[$key]
    protected function getItem(array $arrayItems, string $key)
    {
        // Get the first item in the array with key matching the requested key
        return Arr::first(
            $arrayItems, fn(Expr\ArrayItem $node) => $node->key->value === $key
        );
    }

    // Replaces array_key_exists()
    protected function hasItem(array $arrayItems, string $key): bool
    {
        // Return true if the item exists; false otherwise
        return boolval($this->getItem($arrayItems, $key));
    }

    // Replaces is_array()
    protected function expressionNodeIsArray(Expr $expressionNode): bool
    {
        return $expressionNode instanceof Expr\Array_;
    }

Not bad! The last one (reimplementing array_diff) is a bit more difficult though. Comparing AST nodes is quite involved. They're objects that also carry position details like line numbers, so a plain == won't tell us whether two nodes represent the same code. For now, we'll go with converting them to text and comparing the text versions. We can use PHP-Parser's pretty-printing functionality for that.

    /**
     * Get values in $list that are not in $otherList.
     * Replaces array_diff($list, $otherList)
     */
    protected function subtractOtherListFromList(array $list, array $otherList): array
    {
        $diff = [];
        $otherListWithItemsAsText = array_map(
            fn($item) => $this->convertAstExpressionToText($item), $otherList
        );
        foreach ($list as $item) {
            $itemAsText = $this->convertAstExpressionToText($item);
            if (!in_array($itemAsText, $otherListWithItemsAsText)) {
                $diff[] = ['ast' => $item, 'text' => $itemAsText];
            }
        }

        return $diff;
    }

    protected function convertAstExpressionToText($expression): string
    {
        $prettyPrinter = new PhpParser\PrettyPrinter\Standard;
        return $prettyPrinter->prettyPrintExpr($expression);
    }

It's not the most foolproof approach, but we'll go with it for now. An important detail here is that, in our $diff, we return both the AST version of the added item, and the text version. We'll print the text version for the user, but use the AST version for actual manipulation.
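
To see why a plain == falls short here: on two objects, == compares every property, and parsed nodes hold position data (such as line numbers) next to the code they represent. The sketch below uses a made-up FakeNode class as a stand-in for a parser node, so it runs without PHP-Parser; the class and its toText() method are illustrative only.

```php
<?php
// Stand-in for an AST node: a value plus position metadata (hypothetical class)
final class FakeNode
{
    public $value;
    public $attributes;

    public function __construct(string $value, array $attributes)
    {
        $this->value = $value;
        $this->attributes = $attributes;
    }

    // Text form that ignores position, in the spirit of pretty-printing
    public function toText(): string
    {
        return var_export($this->value, true);
    }
}

// The same item, written on different lines of two files
$inOldFile = new FakeNode('GetFromInlineValidator', ['startLine' => 12]);
$inNewFile = new FakeNode('GetFromInlineValidator', ['startLine' => 40]);

var_dump($inOldFile == $inNewFile);                      // false: attributes differ
var_dump($inOldFile->toText() === $inNewFile->toText()); // true: same code
```

Comparing the text versions sidesteps the position data, at the cost of treating two differently written but equivalent expressions as different.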

With that out of the way, our fetchRemovedAndMovedItems() is pretty easy; just take the old version and make the same replacements:

    protected function fetchRemovedAndMovedItems(
        array $userCurrentConfig, $incomingConfig, string $rootKey = ''
    )
    {
        if ($this->arrayIsList($incomingConfig)) {
            // A list of items (numeric array). We only add, not remove.
            return;
        }

        // Loop over the old config
        foreach ($userCurrentConfig as $arrayItem) {
            $key = $arrayItem->key->value;
            $value = $arrayItem->value;

            $fullKey = $this->getFullKey($key, $rootKey);

            // Key is in old, but was moved somewhere else in new
            if ($this->wasKeyMoved($fullKey)) {
                $this->changes[] = [
                    'type' => self::CHANGE_MOVED,
                    'key' => $fullKey,
                    'new_key' => $this->movedKeys[$fullKey],
                    'description' => "- `$fullKey` will be moved to `{$this->movedKeys[$fullKey]}`.",
                    'new_value' => $value,
                ];
                continue;
            }

            // Key is in old, but not in new
            if (!$this->hasItem($incomingConfig, $key)) {
                $this->changes[] = [
                    'type' => self::CHANGE_REMOVED,
                    'key' => $fullKey,
                    'description' => "- `$fullKey` will be removed.",
                ];
                continue;
            }

            if (!$this->shouldntTouch($fullKey) && $this->expressionNodeIsArray($value)) {
                // Key is in both old and new; recurse into array and compare the inner items
                // (the second argument must come from the new config, not the old one)
                $this->fetchRemovedAndMovedItems(
                    $value->items, $this->getItem($incomingConfig, $key)->value->items ?? [], $fullKey
                );
            }
        }
    }

Done! And now, when I run dryRun() again:

- `try_it_out` will be added.
- 'Strategies\QueryParameters\GetFromInlineValidator::class' will be added to `strategies.queryParameters`.
- 'Strategies\BodyParameters\GetFromInlineValidator::class' will be added to `strategies.bodyParameters`.
- `database_connections_to_transact` will be added.
- `interactive` will be moved to `try_it_out.enabled`.
- `continue_without_database_transactions` will be removed.

Yes, it works! Also, you can now see that the class constants are printed as they are in the sample config, rather than executing the ::class.

Manipulating the AST

We've got dry run updated; time to redo the actual upgrade. Here we go! Here's the new version of our applyChanges() function:

    protected function applyChanges(): array
    {
        $userConfigAst = $this->getUserOldConfigFileAsAst();
        $configArray =& Arr::first(
            $userConfigAst, fn(Node $node) => $node instanceof Stmt\Return_
        )->expr->items;

        foreach ($this->changes as $change) {
            switch ($change['type']) {
                case self::CHANGE_ADDED:
                    $this->addKey($configArray, $change['key'], $change['value']);
                    break;
                case self::CHANGE_REMOVED:
                    $this->deleteKey($configArray, $change['key']);
                    break;
                case self::CHANGE_MOVED:
                    // Move old value to new key
                    $this->setValue($configArray, $change['new_key'], $change['new_value']);
                    // Then delete old key
                    $this->deleteKey($configArray, $change['key']);
                    break;
                case self::CHANGE_LIST_ITEM_ADDED:
                    $this->pushItemOntoList($configArray, $change['key'], $change['value']);
                    break;
            }
        }

        return $userConfigAst;
    }

Okay, not bad. Like with fetchChanges(), we've replaced our uses of data_set() and unset() with custom methods, which we'll implement in a bit. Also, note the reference assignment (=&). We'll use that quite often to ensure that we're operating on the same array, rather than a copy. If we operate on a copy, our changes won't show up in the $userConfigAst.
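
If the difference between = and =& is fuzzy, this tiny sketch shows it in isolation. It uses a plain stdClass object with an items array as a stand-in for an AST node; nothing here is specific to PHP-Parser.

```php
<?php
$node = new stdClass;
$node->items = ['a'];

// Plain assignment copies the array, so the node never sees the change
$copy = $node->items;
$copy[] = 'b';
var_dump($node->items === ['a']);      // true

// Reference assignment makes both names point at the same array
$sameArray =& $node->items;
$sameArray[] = 'c';
var_dump($node->items === ['a', 'c']); // true
```

Objects behave differently: a variable holding an object is already a handle to it, which is why changing a property on an ArrayItem works without any &. It's only the arrays of items that need the reference treatment.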

Now let's implement the custom methods to manipulate the AST. Let's start with pushing an item onto a list.

    protected function pushItemOntoList(array &$arrayItems, string $listKey, $newValue)
    {
        $keySegments = explode('.', $listKey);
        $searchArray =& $arrayItems;
        while (count($keySegments)) {
            $nextKeySegment = array_shift($keySegments);
            foreach ($searchArray as $item) {
                if ($item->key->value === $nextKeySegment) {
                    $searchArray =& $item->value->items;
                    break;
                }
            }
        }

        $searchArray[] = new Expr\ArrayItem($newValue, null);
    }

This is actually simpler than it looks. Essentially, we're given the list of array items, and a key like strategies.queryParameters. We split the key into ['strategies', 'queryParameters']. Then we search through the items in the array for strategies. When we find that, we update $searchArray to point to that array, then search that array for queryParameters, and so on. By the end, we've found the array we want, so we wrap the new value in an ArrayItem node and push it onto the array. And we've used references (&) at the right places, so our changes are persisted to the actual array, not a copy.

How about adding a new key to an array (like adding try_it_out)?

    protected function addKey(array &$arrayItems, string $key, $newValue)
    {
        $keySegments = explode('.', $key);
        $childKey = array_pop($keySegments);
        $searchArray =& $arrayItems;
        while (count($keySegments)) {
            $nextKeySegment = array_shift($keySegments);
            foreach ($searchArray as $item) {
                if ($item->key->value === $nextKeySegment) {
                    $searchArray =& $item->value->items;
                    break;
                }
            }
        }
        
        $keyNode = (new PhpParser\BuilderFactory)->val($childKey);
        $searchArray[] = new Expr\ArrayItem($newValue, $keyNode);
    }

You can see that the looping process is identical. The major difference here is that we first remove the last part of the key, the $childKey. This means, if we're trying to add an a.b.c key, we only iterate over ['a', 'b']. At the end, we add the item to the array using 'c' as the key. The val() function is a neat way of wrapping the raw string value ('c') as an AST node (docs).

Since the looping process is the same for both methods, we can extract that into a separate method:

    protected function findInnerArrayByKey(array &$arrayItems, array $keySegments, callable $callback)
    {
        $searchArray =& $arrayItems;
        while (count($keySegments)) {
            $nextKeySegment = array_shift($keySegments);
            foreach ($searchArray as $item) {
                if ($item->key->value === $nextKeySegment) {
                    $searchArray =& $item->value->items;
                    break;
                }
            }
        }

        $callback($searchArray);
    }
    
    protected function addKey(array &$arrayItems, string $key, $newValue)
    {
        $keySegments = explode('.', $key);
        $childKey = array_pop($keySegments);
        $this->findInnerArrayByKey($arrayItems, $keySegments, function (array &$searchArray) use ($childKey, $newValue) {
            $keyNode = (new BuilderFactory)->val($childKey);
            $searchArray[] = new Expr\ArrayItem($newValue, $keyNode);
        });
    }

    protected function pushItemOntoList(array &$arrayItems, string $listKey, $newValue)
    {
        $keySegments = explode('.', $listKey);
        $this->findInnerArrayByKey($arrayItems, $keySegments, function (array &$list) use ($newValue) {
            $list[] = new Expr\ArrayItem($newValue, null);
        });
    }
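
One thing to be aware of: if a key segment isn't found, findInnerArrayByKey() simply stays on the current array and carries on, so the callback can end up running against the wrong array. Below is a sketch of a stricter variant that throws instead. To keep it runnable without PHP-Parser, it uses stdClass stand-ins that only mimic the ->key->value and ->value->items shape; fakeItem() and findInnerArrayStrict() are made-up names, not part of Scribe.

```php
<?php
// Stand-in for an ArrayItem node: only the ->key->value and ->value->items shape is mimicked
function fakeItem(string $key, array $children = []): stdClass
{
    return (object) [
        'key' => (object) ['value' => $key],
        'value' => (object) ['items' => $children],
    ];
}

function findInnerArrayStrict(array &$arrayItems, array $keySegments, callable $callback): void
{
    $searchArray =& $arrayItems;
    foreach ($keySegments as $segment) {
        $found = false;
        foreach ($searchArray as $item) {
            if ($item->key !== null && $item->key->value === $segment) {
                $searchArray =& $item->value->items;
                $found = true;
                break;
            }
        }
        if (!$found) {
            // Stop here rather than operating on whichever array we happen to be in
            throw new OutOfBoundsException("No array item with key '$segment'");
        }
    }

    $callback($searchArray);
}

$config = [fakeItem('strategies', [fakeItem('queryParameters')])];

// Happy path: the item lands in strategies.queryParameters
findInnerArrayStrict($config, ['strategies', 'queryParameters'], function (array &$list) {
    $list[] = 'new item';
});
$queryParameters = $config[0]->value->items[0]->value->items;

// Missing segment: we get an exception instead of a silent misplacement
$error = null;
try {
    findInnerArrayStrict($config, ['strategies', 'headers'], function (array &$list) {
        $list[] = 'never reached';
    });
} catch (OutOfBoundsException $e) {
    $error = $e->getMessage();
}
```

Whether to throw, skip the change, or create the missing parent is a design choice; throwing is just the easiest one to show.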

And then we can implement the remaining two methods:

    protected function setValue(array &$arrayItems, string $key, $newValue)
    {
        $keySegments = explode('.', $key);
        $childKey = array_pop($keySegments);
        // Find the inner array, then find the item we're updating and update its value
        $this->findInnerArrayByKey($arrayItems, $keySegments, function (array $searchArray) use ($childKey, $newValue) {
            $item = Arr::first(
                $searchArray, fn(Expr\ArrayItem $node) => $node->key->value === $childKey
            );
            $item->value = $newValue;
        });
    }

    protected function deleteKey(array &$arrayItems, string $key)
    {
        $keySegments = explode('.', $key);
        $childKey = array_pop($keySegments);
        // Find the inner array, then find the item we're deleting and remove it
        $this->findInnerArrayByKey($arrayItems, $keySegments, function (array &$searchArray) use ($childKey) {
            foreach ($searchArray as $index => $item) {
                if ($item->key->value === $childKey) {
                    unset($searchArray[$index]);
                }
            }
        });
    }
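
A small PHP detail about the unset() in deleteKey(): it removes the element but doesn't renumber the rest, so the array can end up with no index 0. That's harmless for printing, but it matters for any later check that looks at the first index, the way arrayIsList() does with isset($arrayItems[0]). This sketch shows the effect on a plain array; reindexing with array_values() is one possible remedy, not something the code above does.

```php
<?php
$items = ['first', 'second', 'third'];

unset($items[0]);
var_dump(array_keys($items));  // keys are now 1 and 2
var_dump(isset($items[0]));    // false, although the array is not empty

$items = array_values($items); // reindex from 0
var_dump(isset($items[0]));    // true
```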

Yes!! We've done it.🎉🎉 I can't show the fully expanded AST, but here's an overview of what it is when I dump it.

Dumped modified AST

Awesome. One more thing: write the config to the file.

Interlude: A note on fallibility

An important part of building "smart" software is acknowledging its fallibility: it will get things wrong. It should be easy for users to fix such things. This principle led me to some design decisions (which I'd learnt from others):

  • There must be a --dry-run option
  • We'll print a warning for any changes we aren't 100% sure about (like database_connections_to_transact)
  • Before we overwrite the config file, we save the old one to a backup, and we communicate this with the user

Software can (and will) make mistakes, and users should not be punished for that.

The final task: writing

This is actually the simplest part (sort of).😅 We've already seen how we can use PHP-Parser's "pretty print" to convert AST nodes to text, so we go from there:

    // Print out the changes into the user's config file (saving the old one as a backup)
    protected function writeNewConfigFile(array $ast)
    {
        $prettyPrinter = new PrettyPrinter\Standard(['shortArraySyntax' => true]);
        $astAsText = $prettyPrinter->prettyPrintFile($ast);

        $userConfigFile = $this->configFiles['user_old'];
        rename($userConfigFile, "$userConfigFile.bak");
        file_put_contents($userConfigFile, $astAsText);
    }

    public function upgrade()
    {
        $this->fetchChanges();
        $upgradedConfig = $this->applyChanges();
        $this->writeNewConfigFile($upgradedConfig);
    }
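
In the spirit of the fallibility note above, here's a slightly more defensive sketch of the backup-then-write step. It isn't Scribe's code: writeWithBackup() is a hypothetical helper that uses only built-in file functions, copies the file rather than renaming it (so the original stays in place until the write happens), and stops if the backup can't be made.

```php
<?php
/**
 * Hypothetical helper: back up $path, then replace its contents.
 * Throws instead of continuing if the backup could not be made.
 */
function writeWithBackup(string $path, string $newContents): string
{
    $backupPath = "$path.bak";
    if (!copy($path, $backupPath)) {
        throw new RuntimeException("Could not back up $path; leaving it untouched.");
    }
    if (file_put_contents($path, $newContents) === false) {
        throw new RuntimeException("Could not write $path; the original is saved at $backupPath.");
    }

    return $backupPath;
}

// Usage with a throwaway file
$path = tempnam(sys_get_temp_dir(), 'cfg');
file_put_contents($path, "<?php return ['interactive' => true];");

$backupPath = writeWithBackup($path, "<?php return ['try_it_out' => ['enabled' => true]];");
echo "Old config saved to $backupPath\n";
```

Returning the backup path makes it easy to tell the user exactly where their old config went.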

Here's the printed config file (with no modifications or custom formatting):

You can see the removed keys are gone, database_connections_to_transact is added without being executed, and try_it_out.enabled is set to the old value of interactive. Yay! (The new items were also added to strategies.queryParameters and strategies.bodyParameters, but it's a bit difficult to see them, since PHP-Parser prints the entire array on one line.)

And beyond...

We've achieved the goal we came for, so we'll stop here for now. But there's still more to be done! Our current solution has quite a few limitations I didn't touch on:

  • We need to update our tests to use the new AST approach, as well as add tests for the actual upgrade. Many more tests.
  • The printed file isn't ideal (I don't like the one-line arrays, and some other formatting issues). We need to improve the printed output.
  • We didn't consider namespace resolution when comparing for equality. For example, the old config file might have A\B::class, while the new sample has use A\B; at the top and B::class later. With our current implementation, we'd pick up A\B::class and B::class as different items.
  • Our approach doesn't help with putting items in a specific order. For instance, maybe we wanted try_it_out to be second? Right now, all added items just go to the end of the array. There's an alternative approach that fixes this (start with the new sample and copy over the user's old values), but it's more demanding, and has its own challenges.
  • Thus far, we only attempted to upgrade the config file, but we could go further and use a similar AST-manipulation approach to upgrade regular code files that use our library's API. For instance, if a parameter's type was changed from array to object, we could go through the user's codebase and change any usages of that parameter to treat it as an object.

I hope to write another article to tackle these soon, but I've already addressed a few of them (like pretty-printing and namespace resolution) in the full codebase, available on GitHub.
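
As a taste of the namespace problem, here's one possible way to normalise names before comparing, using PHP-Parser's NameResolver visitor. This is a sketch, not necessarily how the full codebase does it. It assumes nikic/php-parser 4.x installed via Composer (the autoload path is an assumption), and firstItemAsText() is a made-up helper.

```php
<?php
require __DIR__ . '/vendor/autoload.php'; // assumed Composer autoloader location

use PhpParser\Node\Stmt;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitor\NameResolver;
use PhpParser\ParserFactory;
use PhpParser\PrettyPrinter\Standard;

// Parse a file, resolve names, and return the first item of the returned array as text
function firstItemAsText(string $code): string
{
    $ast = (new ParserFactory)->create(ParserFactory::PREFER_PHP7)->parse($code);

    $traverser = new NodeTraverser;
    $traverser->addVisitor(new NameResolver); // rewrites names to fully qualified ones
    $ast = $traverser->traverse($ast);

    foreach ($ast as $statement) {
        if ($statement instanceof Stmt\Return_) {
            return (new Standard)->prettyPrintExpr($statement->expr->items[0]->value);
        }
    }

    throw new RuntimeException('No return statement found');
}

// The same class, written two different ways
$old = firstItemAsText("<?php\nreturn [A\\B::class];");
$new = firstItemAsText("<?php\nuse A\\B;\nreturn [B::class];");

var_dump($old === $new); // both resolve to the same fully qualified name
```

The catch is that the resolved names are what would get printed back, so for the actual upgrade you'd likely want to resolve names on a copy used only for comparison.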

More resources

If you're interested in automated code transformations like this, here are some more tools you can check out:

  • Rector, for transforming PHP files
  • jscodeshift and recast, for transforming JS files
  • ts-morph, which uses TypeScript's compiler to manipulate TS/JS files
  • unified, a family of JS tools for parsing and manipulating code in different languages and even natural language

