Making changes to large databases in Symfony projects usually means writing a console command. Keeping that command free of "Out of memory" errors takes some work.

Let's have a look at some of the things we can do to keep memory usage under control.

Batching

Migration code needs to be performant while working with large amounts of data, and having both requires some non-trivial code. If you've been working on migrations for a while, you might have gone through both of these naive approaches (I sure have):

  • select all of the data at once and migrate it in one run: the network is only accessed once but this inevitably results in an "Out of memory" error because there is probably too much data for your server's RAM,
  • run the migration on each row, selecting rows one by one: memory usage is low because we only ever load one row into memory at a time, but the network is accessed as many times as there are rows, which is very slow.

The approach for low memory usage and high performance? Batching.

With batching, you select rows 100 by 100 (or 1000 by 1000, or whatever fits your use case) and run the migration on each batch. This results in predictable memory usage while reducing network access.

It might look something like this in PHP:

$batchSize = 100;

// We don't want to select the same batch multiple times. We could
// use an SQL "OFFSET" clause for this. But "OFFSET" tends to be slow
// on large tables. A faster approach is to select rows ordered by ID
// and to set a minimum ID with a "WHERE" clause. Something like:
//
//     SELECT * FROM user WHERE id > :minimumId ORDER BY id ASC
//
// This achieves the effect of an "OFFSET" but is significantly faster.
$minimumId = 0;

do {
    $batch = getBatchOfUsersOrderedById($batchSize, $minimumId);
    $batchCount = count($batch);

    if ($batchCount > 0) {
        // Remember the highest ID migrated so far so that the next batch
        // starts right after it
        $minimumId = $batch[$batchCount - 1]->getId();

        foreach ($batch as $user) {
            // Do something with each user
            // ...
        }
    }

// If the batch is smaller than the maximum batch size, it means we have
// gone over all of the users and we are done
} while ($batchCount === $batchSize);
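The getBatchOfUsersOrderedById() function is not shown above, and how you implement it depends on your data access layer. As a rough idea, here is a minimal sketch written as a Doctrine repository method (a hypothetical UserRepository, which is also how the helper is used in the full example at the end of this post):

// Hypothetical repository method; adjust the entity and field names
// to your own schema.
public function getBatchOfUsersOrderedById(int $batchSize, int $minimumId): array
{
    return $this->createQueryBuilder('u')
        // Mimic an OFFSET with a WHERE clause on the ID, as explained above
        ->where('u.id > :minimumId')
        ->setParameter('minimumId', $minimumId)
        ->orderBy('u.id', 'ASC')
        // Only ever fetch one batch worth of rows
        ->setMaxResults($batchSize)
        ->getQuery()
        ->getResult();
}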

Putting the main loop's body inside a separate method

PHP automatically frees the memory of variables that go out of scope, somewhat like what C does with its stack variables, if you're familiar with that. Basically, any variable created inside a function can be freed as soon as the function returns, as long as nothing else still holds a reference to it.

What you can do to allow PHP to clear more memory is to put the main loop's content inside a separate method. It looks something like this:

$batchSize = 100;
$minimumId = 0;

do {
    list($batchCount, $minimumId) = $this->runOneBatch($batchSize, $minimumId);
} while ($batchCount === $batchSize);

And in a separate method named runOneBatch:

function runOneBatch(int $batchSize, int $minimumId): array {
    $batch = getBatchOfUsersOrderedById($batchSize, $minimumId);
    $batchCount = count($batch);

    if ($batchCount > 0) {
        // Only advance the minimum ID when the batch actually contains rows
        $minimumId = $batch[$batchCount - 1]->getId();

        foreach ($batch as $user) {
            // Do something with each user
            // ...
        }
    }

    return [$batchCount, $minimumId];
}

Marking unused variables as collectable by garbage collection

In PHP, variables are said to be garbage collected, which means that you usually don't have to manage the memory they use manually. Instead, PHP handles memory management automatically.

This works well most of the time. But for long-running tasks, it has its limits. Because PHP does not always know that a variable has become useless, it keeps it in memory just in case. This increases memory usage.

To tell PHP that a variable has become useless, you can set it to null and unset() it:

$users = loadLotsOfUsersFromMysql();

// ... do some stuff with the users ...

$users = null;
unset($users);

Manual garbage collection cycle

PHP's garbage collector uses heuristics to decide when to remove unused variables from memory. In long-running tasks, you may want to force PHP to remove unused variables right away to keep memory usage low.

Running garbage collection manually in PHP goes like this:

gc_collect_cycles();
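If you want to check that forcing a collection actually frees anything, you can compare memory_get_usage() before and after the call. A quick sketch:

$before = memory_get_usage();

// gc_collect_cycles() returns the number of reference cycles it collected
$collected = gc_collect_cycles();

$after = memory_get_usage();

printf("Collected %d cycles, freed %d bytes\n", $collected, $before - $after);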

Clearing Doctrine's entity manager

Internally, Doctrine, the ORM used in Symfony projects, keeps a reference to each entity it has loaded into memory (in its identity map), unless you explicitly tell it to drop those references.

To keep memory usage down, you may want to clear the entity manager, which removes Doctrine's references to all the entities it currently manages. Flush any pending changes first: cleared entities become detached, and unflushed changes on them are lost:

/** @var \Doctrine\ORM\EntityManagerInterface $em */
$em = $this->getContainer()->get('doctrine')->getManager();
$em->clear();

SQL logging

There is one feature of Doctrine that is enabled by default in Symfony when debug mode is on and can be a source of memory leaks during long-lived commands: the SQL logger, which keeps every executed query in memory.

You can disable the SQL logger with something like this:

/** @var \Doctrine\DBAL\Connection $connection */
$connection = $this->getContainer()->get('doctrine')->getManager()->getConnection();
$connection->getConfiguration()->setSQLLogger(null);

Putting it all together

Here's an example of what a memory-efficient, long-running migration command might look like when you put all of the above tips together:

<?php

namespace App\Command;

use App\Entity\User;
use App\Repository\User as UserRepository;
use Doctrine\ORM\EntityManagerInterface;
use Symfony\Bundle\FrameworkBundle\Command\ContainerAwareCommand;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Style\SymfonyStyle;

class MigrateUsersToNewDatabaseFormatCommand extends ContainerAwareCommand
{
    protected function configure()
    {
        $this
            ->setName('app:migrate-users')
            ->setDescription('Migrate users to new database format')
        ;
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $style = new SymfonyStyle($input, $output);

        /** @var \Doctrine\ORM\EntityManagerInterface $em */
        $em = $this->getContainer()->get('doctrine')->getManager();

        /** @var \Doctrine\DBAL\Connection $connection */
        $connection = $em->getConnection();
        $connection->getConfiguration()->setSQLLogger(null);

        /** @var \App\Repository\User $users */
        $users = $em->getRepository(User::class);

        // There are a lot of rows to migrate, so we migrate them little
        // by little to use less RAM. You can tweak this depending on
        // what you find works best for your use case.
        $batchSize = 100;

        // The fastest way to mimic an OFFSET clause is to store the last
        // migrated User.id and to select rows above that ID instead of
        // using an actual OFFSET.
        $minimumId = 0;

        do {
            list($batchCount, $minimumId) = $this->runOneBatch($em, $users, $style, $batchSize, $minimumId);
        } while ($batchCount === $batchSize);

        return 0;
    }

    private function runOneBatch(EntityManagerInterface $em, UserRepository $users, SymfonyStyle $style, int $batchSize, int $minimumId): array
    {
        $batch = $users->getBatchOfUsersOrderedById($batchSize, $minimumId);
        $batchCount = count($batch);

        if ($batchCount > 0) {
            $minimumId = $batch[$batchCount - 1]->getId();

            /** @var User $user */
            foreach ($batch as $user) {
                try {
                    // Do some stuff with the current user, for instance
                    // set a newly added field to "false"
                    $user->setAccessRestricted(false);
                } catch (\Exception $e) {
                    // Handle the exception if needed
                    // ...

                    // Once the exception is handled, it is no longer needed
                    // so we can mark it as useless and garbage-collectable
                    $e = null;
                    unset($e);
                }

                $em->persist($user);

                // We are done updating this user, so we mark it as unused,
                // this way PHP can remove it from memory
                $user = null;
                unset($user);
            }

            // For each batch of users, we display the memory usage in MB so
            // that we can see if it grows during testing: if it does grow,
            // there is most likely a memory leak somewhere
            $memoryUsage = round(memory_get_usage(true) / 1024 / 1024, 1);
            $style->text("Batch finished with memory: {$memoryUsage} MB");

            // Once the batch of users is updated, we don't need it anymore
            // so we mark it as garbage-collectable
            $batch = null;
            unset($batch);

            // Flushing and then clearing Doctrine's entity manager allows
            // for more memory to be released by PHP
            $em->flush();
            $em->clear();

            // Just in case PHP would choose not to run garbage collection,
            // we run it manually at the end of each batch so that memory is
            // regularly released
            gc_collect_cycles();
        }

        return [$batchCount, $minimumId];
    }

}
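With the command registered, running the migration is then just a matter of calling it from the project root (assuming the standard bin/console entry point):

php bin/console app:migrate-users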