Optimizing OneToMany Doctrine collections

In this article, I’m going to show you a method of reducing the number of SQL queries down to a single one when fetching child entity collections of a hierarchical structure with PHP and Doctrine ORM. Check out the full example in the dedicated repository. The article is based on the StackOverflow thread.

Optimizing OneToMany Doctrine collections

Many of us have had a change to work with hierarchical structures. For instance, product categories. There are a few ways to store such kinds of structures in a database optimized for the particular use cases. On one of the projects which I was involved in, we used a combined method, adjacency list + materialized path. So keep in mind that the solution described in the article depends on this kind of relationship between entities.

Database

I suppose that any RDBMS supported by Doctrine DBAL / ORM should work well. We’re going to stick with PostgreSQL. Say, categories are stored this way:

create sequence public.category_id_seq
    increment by 1;

create table if not exists public.category
(
    id         bigint default nextval('category_id_seq'::regclass) not null primary key,
    parent_id  bigint constraint fk_category_id references public.category,
    uid        text not null,
    name       text not null,
    created_at timestamp(0) with time zone not null,
    updated_at timestamp(0) with time zone not null
);

create unique index uniq_category_uid
    on public.category (uid);

Hierarchical structure is implemented by the table’s self-reference using the parent_id foreign key. At the same time, we can store category full paths using the unique uid column where nesting levels are separated by the underscore. It means that foo_bar is a second level category.

Entity

Now let’s write a bare minimum of our entity:

<?php
declare(strict_types=1);

namespace Hierarchy;

use DateTimeImmutable;
use Doctrine\Common\Collections\ArrayCollection;
use Doctrine\Common\Collections\Collection;
use Doctrine\ORM\Mapping\Column;
use Doctrine\ORM\Mapping\Entity;
use Doctrine\ORM\Mapping\GeneratedValue;
use Doctrine\ORM\Mapping\Id;
use Doctrine\ORM\Mapping\JoinColumn;
use Doctrine\ORM\Mapping\ManyToOne;
use Doctrine\ORM\Mapping\OneToMany;
use Doctrine\ORM\Mapping\Table;

#[Entity, Table('category')]
class Category
{
    #[Column(name: 'parent_id', type: 'bigint', nullable: true)]
    private ?int $parentId;
    
    #[Column(name: 'uid', type: 'string', nullable: false)]
    private string $uid;

    #[OneToMany(mappedBy: 'parent', targetEntity: self::class)]
    private Collection $children;

    #[Column(name: 'created_at', type: 'datetimetz_immutable', nullable: false)]
    private DateTimeImmutable $createdAt;

    #[Column(name: 'updated_at', type: 'datetimetz_immutable', nullable: false)]
    private DateTimeImmutable $updatedAt;

    public function __construct(
        #[Id]
        #[Column(name: 'id', type: 'bigint', nullable: false)]
        #[GeneratedValue(strategy: 'NONE')]
        private int $id,
        string $uid,
        #[Column(name: 'name', type: 'string', nullable: false)]
        private string $name,
        #[ManyToOne(targetEntity: self::class, inversedBy: 'children')]
        #[JoinColumn(name: 'parent_id', referencedColumnName: 'id')]
        private ?self $parent = null
    ) {
        $this->parentId = $this->parent?->id;
        $this->uid = $this->parent === null ? $uid : sprintf('%s_%s', $this->parent->uid, $uid);
        $this->createdAt = new DateTimeImmutable();
        $this->updatedAt = new DateTimeImmutable();
        $this->children = new ArrayCollection();
    }

    public function id(): int
    {
        return $this->id;
    }

    public function uid(): string
    {
        return $this->uid;
    }

    public function name(): string
    {
        return $this->name;
    }

    public function parent(): ?self
    {
        return $this->parent;
    }

    public function parentId(): ?int
    {
        return $this->parentId;
    }
    
    public function children(): Collection
    {
        return new ArrayCollection($this->children->toArray());
    }
}

ManyToOne relations such as parent in our example are usually established by a single property. By default, Doctrine makes an extra query to fetch the related entity into memory on the first reference to the property. On the other hand, if fetch: EAGER is configured, the related entity would be fetched along with the main one. This would lead to cascade loading of a sub-tree in our case, because the Category entity is self-referenced. From my perspective, this side effect is quite undesirable due to high memory consumption for the data that may not be needed at all.

So when we don’t need the whole related entity but just its identifier (parent category ID in our case) we can use a separate parentId property. Initialized in the entity’s constructor, the actual value will always be at our hand. Then, using it over and over again, we won’t trigger any extra SQL queries against the database. This is going to be useful for the optimizations later.

Repository

Let’s define a repository interface which we will use in testing to switch implementations.

<?php
declare(strict_types=1);

namespace Hierarchy;

interface CategoryRepositoryInterface
{
    /**
     * Returns a new category ID.
     *
     * Factory depends on the implementation.
     */
    public function nextCategoryId(): int;

    /**
     * Returns descendants of all nesting levels by the given category UID.
     *
     * @return Category[]
     */
    public function findDescendantsByUid(string $uid): array;
}

These two methods are going to be enough for testing purposes. A default implementation could look like this:

<?php
declare(strict_types=1);

namespace Hierarchy;

use Doctrine\ORM\EntityManagerInterface;

final readonly class DefaultCategoryRepository implements CategoryRepositoryInterface
{
    public function __construct(private EntityManagerInterface $em)
    {
    }

    public function nextCategoryId(): int
    {
        return (int) $this->em
            ->getConnection()
            ->executeQuery("select nextval('category_id_seq')")
            ->fetchOne();
    }

    /**
     * @return Category[]
     */
    public function findDescendantsByUid(string $uid): array
    {
        return $this->em
            ->createQueryBuilder()
            ->select('c')
            ->from(Category::class, 'c')
            ->where('c.uid LIKE :uid')
            ->setParameter('uid', sprintf('%s_%%', $uid))
            ->orderBy('c.uid')
            ->getQuery()
            ->getResult();
    }
}

The nextCategoryId method returns the next value of the PostgreSQL sequence that we created in the very beginning. We’re going to use this method to pass the identifier to the Category entity’s constructor before persisting the data. The findDescendantsByUid method returns an array of descendants of all levels for the category with the given UID.

Testing

We’re going to use PHPUnit to check the number of SQL queries before and after the fix. The configuration is as simple as follows:

<?xml version="1.0" encoding="UTF-8"?>

<!-- https://phpunit.readthedocs.io/en/latest/configuration.html -->
<phpunit xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="vendor/phpunit/phpunit/phpunit.xsd"
         colors="true"
         bootstrap="vendor/autoload.php"
>
    <php>
        <ini name="display_errors" value="1" />
        <ini name="error_reporting" value="-1" />
        <server name="DB_HOST" value="postgres" />
        <server name="DB_PORT" value="5432" />
        <server name="DB_USER" value="user" />
        <server name="DB_PASSWORD" value="user" />
        <server name="DB_NAME" value="hierarchy_article_test" />
        <server name="DB_DRIVER" value="pdo_pgsql" />
        <server name="DOCTRINE_SRC_DIR" value="./src" />
        <server name="DOCTRINE_PROXY_DIR" value="./proxy" />
        <server name="DOCTRINE_PROXY_NAMESPACE" value="HierarchyProxy\\" />
    </php>

    <testsuites>
        <testsuite name="Project Test Suite">
            <directory>tests</directory>
        </testsuite>
    </testsuites>
</phpunit>

Now let’s create a new test class and the first method called testBehavior. We will leave it blank for now but later it will contain the testing logic.

<?php
declare(strict_types=1);

use PHPUnit\Framework\TestCase;

final class CategoryRepositoryTest extends TestCase
{
    public function testBehavior(): void
    {
        // TODO: write test
    }
}

We need to prepare the test data first, be it a small category tree.

private static function createCategories(
    EntityManagerInterface $em,
    CategoryRepositoryInterface $repository
): void {
    $parent = new Category($repository->nextCategoryId(), 'foo', 'Foo');
    $child = new Category($repository->nextCategoryId(), 'bar', 'Bar', $parent);
    $grandChild = new Category($repository->nextCategoryId(), 'qux', 'Qux', $child);
    $grandGrandChild = new Category($repository->nextCategoryId(), 'doo', 'Doo', $grandChild);

    foreach ([$parent, $child, $grandChild, $grandGrandChild] as $entity) {
        $em->persist($entity);
    }

    $em->flush();
    $em->clear();
}

Next, we implement methods to create an instance of Entity Manager and database Connection.

private static function createEntityManager(Connection $connection): EntityManagerInterface
{
    $config = new ORMConfiguration();
    $config->setMetadataDriverImpl(new AttributeDriver([$_SERVER['DOCTRINE_SRC_DIR']]));
    $config->setProxyDir($_SERVER['DOCTRINE_PROXY_DIR']);
    $config->setProxyNamespace($_SERVER['DOCTRINE_PROXY_NAMESPACE']);

    return new EntityManager($connection, $config);
}

private static function createConnection(LoggerInterface $logger): Connection
{
    $params = [
        'host' => $_SERVER['DB_HOST'],
        'port' => $_SERVER['DB_PORT'],
        'user' => $_SERVER['DB_USER'],
        'password' => $_SERVER['DB_PASSWORD'],
        'dbname' => $_SERVER['DB_NAME'],
        'driver' => $_SERVER['DB_DRIVER']
    ];

    $config = new DBALConfiguration();
    $config->setMiddlewares([
        new \Doctrine\DBAL\Logging\Middleware($logger)
    ]);

    return DriverManager::getConnection($params, $config);
}

The goal of the test is to check the SQL queries issued by Doctrine. So we need a logger. I hope that you update your project dependencies in time and what I’m going to say is something obvious to you. Anyway, I’d like to stress your attention on the configuration of the database connection. Doctrine DBAL 3.2.0 introduced middlewares, deprecating Configuration::setSQLLogger() and the Doctrine\DBAL\Logging\DebugStack class which started generating warnings. Before, you could write something like this:

$debugStack = new DebugStack();
$em->getConnection()->getConfiguration()->setSQLLogger($debugStack);
// issue and study queries
$debugStack->queries;

But now it should be a middleware like in the example above. As there’s no alternative to DebugStack out of the box anymore, we need to implement it ourselves. Fortunately, it’s not that hard:

<?php
declare(strict_types=1);

namespace Hierarchy\Tests;

use Psr\Log\AbstractLogger;
use Stringable;

final class TestDatabaseLogger extends AbstractLogger
{
    private array $queries = [];
    
    public function log($level, Stringable|string $message, array $context = []): void
    {
        if (array_key_exists('sql', $context)) {
            $this->queries[] = $context;
        }
    }

    /**
     * @return string[]
     */
    public function selectQueries(): array
    {
        $filter = static fn (string $query) => str_starts_with($query, 'SELECT');
        
        return array_values(array_filter(array_column($this->queries, 'sql'), $filter));
    }
}

Let’s put the pieces together in the setUpTestData() method:

#[ArrayShape([TestDatabaseLogger::class, CategoryRepositoryInterface::class])]
private static function setUpTestData(string $repositoryClassName): array
{
    $logger = new TestDatabaseLogger();

    $connection = self::createConnection($logger);
    $connection->executeStatement('delete from category');

    $em = self::createEntityManager($connection);
    $repository = new $repositoryClassName($em);

    self::createCategories($em, $repository);

    return [$logger, $repository];
}

Thanks to $repositoryClassName we don't need to “hardcode” a specific repository class. This way we will be able to test both implementation, basic and optimized, without even touching the test and the code preparing the test data. Here is the test:

/**
 * @dataProvider behaviourDataProvider
 */
public function testBehavior(string $repositoryClassName, array $expectedQueries): void
{
    [$logger, $repository] = self::setUpTestData($repositoryClassName);

    // foo_bar, foo_bar_qux, foo_bar_qux_doo
    $descendants = $repository->findDescendantsByUid('foo');
    
    self::assertEquals('foo_bar_qux_doo', $descendants[0]->children()->first()->children()->first()->uid());
    self::assertEquals($expectedQueries, $logger->selectQueries());
}

public function behaviourDataProvider(): array
{
    return [
        [
            DefaultCategoryRepository::class,
            [
                'SELECT c0_.parent_id AS parent_id_0, c0_.uid AS uid_1, c0_.created_at AS created_at_2, c0_.updated_at AS updated_at_3, c0_.id AS id_4, c0_.name AS name_5, c0_.parent_id AS parent_id_6 FROM category c0_ WHERE c0_.uid LIKE ? ORDER BY c0_.uid ASC',
                'SELECT t0.parent_id AS parent_id_1, t0.uid AS uid_2, t0.created_at AS created_at_3, t0.updated_at AS updated_at_4, t0.id AS id_5, t0.name AS name_6, t0.parent_id AS parent_id_7 FROM category t0 WHERE t0.parent_id = ?',
                'SELECT t0.parent_id AS parent_id_1, t0.uid AS uid_2, t0.created_at AS created_at_3, t0.updated_at AS updated_at_4, t0.id AS id_5, t0.name AS name_6, t0.parent_id AS parent_id_7 FROM category t0 WHERE t0.parent_id = ?'
            ]
        ]
    ];
}

Now ensure it completes successfully:

PHPUnit 9.5.27 by Sebastian Bergmann and contributors.

.                    1 / 1 (100%)

Time: 00:00.093, Memory: 10.00 MB

OK (1 test, 2 assertions)
Process finished with exit code 0

The core logic of the test is:

  1. Fetch all descendants of foo.
  2. Take the foo_bar entity, move deep into the sub-tree by calling children() multiple times and ensure that every such call issues an extra SQL query.

It turns out that Doctrine fetches OneToMany collection from the database every time we reference the configured property despite the fact that those entities have already been loaded into memory. This is what we are going to fix now.

Optimization

Let’s create another repository called ImprovedCategoryRepository with the following contents:

<?php
declare(strict_types=1);

namespace Hierarchy;

use Doctrine\ORM\EntityManagerInterface;
use Doctrine\ORM\PersistentCollection;

final readonly class ImprovedCategoryRepository implements CategoryRepositoryInterface
{
    public function __construct(private EntityManagerInterface $em)
    {
    }

    public function nextCategoryId(): int
    {
        return (int) $this->em
            ->getConnection()
            ->executeQuery("select nextval('category_id_seq')")
            ->fetchOne();
    }

    /**
     * @return Category[]
     */
    public function findDescendantsByUid(string $uid): array
    {
        $result = $this->em
            ->createQueryBuilder()
            ->select('c')
            ->from(Category::class, 'c')
            ->indexBy('c', 'c.id')
            ->where('c.uid LIKE :uid')
            ->setParameter('uid', sprintf('%s_%%', $uid))
            ->orderBy('c.uid')
            ->getQuery()
            ->getResult();

        return $this->setUpCategoryRelations($result);
    }

    /**
     * @param Category[] $categories
     * @return Category[]
     */
    private function setUpCategoryRelations(array $categories): array
    {
        $metadata = $this->em->getClassMetadata(Category::class);
        $idField = $metadata->reflFields['id'];
        $parentField = $metadata->reflFields['parent'];
        $parentIdField = $metadata->reflFields['parentId'];
        $childrenField = $metadata->reflFields['children'];

        foreach ($categories as $category) {
            /** @var PersistentCollection $children */
            $children = $childrenField->getValue($category);
            $children->setInitialized(true);

            $parent = $categories[$parentIdField->getValue($category)] ?? null;

            if ($parent === null) {
                continue;
            }

            /** @var PersistentCollection $children */
            $children = $childrenField->getValue($parent);

            if (!$children->contains($category)) {
                $parentField->setValue($category, $parent);
                $parentIdField->setValue($category, $idField->getValue($parent));
                $children->add($category);
            }
        }

        return array_values($categories);
    }
}


In terms of DQL, the only difference to the DefaultCategoryRepository implementation is that we tell Doctrine to index the results by the category ID. We need this later during traversal of the collection:

$this->em
     ->createQueryBuilder()
     ->select('c')
     ->from(Category::class, 'c')
     ->indexBy('c', 'c.id')

The most interesting stuff lies in the setUpCategoryRelations() method. Using metadata, kindly provided by Doctrine, we take the properties of the Category entity we’re going to work with:

$metadata = $this->em->getClassMetadata(Category::class);
$idField = $metadata->reflFields['id'];
$parentField = $metadata->reflFields['parent'];
$parentIdField = $metadata->reflFields['parentId'];
$childrenField = $metadata->reflFields['children'];

Then we trick Doctrine by saying that all collections of the child entities have been initialized:

/** @var PersistentCollection $children */
$children = $childrenField->getValue($category);
$children->setInitialized(true);

From now on, any call to Category::$children won’t cause Doctrine to trigger an extra database query! But it’s only one part of our fix because the collections are still empty. We need to fill them them by hand:

$parent = $categories[$parentIdField->getValue($category)] ?? null;

if ($parent === null) {
    continue;
}

/** @var PersistentCollection $children */
$children = $childrenField->getValue($parent);

if (!$children->contains($category)) {
    $parentField->setValue($category, $parent);
    $parentIdField->setValue($category, $idField->getValue($parent));
    $children->add($category);
}

Reflection makes the code look a bit ugly. But it has some advantages:

  1. Technical manipulations over entities are clearly separated from business logic.
  2. No change to entity’s methods will break the repository.

In fact, the code above does the same as if we wrote the following in our Category entity and then used that in the repository:

public function addChild(self $child): void
{
    if (!$this->children->contains($child)) {
        $child->parent = $this;
        $child->parentId = $this->id;
        $this->children->add($child);
    }
}

Testing once again

All we need to do to test the optimized implementation of our CategoryRepositoryInterface is just add another dataset to the data provider CategoryRepositoryTest::behaviorDataProvider():

public function behaviourDataProvider(): array
{
    return [
        [
            DefaultCategoryRepository::class,
            [
                'SELECT c0_.parent_id AS parent_id_0, c0_.uid AS uid_1, c0_.created_at AS created_at_2, c0_.updated_at AS updated_at_3, c0_.id AS id_4, c0_.name AS name_5, c0_.parent_id AS parent_id_6 FROM category c0_ WHERE c0_.uid LIKE ? ORDER BY c0_.uid ASC',
                'SELECT t0.parent_id AS parent_id_1, t0.uid AS uid_2, t0.created_at AS created_at_3, t0.updated_at AS updated_at_4, t0.id AS id_5, t0.name AS name_6, t0.parent_id AS parent_id_7 FROM category t0 WHERE t0.parent_id = ?',
                'SELECT t0.parent_id AS parent_id_1, t0.uid AS uid_2, t0.created_at AS created_at_3, t0.updated_at AS updated_at_4, t0.id AS id_5, t0.name AS name_6, t0.parent_id AS parent_id_7 FROM category t0 WHERE t0.parent_id = ?'
            ]
        ],
        [
            ImprovedCategoryRepository::class,
            [
                'SELECT c0_.parent_id AS parent_id_0, c0_.uid AS uid_1, c0_.created_at AS created_at_2, c0_.updated_at AS updated_at_3, c0_.id AS id_4, c0_.name AS name_5, c0_.parent_id AS parent_id_6 FROM category c0_ WHERE c0_.uid LIKE ? ORDER BY c0_.uid ASC',
            ]
        ]
    ];
}

So, with this data set we expect that once we have fetched descendants with ImprovedCategoryRepository::findDescendantsByUid() any subsequent calls for nested collections will not trigger extra SQL queries against the database. Let’s verify this by running the tests once again:

PHPUnit 9.5.27 by Sebastian Bergmann and contributors.

..                   2 / 2 (100%)

Time: 00:00.100, Memory: 10.00 MB

OK (2 tests, 4 assertions)
Process finished with exit code 0

Voilà, the fix does work indeed! Our tree was small, I agree. But the effect may be much more noticeable on the large trees once hundreds or thousands of queries get reduced to just one. That’s been tested in practice. Hopefully, the article was helpful to you.