Alexander Sosna: How we execute PG major upgrades at GitLab, with zero downtime. (PGConf.EU 2023)

GitLab's approach to executing PostgreSQL major upgrades with zero downtime, using a two-cluster strategy, incremental upgrades, and custom solutions for sequences and logical replication.

Key takeaways
  • To perform a PostgreSQL major upgrade with zero downtime, GitLab uses a two-cluster approach: a target cluster running the new major version is built alongside the existing source cluster and follows it like a standby until switchover.
  • To minimize user impact, the upgrade is performed during off-peak hours and rolled out incrementally, with logical replication used to synchronize data between the two clusters (see the first sketch after this list).
  • The upgrade process involves creating a new target cluster, streaming data to it from the source cluster, and then switching the production load over to it.
  • Sequences are critical for the application but are not carried over by PostgreSQL's logical replication, so GitLab uses a custom solution involving a sequence number generator and a logical replication slot (see the sequence sketch after this list).
  • The upgrade process is automated using Chef, a configuration management tool, which ensures that machines are provisioned and configured correctly.
  • The team also uses rsync to transfer data between clusters and to keep the two clusters synchronized.
  • Logical replication is complex and requires careful testing and validation.
  • Schema changes are also handled automatically using Chef.
  • Each step of the process (creating the new target cluster, streaming data, switching the production load) is carefully tested and validated.
  • The team relies on heavy testing, including regression testing and QA testing, to ensure that the application works correctly after the upgrade.
  • They also use a benchmarking environment to test the upgrade and ensure that it meets performance requirements.
  • The upgrade process is designed to have zero user impact: all data is replicated and available on the new cluster before the production load is switched (see the switchover sketch after this list).
  • Rather than taking a YOLO (You Only Live Once) approach, the team makes sure the upgrade is thoroughly tested and validated before deployment.
  • The upgrade process is monitored and optimized continuously, with improvements made based on feedback from users and performance metrics.
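
The replication step above can be illustrated with plain logical replication: a publication on the old-version source and a subscription on the new-version target. This is a minimal sketch rather than GitLab's actual tooling; the connection strings, the names upgrade_pub/upgrade_sub, and the use of psycopg2 are assumptions, and the schema must already exist on the target, since logical replication does not copy DDL.

    # Sketch: wire up logical replication from the old-version source cluster
    # to the new-version target cluster. All names below are illustrative.
    import psycopg2

    SOURCE_DSN = "host=source-primary dbname=gitlabhq_production user=replicator"
    TARGET_DSN = "host=target-primary dbname=gitlabhq_production user=replicator"

    def run(dsn, sql):
        """Run one statement with autocommit; CREATE SUBSCRIPTION refuses to
        run inside a transaction block."""
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
        finally:
            conn.close()

    # On the source (publisher): publish every table in the database.
    run(SOURCE_DSN, "CREATE PUBLICATION upgrade_pub FOR ALL TABLES")

    # On the target (subscriber): the subscription first copies the existing
    # table contents, then streams ongoing changes from the source.
    run(
        TARGET_DSN,
        "CREATE SUBSCRIPTION upgrade_sub "
        "CONNECTION 'host=source-primary dbname=gitlabhq_production user=replicator' "
        "PUBLICATION upgrade_pub",
    )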
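
Because logical replication does not carry sequence values, the talk describes a custom solution built around a sequence number generator and a logical replication slot; that implementation is not detailed here. The sketch below shows only the simpler, commonly used fallback of copying each sequence's last_value to the target with a safety margin, reusing the assumed connection strings from the previous sketch.

    # Sketch: copy sequence values from source to target, since logical
    # replication leaves sequences behind. Assumes identifiers that need no
    # quoting and an arbitrary safety margin; GitLab's real solution differs.
    import psycopg2

    SOURCE_DSN = "host=source-primary dbname=gitlabhq_production user=replicator"
    TARGET_DSN = "host=target-primary dbname=gitlabhq_production user=replicator"
    SAFETY_MARGIN = 10_000  # headroom in case the source advances during the copy

    src = psycopg2.connect(SOURCE_DSN)
    dst = psycopg2.connect(TARGET_DSN)
    try:
        with src.cursor() as s, dst.cursor() as d:
            # pg_sequences (PostgreSQL 10+) lists every sequence and its last value.
            s.execute("SELECT schemaname, sequencename, last_value FROM pg_sequences")
            for schema, name, last_value in s.fetchall():
                if last_value is None:  # sequence has never been used
                    continue
                # setval() takes a regclass, so a schema-qualified name as text works.
                d.execute("SELECT setval(%s, %s)",
                          (f"{schema}.{name}", last_value + SAFETY_MARGIN))
        dst.commit()
    finally:
        src.close()
        dst.close()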
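
Before the production load is switched, the target has to have received everything the source has written. A minimal sketch of such a switchover gate follows, assuming the subscription name upgrade_sub from the first sketch and that writes on the source have already been stopped (for example at the application or the connection pooler).

    # Sketch: wait until the target's subscription has caught up with the
    # source's current WAL position, then declare it safe to switch traffic.
    import time
    import psycopg2

    SOURCE_DSN = "host=source-primary dbname=gitlabhq_production user=replicator"
    TARGET_DSN = "host=target-primary dbname=gitlabhq_production user=replicator"
    SUBSCRIPTION = "upgrade_sub"  # illustrative name

    src = psycopg2.connect(SOURCE_DSN)
    dst = psycopg2.connect(TARGET_DSN)
    src.autocommit = True
    dst.autocommit = True

    def replication_lag_bytes():
        """Bytes of WAL the target has not yet received from the source."""
        with src.cursor() as cur:
            cur.execute("SELECT pg_current_wal_lsn()")
            source_lsn = cur.fetchone()[0]
        with dst.cursor() as cur:
            cur.execute(
                "SELECT pg_wal_lsn_diff(%s, received_lsn) "
                "FROM pg_stat_subscription WHERE subname = %s",
                (source_lsn, SUBSCRIPTION),
            )
            row = cur.fetchone()
        return row[0] if row else None

    # Poll until the target has fully caught up, then hand over the load.
    while True:
        lag = replication_lag_bytes()
        if lag is not None and lag <= 0:
            break
        time.sleep(1)

    print("Target has caught up; production load can be switched.")
    src.close()
    dst.close()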