Alexander Sosna: How we execute PG major upgrades at GitLab, with zero downtime. (PGConf.EU 2023)

GitLab's approach to executing PostgreSQL major upgrades with zero downtime, using a two-cluster strategy, incremental upgrades, and custom solutions for sequences and logical replication.

Key takeaways
  • To perform a PostgreSQL major upgrade with zero downtime, GitLab uses a two-cluster approach: a target cluster running the new major version is built alongside the existing source cluster and follows it like a standby until switchover.
  • To minimize user impact, the upgrade is performed during off-peak hours and rolled out incrementally, with logical replication used to synchronize data between the two clusters (see the first sketch after this list).
  • The upgrade process involves creating a new target cluster, streaming data to it from the source cluster, and then switching the production load over to it.
  • Sequences are critical for the application but are not carried over by PostgreSQL's logical replication, so GitLab uses a custom solution involving a sequence number generator and a logical replication slot (see the sequence sketch after this list).
  • The upgrade process is automated using Chef, a configuration management tool, which ensures that machines are provisioned and configured correctly.
  • The team also uses rsync to transfer data between clusters and to keep the two clusters synchronized.
  • Logical replication is complex and requires careful testing and validation.
  • Schema changes are also handled automatically using Chef.
  • Each step of the process (creating the new target cluster, streaming data, switching the production load) is carefully tested and validated.
  • The team relies on heavy testing, including regression testing and QA testing, to ensure that the application works correctly after the upgrade.
  • They also use a benchmarking environment to test the upgrade and ensure that it meets performance requirements.
  • The upgrade process is designed to have zero user impact: all data is replicated and available on the new cluster before the production load is switched (see the switchover sketch after this list).
  • Rather than taking a YOLO (You Only Live Once) approach, the team makes sure the upgrade is thoroughly tested and validated before deployment.
  • The upgrade process is monitored and optimized continuously, with improvements made based on feedback from users and performance metrics.
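
The replication step above can be illustrated with plain logical replication: a publication on the old-version source and a subscription on the new-version target. This is a minimal sketch rather than GitLab's actual tooling; the connection strings, the names upgrade_pub/upgrade_sub, and the use of psycopg2 are assumptions, and the schema must already exist on the target, since logical replication does not copy DDL.

    # Sketch: wire up logical replication from the old-version source cluster
    # to the new-version target cluster. All names below are illustrative.
    import psycopg2

    SOURCE_DSN = "host=source-primary dbname=gitlabhq_production user=replicator"
    TARGET_DSN = "host=target-primary dbname=gitlabhq_production user=replicator"

    def run(dsn, sql):
        """Run one statement with autocommit; CREATE SUBSCRIPTION refuses to
        run inside a transaction block."""
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
        finally:
            conn.close()

    # On the source (publisher): publish every table in the database.
    run(SOURCE_DSN, "CREATE PUBLICATION upgrade_pub FOR ALL TABLES")

    # On the target (subscriber): the subscription first copies the existing
    # table contents, then streams ongoing changes from the source.
    run(
        TARGET_DSN,
        "CREATE SUBSCRIPTION upgrade_sub "
        "CONNECTION 'host=source-primary dbname=gitlabhq_production user=replicator' "
        "PUBLICATION upgrade_pub",
    )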
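
Because logical replication does not carry sequence values, the talk describes a custom solution built around a sequence number generator and a logical replication slot; that implementation is not detailed here. The sketch below shows only the simpler, commonly used fallback of copying each sequence's last_value to the target with a safety margin, reusing the assumed connection strings from the previous sketch.

    # Sketch: copy sequence values from source to target, since logical
    # replication leaves sequences behind. Assumes identifiers that need no
    # quoting and an arbitrary safety margin; GitLab's real solution differs.
    import psycopg2

    SOURCE_DSN = "host=source-primary dbname=gitlabhq_production user=replicator"
    TARGET_DSN = "host=target-primary dbname=gitlabhq_production user=replicator"
    SAFETY_MARGIN = 10_000  # headroom in case the source advances during the copy

    src = psycopg2.connect(SOURCE_DSN)
    dst = psycopg2.connect(TARGET_DSN)
    try:
        with src.cursor() as s, dst.cursor() as d:
            # pg_sequences (PostgreSQL 10+) lists every sequence and its last value.
            s.execute("SELECT schemaname, sequencename, last_value FROM pg_sequences")
            for schema, name, last_value in s.fetchall():
                if last_value is None:  # sequence has never been used
                    continue
                # setval() takes a regclass, so a schema-qualified name as text works.
                d.execute("SELECT setval(%s, %s)",
                          (f"{schema}.{name}", last_value + SAFETY_MARGIN))
        dst.commit()
    finally:
        src.close()
        dst.close()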
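
Before the production load is switched, the target has to have received everything the source has written. A minimal sketch of such a switchover gate follows, assuming the subscription name upgrade_sub from the first sketch and that writes on the source have already been stopped (for example at the application or the connection pooler).

    # Sketch: wait until the target's subscription has caught up with the
    # source's current WAL position, then declare it safe to switch traffic.
    import time
    import psycopg2

    SOURCE_DSN = "host=source-primary dbname=gitlabhq_production user=replicator"
    TARGET_DSN = "host=target-primary dbname=gitlabhq_production user=replicator"
    SUBSCRIPTION = "upgrade_sub"  # illustrative name

    src = psycopg2.connect(SOURCE_DSN)
    dst = psycopg2.connect(TARGET_DSN)
    src.autocommit = True
    dst.autocommit = True

    def replication_lag_bytes():
        """Bytes of WAL the target has not yet received from the source."""
        with src.cursor() as cur:
            cur.execute("SELECT pg_current_wal_lsn()")
            source_lsn = cur.fetchone()[0]
        with dst.cursor() as cur:
            cur.execute(
                "SELECT pg_wal_lsn_diff(%s, received_lsn) "
                "FROM pg_stat_subscription WHERE subname = %s",
                (source_lsn, SUBSCRIPTION),
            )
            row = cur.fetchone()
        return row[0] if row else None

    # Poll until the target has fully caught up, then hand over the load.
    while True:
        lag = replication_lag_bytes()
        if lag is not None and lag <= 0:
            break
        time.sleep(1)

    print("Target has caught up; production load can be switched.")
    src.close()
    dst.close()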