Building an open source data lake at scale in the cloud

Join this session to learn how to build an open source data lake at scale in the cloud, using event-driven replication, Hive, S3, Kafka, and other open source tooling, with a focus on scalability and reliability.

Key takeaways
  • Use metadata events to trigger replication instead of cron jobs (see the first sketch after this list)
  • Use S3 for long-term storage and Hive for metadata services
  • Expose read-only endpoints to end-users and use Kubalt for deployment
  • Test data processing jobs thoroughly
  • Federate data access across regions and use event-based processing for scalability
  • Use a stream-first approach to minimize latency
  • Utilize open-source tools and contribute to their development
  • Implement a centralized platform for the data lake and replicate data into another region for disaster recovery
  • Use a plugin architecture to allow for custom extensions
  • Use a distributed file system such as HDFS or S3 for storing data and metadata
  • Use table formats such as Iceberg and Delta Lake, which manage data through a metadata layer
  • Implement a proxy to expose a single API for federated Hive Metastores (see the second sketch after this list)
  • Use Amazon S3 and EMR for big data capabilities
  • Contribute to open-source unit testing frameworks
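
For the event-driven replication takeaway above, the sketch below shows one possible shape of such a trigger: a consumer reads Hive Metastore change events from a Kafka topic and submits a replication job when a new partition lands, rather than polling on a cron schedule. The topic name, event fields, and the trigger_replication helper are assumptions for illustration and are not taken from the session.

```python
# Minimal sketch: react to Hive Metastore change events instead of cron polling.
# Topic name, event schema, and trigger_replication() are hypothetical.
import json

from kafka import KafkaConsumer  # pip install kafka-python


def trigger_replication(database: str, table: str, partition: str) -> None:
    """Placeholder: submit a replication job for one partition to the DR region."""
    print(f"Replicating {database}.{table} partition {partition}")


consumer = KafkaConsumer(
    "metastore-events",                       # assumed topic carrying metastore events
    bootstrap_servers=["kafka:9092"],
    group_id="replication-trigger",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Only react to events that add data; ignore reads, drops, etc.
    if event.get("eventType") == "ADD_PARTITION":
        trigger_replication(event["dbName"], event["tableName"], event["partition"])
```

Because replication is driven by what actually changed, only the affected partitions are copied and latency drops compared with a fixed schedule.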
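For the federated Hive Metastore takeaway, the sketch below shows how a client such as Spark might treat a federating proxy (Waggle Dance is one open source example of such a proxy) as a single metastore by pointing hive.metastore.uris at its endpoint. The proxy host, port, and database names are assumptions.

```python
# Minimal sketch: query tables from several regional Hive Metastores through
# a single federating proxy endpoint. Host, port, and database names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("federated-metastore-client")
    # Point Spark at the proxy as if it were one metastore; the proxy
    # routes each request to the appropriate regional metastore.
    .config("hive.metastore.uris", "thrift://metastore-proxy.example.com:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Databases from different regions appear side by side through the proxy,
# while the data itself stays in S3 in its home region.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT COUNT(*) FROM us_east_lake.bookings").show()
```

Exposing the proxy read-only to end users keeps regional metastores as the single writable source of truth while still federating access across regions.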