Building an open source data lake at scale in the cloud

Join this session to learn how to build an open source data lake at scale in the cloud, using event-driven replication, Hive, S3, Kafka, and other open source tooling, with a focus on scalability and reliability.

Key takeaways
  • Use metadata events to trigger replication instead of cron jobs (see the first sketch after this list)
  • Use S3 for long-term storage and Hive for metadata services
  • Expose read-only endpoints to end-users and use Kubalt for deployment
  • Test data processing jobs thoroughly
  • Federate data access across regions and use event-based processing for scalability
  • Use a stream-first approach to minimize latency
  • Utilize open-source tools and contribute to their development
  • Implement a centralized platform for the data lake and replicate data into another region for disaster recovery
  • Use a plugin architecture to allow for custom extensions
  • Use a distributed file system such as HDFS or S3 for storing data and metadata
  • Use table formats such as Iceberg and Delta Lake, which manage data through a metadata layer
  • Implement a proxy to expose a single API for federated Hive Metastores (see the second sketch after this list)
  • Use Amazon S3 and EMR for big data capabilities
  • Contribute to open-source unit testing frameworks
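
For the event-driven replication takeaway above, the sketch below shows one possible shape of such a trigger: a consumer reads Hive Metastore change events from a Kafka topic and submits a replication job when a new partition lands, rather than polling on a cron schedule. The topic name, event fields, and the trigger_replication helper are assumptions for illustration and are not taken from the session.

```python
# Minimal sketch: react to Hive Metastore change events instead of cron polling.
# Topic name, event schema, and trigger_replication() are hypothetical.
import json

from kafka import KafkaConsumer  # pip install kafka-python


def trigger_replication(database: str, table: str, partition: str) -> None:
    """Placeholder: submit a replication job for one partition to the DR region."""
    print(f"Replicating {database}.{table} partition {partition}")


consumer = KafkaConsumer(
    "metastore-events",                       # assumed topic carrying metastore events
    bootstrap_servers=["kafka:9092"],
    group_id="replication-trigger",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Only react to events that add data; ignore reads, drops, etc.
    if event.get("eventType") == "ADD_PARTITION":
        trigger_replication(event["dbName"], event["tableName"], event["partition"])
```

Because replication is driven by what actually changed, only the affected partitions are copied and latency drops compared with a fixed schedule.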
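For the federated Hive Metastore takeaway, the sketch below shows how a client such as Spark might treat a federating proxy (Waggle Dance is one open source example of such a proxy) as a single metastore by pointing hive.metastore.uris at its endpoint. The proxy host, port, and database names are assumptions.

```python
# Minimal sketch: query tables from several regional Hive Metastores through
# a single federating proxy endpoint. Host, port, and database names are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("federated-metastore-client")
    # Point Spark at the proxy as if it were one metastore; the proxy
    # routes each request to the appropriate regional metastore.
    .config("hive.metastore.uris", "thrift://metastore-proxy.example.com:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Databases from different regions appear side by side through the proxy,
# while the data itself stays in S3 in its home region.
spark.sql("SHOW DATABASES").show()
spark.sql("SELECT COUNT(*) FROM us_east_lake.bookings").show()
```

Exposing the proxy read-only to end users keeps regional metastores as the single writable source of truth while still federating access across regions.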