Prakshi Yadav - Data lake: Design for schema evolution

A data lake design expert discusses the challenges of schema evolution, highlighting Avro's solution and key characteristics of a scalable, reliable, and fault-tolerant data lake.

Key takeaways

  • A data lake is a centralized repository that stores all types of data in raw form, deferring schema enforcement to read time (schema on read) rather than requiring it at write time.
  • Schema evolution is a critical problem in a data lake, because upstream data structures and types change over time.
  • Avro addresses schema evolution through backward and forward compatibility rules, and its container file format stores the data together with the schema that wrote it.
  • The key characteristics of a data lake include scalability, reliability, and fault tolerance.
  • Storage classes should be designed to optimize both storage and retrieval of data, using features such as compression and efficient encoding.
  • Data should be easily searchable and queryable, with metadata and query results available in near real time.
  • The processing layer should handle large volumes of data and be designed for parallel, distributed computation.
  • Role-based access control should be implemented to govern who can read and modify data.
  • Data should be encrypted both at rest and in transit to ensure its security and integrity.
  • The storage solution should accommodate large data volumes and scale on demand.
  • The serialization format should support complex nested structures (such as JSON) and a wide range of data types.
  • A schema registry should provide a centralized location for storing and managing multiple versions of each schema.
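
The compatibility rules mentioned above can be sketched in a few lines. This is a minimal, stdlib-only illustration of the resolution logic Avro applies between a writer schema and a reader schema; the field names and schemas are invented for the example, not taken from the talk, and real Avro also handles type promotion, unions, and binary encoding.

```python
def resolve(record, reader_schema):
    """Project a decoded record onto the reader schema's fields.

    reader_schema is a list of {"name": ..., "default": ...} dicts,
    where "default" may be absent for required fields.
    """
    out = {}
    for field in reader_schema:
        name = field["name"]
        if name in record:
            out[name] = record[name]      # field present in the data
        elif "default" in field:
            out[name] = field["default"]  # field added later: use default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out  # fields unknown to the reader are silently dropped


# Backward compatibility: a v2 reader that added a defaulted field can
# still read a record written with the v1 schema.
old_record = {"id": 7, "name": "sensor-a"}
v2_schema = [
    {"name": "id"},
    {"name": "name"},
    {"name": "region", "default": "unknown"},
]
print(resolve(old_record, v2_schema))
# {'id': 7, 'name': 'sensor-a', 'region': 'unknown'}

# Forward compatibility: a v1 reader ignores the new field in a record
# written with the v2 schema.
new_record = {"id": 8, "name": "sensor-b", "region": "eu-west"}
v1_schema = [{"name": "id"}, {"name": "name"}]
print(resolve(new_record, v1_schema))
# {'id': 8, 'name': 'sensor-b'}
```

The key design point is that defaults in the reader schema make additions backward compatible, while ignoring unknown fields makes the reader tolerant of newer writers.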
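
A schema registry of the kind described can be sketched as a map from subject name to an ordered list of schema versions. The class and method names below are illustrative assumptions, not the speaker's implementation or any particular registry's API; production registries (such as Confluent's) additionally enforce compatibility checks before accepting a new version.

```python
class SchemaRegistry:
    """Toy centralized store for multiple versions of each schema."""

    def __init__(self):
        self._subjects = {}  # subject name -> list of schema versions

    def register(self, subject, schema):
        """Append a new schema version; return its 1-based version number."""
        versions = self._subjects.setdefault(subject, [])
        if versions and versions[-1] == schema:
            return len(versions)  # identical to the latest: no new version
        versions.append(schema)
        return len(versions)

    def get(self, subject, version=None):
        """Fetch a specific version, or the latest when version is None."""
        versions = self._subjects[subject]
        return versions[-1] if version is None else versions[version - 1]


registry = SchemaRegistry()
v1 = registry.register("events", {"fields": ["id", "name"]})
v2 = registry.register("events", {"fields": ["id", "name", "region"]})
print(v1, v2)                    # 1 2
print(registry.get("events"))    # latest version
print(registry.get("events", 1)) # pinned older version
```

Readers that pin a version keep working while writers move to newer schemas, which is what makes the registry the coordination point for schema evolution.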