"Four years of Datomic powered ETL in anger with CANDEL" by Ben Kamphaus and Marshall Thompson

Discover the design and implementation of a Datomic-powered ETL pipeline for processing genomic data, tackling data science challenges and demonstrating the power of schema-based databases.

Key takeaways
  • Different data formats require different tools: The speakers highlight the difficulty of working with heterogeneous inputs, such as JSON, CSV, and specialized genomic file formats, each of which demands its own handling.
  • ETL pipeline: The pipeline, built in Clojure on Datomic, follows the Extract, Transform, Load (ETL) pattern to manage data flow and processing.
  • Data science challenges: The speakers discuss the difficulties of data science work, including data cleaning, feature extraction, and dealing with large datasets.
  • Datomic use case: The speakers demonstrate the use of Datomic as a schema-based database for storing and querying heterogeneous genomic data.
  • Project background: The project aims to support the development of breakthrough immune therapies, with a focus on multi-omics data and machine learning.
  • Data processing: The speakers emphasize the importance of properly handling raw data through preprocessing, feature extraction, and integration steps.
  • Schema evolution: The schema has evolved over time to accommodate new features and data types, highlighting the need for dynamic schema management.
  • Data visualization: The speakers show examples of data visualization tooling, including Mantis, for interactively exploring genomic data through views such as scatter plots.
  • Project outcomes: The project has enabled the processing of large datasets, improved data integration, and accelerated research in the field.
  • Lessons learned: The speakers highlight the importance of proper data processing, using a schema-based database, and adapting to changing data requirements.
  • Future directions: The speakers suggest future improvements, including better support for distributed queries and improved data visualization tools.
  • Software engineering: The speakers emphasize the importance of software engineering principles, such as scalability, reliability, and maintainability, in building a data science pipeline.
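To make the schema-related takeaways above concrete: in Datomic, schema is itself plain data that you transact like any other facts, which is what makes the dynamic schema evolution described above a routine operation. The sketch below is a hypothetical illustration only (the attribute names are invented for this example, not the actual CANDEL schema):

```clojure
;; Hypothetical attributes for a genomic measurement entity.
;; In Datomic, schema is ordinary data transacted into the database.
(def measurement-schema
  [{:db/ident       :measurement/sample-id
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/doc         "External identifier of the biological sample."}
   {:db/ident       :measurement/gene
    :db/valueType   :db.type/ref
    :db/cardinality :db.cardinality/one
    :db/doc         "Reference to the gene entity this measurement describes."}
   {:db/ident       :measurement/tpm
    :db/valueType   :db.type/double
    :db/cardinality :db.cardinality/one
    :db/doc         "Transcripts per million for this gene in this sample."}])

;; Schema evolution: accommodating a new data type later is just
;; another transaction of additional attribute definitions.
(def added-later
  [{:db/ident       :measurement/batch
    :db/valueType   :db.type/string
    :db/cardinality :db.cardinality/one
    :db/doc         "Processing batch, added after the initial design."}])

;; (d/transact conn {:tx-data measurement-schema})
;; (d/transact conn {:tx-data added-later})
```

Because attributes are only ever added, existing data and queries keep working as the schema grows, which is what makes this approach suit evolving multi-omics datasets.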