"Four years of Datomic powered ETL in anger with CANDEL" by Ben Kamphaus and Marshall Thompson
Discover the design and implementation of a Datomic-powered ETL pipeline for processing genomic data, tackling data science challenges and demonstrating the power of schema-based databases.
- Different data formats require different tools: The speakers highlight the difficulty of working with varied data formats, such as JSON, CSV, and genomic data files.
- ETL pipeline: The ETL (Extract, Transform, Load) pipeline was built with Clojure and Datomic to manage data flow and processing.
- Data science challenges: The speakers discuss the difficulties of data science work, including data cleaning, feature extraction, and dealing with large datasets.
- Datomic use case: The speakers demonstrate the use of Datomic as a schema-based database for storing and querying heterogeneous genomic data.
- Project background: The project aims to support the development of breakthrough immune therapies, with a focus on multi-omics data and machine learning.
- Data processing: The speakers emphasize the importance of handling raw data carefully, including preprocessing, feature extraction, and data integration.
- Schema evolution: The schema has evolved over time to accommodate new features and data types, highlighting the need for dynamic schema management.
- Data visualization: The speakers show examples of data visualization tools, including Mantis and Scatter Plot, for interactively exploring genomic data.
- Project outcomes: The project has enabled the processing of large datasets, improved data integration, and accelerated research in the field.
- Lessons learned: The speakers highlight the importance of proper data processing, using a schema-based database, and adapting to changing data requirements.
- Future directions: The speakers suggest future improvements, including better support for distributed queries and improved data visualization tools.
- Software engineering: The speakers emphasize the importance of software engineering principles, such as scalability, reliability, and maintainability, in building a data science pipeline.
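
To make the extract, transform, and load steps with schema checks more concrete, here is a minimal sketch in plain Python. It is not the CANDEL implementation, which the talk describes as built with Clojure and Datomic. The attribute names (`sample/id`, `sample/gene`, `sample/expression`), the field mappings, and the in-memory dictionary standing in for a database are all hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical schema: attribute name -> required Python type.
# A real schema-based database would enforce this on write.
SCHEMA = {
    "sample/id": str,
    "sample/gene": str,
    "sample/expression": float,
}


def extract_csv(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Extract: parse a JSON array of objects into a list of dictionaries."""
    return json.loads(text)


def transform(record, mapping):
    """Transform: rename source fields to schema attributes and coerce types."""
    entity = {}
    for source_field, attribute in mapping.items():
        entity[attribute] = SCHEMA[attribute](record[source_field])
    return entity


def validate(entity):
    """Reject entities that are missing attributes or carry the wrong type."""
    missing = set(SCHEMA) - set(entity)
    if missing:
        raise ValueError(f"missing attributes: {sorted(missing)}")
    for attribute, expected_type in SCHEMA.items():
        if not isinstance(entity[attribute], expected_type):
            raise TypeError(f"{attribute} must be {expected_type.__name__}")
    return entity


def load(store, entities):
    """Load: write validated entities into the store, keyed by sample id."""
    for entity in entities:
        store[entity["sample/id"]] = validate(entity)
    return store


def run_pipeline(csv_text, json_text):
    """Run both sources through the same transform and load steps."""
    # Each source format gets its own field mapping onto the shared schema.
    csv_mapping = {"id": "sample/id", "gene": "sample/gene", "expr": "sample/expression"}
    json_mapping = {"sample": "sample/id", "symbol": "sample/gene", "value": "sample/expression"}

    entities = [transform(row, csv_mapping) for row in extract_csv(csv_text)]
    entities += [transform(obj, json_mapping) for obj in extract_json(json_text)]
    return load({}, entities)
```

The design point is that each input format needs its own extractor and field mapping, while a single shared schema lets everything downstream treat the records uniformly.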