Genomic-scale Data Pipelines • Lynn Langit & Denis Bauer • YOW! 2017

Learn how Lynn Langit and Denis Bauer tackle genomic-scale data analysis using Apache Spark, machine learning, and serverless architectures to process billions of DNA data points efficiently.

Key takeaways
  • VariantSpark is a custom machine learning library built on top of Apache Spark Core, designed to analyze genomic-scale data with better performance than existing solutions

  • The human genome contains 3 billion letters (DNA base pairs) and around 2 million differences between individuals, making genomic analysis a massive big data challenge
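
The arithmetic behind those figures can be sketched directly (a back-of-envelope sketch only; the byte estimate assumes 2 bits per base and ignores real file-format overhead):

```python
# Back-of-envelope scale of a single genome, using the figures above.
BASES = 3_000_000_000     # ~3 billion base pairs per human genome
VARIANTS = 2_000_000      # ~2 million differences between two individuals

# Each base is one of A, C, G, T -> 2 bits, so a raw genome is ~750 MB
# before any file-format or quality-score overhead.
raw_bytes = BASES * 2 // 8
print(f"raw genome: ~{raw_bytes / 1e6:.0f} MB")

# Variants are rare relative to the whole sequence, which is why most
# pipelines analyze variant calls rather than raw reads.
print(f"variant fraction: {VARIANTS / BASES:.4%}")
```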

  • Random forest machine learning algorithms were chosen for genomic analysis because they handle outliers well, require less data cleaning, and work with both low- and high-dimensional datasets
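
A toy sketch of the random forest idea (bootstrap sampling, a random feature subset per tree, and a majority vote) in plain Python; this is illustrative only, not VariantSpark's implementation, which is a distributed library built on Spark:

```python
import random
from collections import Counter

def train_stump(rows, labels, n_features):
    """Fit a depth-1 tree on a bootstrap sample using a random feature subset."""
    idx = [random.randrange(len(rows)) for _ in range(len(rows))]  # bootstrap
    feats = random.sample(range(n_features), k=max(1, n_features // 2))
    best = None
    for f in feats:
        for t in {r[f] for r in rows}:
            left = [labels[i] for i in idx if rows[i][f] <= t]
            right = [labels[i] for i in idx if rows[i][f] > t]
            if not left or not right:
                continue
            # Score a split by how pure each side is (majority-class counts).
            score = max(Counter(left).values()) + max(Counter(right).values())
            if best is None or score > best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    _, f, t, lbl_left, lbl_right = best
    return lambda row: lbl_left if row[f] <= t else lbl_right

def forest_predict(stumps, row):
    """Majority vote across the ensemble smooths out noisy individual trees."""
    return Counter(s(row) for s in stumps).most_common(1)[0][0]
```

Because each tree sees a resampled subset of rows and features, a few outlying samples cannot dominate the vote, which is the robustness property the talk highlights.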

  • Serverless architectures proved effective for genomic search applications (GT-Scan), reducing costs and improving accessibility for researchers compared to traditional server clusters
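
The serverless pattern can be illustrated with a minimal Lambda-style handler (hypothetical code, not GT-Scan's actual implementation; the event shape follows the common API Gateway proxy convention, and the GC-content computation is a stand-in for the real genomic scoring):

```python
import json

def handler(event, context=None):
    """Stateless, per-request function: each genomic query is one short-lived
    invocation, so you pay per request instead of running a cluster 24/7."""
    target = event.get("queryStringParameters", {}).get("sequence", "")
    if not target or set(target) - set("ACGT"):
        return {"statusCode": 400,
                "body": json.dumps({"error": "expected a DNA sequence of A/C/G/T"})}
    # Placeholder for the real work (e.g., scoring candidate target sites).
    gc = sum(base in "GC" for base in target) / len(target)
    return {"statusCode": 200,
            "body": json.dumps({"sequence": target, "gc_content": round(gc, 3)})}
```

For example, `handler({"queryStringParameters": {"sequence": "GATTACA"}})` returns a 200 response, while a malformed sequence returns a 400, all without any server to provision or keep warm.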

  • Jupyter notebooks are crucial for bioinformatics workflows, enabling:

    • Code execution in multiple languages (Python, R, Scala)
    • Interactive visualizations
    • Documentation and reproducibility
    • Easy sharing among researchers
  • Cloud technologies like Amazon EMR and Databricks provide the necessary computational power, but costs need optimization (e.g., cluster costs of $9,000 are too high for frequent testing)

  • Performance testing showed an 80% runtime improvement after optimizing the serverless architecture and carefully evaluating cloud services
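
How a runtime-improvement figure like that is computed (the 100 s and 20 s numbers below are illustrative, not measurements from the talk):

```python
def runtime_improvement(before_s, after_s):
    """Percent reduction in runtime: (before - after) / before * 100."""
    return (before_s - after_s) / before_s * 100

# Illustrative numbers only: an 80% runtime reduction is a 5x speedup.
print(runtime_improvement(100.0, 20.0))  # prints 80.0
```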

  • Integration of visualization tools is critical for understanding complex genomic data and validating machine learning results

  • The field of genomics represents one of the largest big data challenges, with predictions that 50% of the world’s population will be sequenced by 2025

  • Cross-disciplinary collaboration between bioinformaticians and software engineers is essential for building effective genomic data pipelines