Genomic-scale Data Pipelines • Lynn Langit & Denis Bauer • YOW! 2017
Learn how Lynn Langit and Denis Bauer tackle genomic-scale data analysis using Apache Spark, machine learning, and serverless architectures to process billions of DNA data points efficiently.
- Variant Spark is a custom machine learning library built on top of Apache Spark Core, designed to analyze genomic-scale data with better performance than existing solutions
- The human genome contains about 3 billion letters (DNA base pairs), with around 2 million differences between individuals, which makes genomic analysis a big data challenge
- Random forests were chosen as the machine learning algorithm for genomic analysis because they handle outliers well, require less data cleaning, and work with both low- and high-dimensional datasets
- Serverless architectures proved effective for genomic search applications (GT Scan), reducing costs and improving accessibility for researchers compared to traditional server clusters
- Jupyter notebooks are crucial for bioinformatics workflows, enabling:
  - Code execution in multiple languages (Python, R, Scala)
  - Interactive visualizations
  - Documentation and reproducibility
  - Easy sharing among researchers
- Cloud technologies like Amazon EMR and Databricks provide the necessary computational power, but costs need optimization (e.g., a $9,000 cluster cost is too high for frequent testing)
- Performance testing showed 80% runtime improvements from optimizing serverless architectures and carefully evaluating cloud services
- Integration of visualization tools is critical for understanding complex genomic data and validating machine learning results
- Genomics represents one of the largest big data challenges, with predictions that 50% of the world’s population will be sequenced by 2025
- Cross-disciplinary collaboration between bioinformaticians and software engineers is essential for building effective genomic data pipelines
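
The scale figures in the list above can be turned into a quick sizing estimate. The sketch below is an illustration only: the two genome figures come from the summary, while the cohort size, the one-byte-per-genotype encoding, and the function names are assumptions made for the example.

```python
# Back-of-envelope sizing for a genotype matrix (samples x variants).
# Only the two constants below come from the talk summary; everything
# else is a hypothetical choice made for illustration.

GENOME_BASE_PAIRS = 3_000_000_000   # letters in the human genome
VARIANTS_PER_PERSON = 2_000_000     # approximate differences between individuals


def variant_fraction() -> float:
    """Share of genome positions at which an individual differs."""
    return VARIANTS_PER_PERSON / GENOME_BASE_PAIRS


def genotype_matrix_bytes(samples: int, variants: int, bytes_per_call: int = 1) -> int:
    """Size of a dense matrix with one genotype call per sample and variant.

    Assumes a fixed number of variant columns and `bytes_per_call` bytes
    per cell (hypothetical encoding).
    """
    return samples * variants * bytes_per_call


if __name__ == "__main__":
    cohort = 1_000  # hypothetical cohort size
    size = genotype_matrix_bytes(cohort, VARIANTS_PER_PERSON)
    print(f"variant fraction: {variant_fraction():.4%}")
    print(f"dense matrix for {cohort} samples: {size / 1e9:.1f} GB")
```

In practice the set of variant sites tends to grow as more individuals are added to a cohort, so an estimate like this is better read as a rough floor than as a prediction.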
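
To make the random forest point in the list above more concrete, here is a minimal single-machine sketch of how a forest can rank variants in a "wide" dataset (few samples, many columns). It is not Variant Spark's implementation, which is distributed on Apache Spark. It simplifies each tree to a depth-one split (a stump), and the toy cohort, the causal column, and all function names are assumptions made for the example.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_stump(rows, labels, features):
    """Return (feature, threshold, gain) for the best single split, or (None, None, 0.0)."""
    parent = gini(labels)
    n = len(rows)
    best = (None, None, 0.0)
    for f in features:
        for threshold in (0, 1):  # genotypes are 0, 1 or 2; split on value <= threshold
            left = [y for x, y in zip(rows, labels) if x[f] <= threshold]
            right = [y for x, y in zip(rows, labels) if x[f] > threshold]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > best[2]:
                best = (f, threshold, gain)
    return best


def forest_importance(rows, labels, n_trees=300, mtry=None, seed=0):
    """Sum the impurity decrease per feature over a forest of stumps."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    mtry = mtry or max(1, int(p ** 0.5))
    importance = [0.0] * p
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feats = rng.sample(range(p), mtry)          # random subset of columns
        f, _, gain = best_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats)
        if f is not None:
            importance[f] += gain
    return importance


def make_toy_cohort(n_samples=60, n_variants=200, causal=17, seed=1):
    """Hypothetical data: random genotypes, label driven by one causal column."""
    rng = random.Random(seed)
    rows = [[rng.choice((0, 1, 2)) for _ in range(n_variants)]
            for _ in range(n_samples)]
    labels = [1 if row[causal] >= 1 else 0 for row in rows]
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_toy_cohort()
    scores = forest_importance(rows, labels)
    top = max(range(len(scores)), key=scores.__getitem__)
    print("highest-ranked variant column:", top)
```

Sampling both rows and columns is what lets the approach cope with far more columns than samples: each tree only inspects a small subset of variants, and the columns that repeatedly produce clean splits accumulate the highest scores.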