Genomic-scale Data Pipelines • Lynn Langit & Denis Bauer • YOW! 2017

Learn how Lynn Langit and Denis Bauer tackle genomic-scale data analysis using Apache Spark, machine learning, and serverless architectures to process billions of DNA data points efficiently.

Key takeaways
  • VariantSpark is a custom machine learning library built on top of Apache Spark Core, designed to analyze genomic-scale data with better performance than existing solutions

  • The human genome contains 3 billion letters (DNA base pairs) and around 2 million differences between individuals, making genomic analysis a massive big data challenge
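
The arithmetic behind those figures can be sketched directly (a back-of-envelope sketch only; the byte estimate assumes 2 bits per base and ignores real file-format overhead):

```python
# Back-of-envelope scale of a single genome, using the figures above.
BASES = 3_000_000_000     # ~3 billion base pairs per human genome
VARIANTS = 2_000_000      # ~2 million differences between two individuals

# Each base is one of A, C, G, T -> 2 bits, so a raw genome is ~750 MB
# before any file-format or quality-score overhead.
raw_bytes = BASES * 2 // 8
print(f"raw genome: ~{raw_bytes / 1e6:.0f} MB")

# Variants are rare relative to the whole sequence, which is why most
# pipelines analyze variant calls rather than raw reads.
print(f"variant fraction: {VARIANTS / BASES:.4%}")
```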

  • Random forest machine learning algorithms were chosen for genomic analysis because they handle outliers well, require less data cleaning, and work with both low- and high-dimensional datasets
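
A toy sketch of the random forest idea (bootstrap sampling, a random feature subset per tree, and a majority vote) in plain Python; this is illustrative only, not VariantSpark's implementation, which is a distributed library built on Spark:

```python
import random
from collections import Counter

def train_stump(rows, labels, n_features):
    """Fit a depth-1 tree on a bootstrap sample using a random feature subset."""
    idx = [random.randrange(len(rows)) for _ in range(len(rows))]  # bootstrap
    feats = random.sample(range(n_features), k=max(1, n_features // 2))
    best = None
    for f in feats:
        for t in {r[f] for r in rows}:
            left = [labels[i] for i in idx if rows[i][f] <= t]
            right = [labels[i] for i in idx if rows[i][f] > t]
            if not left or not right:
                continue
            # Score a split by how pure each side is (majority-class counts).
            score = max(Counter(left).values()) + max(Counter(right).values())
            if best is None or score > best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    _, f, t, lbl_left, lbl_right = best
    return lambda row: lbl_left if row[f] <= t else lbl_right

def forest_predict(stumps, row):
    """Majority vote across the ensemble smooths out noisy individual trees."""
    return Counter(s(row) for s in stumps).most_common(1)[0][0]
```

Because each tree sees a resampled subset of rows and features, a few outlying samples cannot dominate the vote, which is the robustness property the talk highlights.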

  • Serverless architectures proved effective for genomic search applications (GT-Scan), reducing costs and improving accessibility for researchers compared to traditional server clusters
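
The serverless pattern can be illustrated with a minimal Lambda-style handler (hypothetical code, not GT-Scan's actual implementation; the event shape follows the common API Gateway proxy convention, and the GC-content computation is a stand-in for the real genomic scoring):

```python
import json

def handler(event, context=None):
    """Stateless, per-request function: each genomic query is one short-lived
    invocation, so you pay per request instead of running a cluster 24/7."""
    target = event.get("queryStringParameters", {}).get("sequence", "")
    if not target or set(target) - set("ACGT"):
        return {"statusCode": 400,
                "body": json.dumps({"error": "expected a DNA sequence of A/C/G/T"})}
    # Placeholder for the real work (e.g., scoring candidate target sites).
    gc = sum(base in "GC" for base in target) / len(target)
    return {"statusCode": 200,
            "body": json.dumps({"sequence": target, "gc_content": round(gc, 3)})}
```

For example, `handler({"queryStringParameters": {"sequence": "GATTACA"}})` returns a 200 response, while a malformed sequence returns a 400, all without any server to provision or keep warm.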

  • Jupyter notebooks are crucial for bioinformatics workflows, enabling:

    • Code execution in multiple languages (Python, R, Scala)
    • Interactive visualizations
    • Documentation and reproducibility
    • Easy sharing among researchers
  • Cloud technologies like Amazon EMR and Databricks provide the necessary computational power, but costs need optimization (e.g., cluster costs of $9,000 are too high for frequent testing)

  • Performance testing showed an 80% runtime improvement after optimizing the serverless architecture and carefully evaluating cloud services
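
How a runtime-improvement figure like that is computed (the 100 s and 20 s numbers below are illustrative, not measurements from the talk):

```python
def runtime_improvement(before_s, after_s):
    """Percent reduction in runtime: (before - after) / before * 100."""
    return (before_s - after_s) / before_s * 100

# Illustrative numbers only: an 80% runtime reduction is a 5x speedup.
print(runtime_improvement(100.0, 20.0))  # prints 80.0
```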

  • Integration of visualization tools is critical for understanding complex genomic data and validating machine learning results

  • The field of genomics represents one of the largest big data challenges, with predictions that 50% of the world’s population will be sequenced by 2025

  • Cross-disciplinary collaboration between bioinformaticians and software engineers is essential for building effective genomic data pipelines