Nicolò Giso - From telemetry data to CSVs with Python, Spark and Azure Databricks

Transform telemetry data into CSV files using Python, Spark, and Azure Databricks, and discover how to prepare the data for analysis and machine learning models by filling missing values and removing duplicates.

Key takeaways
  • The presentation is about transforming telemetry data into CSV files using Python, Spark, and Azure Databricks.
  • The goal is to provide a manageable format for data analysis and machine learning models.
  • The data is collected from machinery and devices in the field, and the architecture consists of eight Databricks notebooks, each with a specific task.
  • The data is first brought into a manageable format by the “Pivot” notebook, which transforms JSON lines into CSV files with variables as columns and a UTC timestamp as an index (see the first sketch after this list).
  • The data is then cleansed: missing values are filled, duplicates removed, and the UTC timestamp truncated to the second (see the cleansing sketch below).
  • The cleansed data is then grouped by timestamp and pivoted by variable, and the result is saved to Azure Data Lake (see the pivot sketch below).
  • The “Compute KPI” notebook calculates basic KPIs such as mean, minimum, and maximum, and stores the results in a Cosmos DB collection (see the final sketch below).
  • The presentation highlights the use of Azure Data Factory for orchestration and the integration with Visual Studio Code for local development.
  • The speaker stresses the importance of data cleansing and formatting for accurate analysis and machine learning models.
  • The presentation concludes with the demonstration of the final result and the next steps for improving the solution.
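
A minimal sketch of the JSON-lines-to-tabular step described in the takeaways. The telemetry schema (`device_id`, `variable`, `value`, `ts`) and the storage path are assumptions for illustration, not the actual layout used in the talk:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("telemetry-to-csv").getOrCreate()

# Each input line is one JSON object emitted by a device in the field.
raw = spark.read.json("abfss://telemetry@<account>.dfs.core.windows.net/raw/*.jsonl")

# Keep one row per (device, variable, timestamp) reading, parsing the
# ISO-8601 string into a proper UTC timestamp column.
readings = raw.select(
    "device_id",
    "variable",
    F.col("value").cast("double").alias("value"),
    F.to_timestamp("ts").alias("utc_ts"),
)
```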
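The cleansing pass could look like the sketch below: timestamps are truncated to the second, duplicate readings are dropped, and gaps are forward-filled from the last known value. Column names follow the hypothetical schema of the previous sketch; the talk does not specify the exact fill strategy:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Truncate to one-second resolution and keep a single reading per key.
clean = (
    readings
    .withColumn("utc_ts", F.date_trunc("second", F.col("utc_ts")))
    .dropDuplicates(["device_id", "variable", "utc_ts"])
)

# Forward-fill missing values per device and variable, ordered by time.
w = (
    Window.partitionBy("device_id", "variable")
          .orderBy("utc_ts")
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
clean = clean.withColumn("value", F.last("value", ignorenulls=True).over(w))
```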
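A sketch of the pivot-and-save step, under the same assumed schema; the Data Lake container and path are placeholders:

```python
from pyspark.sql import functions as F

# One row per device and timestamp, one column per variable.
pivoted = (
    clean.groupBy("device_id", "utc_ts")
         .pivot("variable")
         .agg(F.first("value"))
         .orderBy("utc_ts")
)

# Write the result to Azure Data Lake Storage Gen2 as CSV with a header row.
(
    pivoted.write.mode("overwrite")
           .option("header", True)
           .csv("abfss://curated@<account>.dfs.core.windows.net/csv/")
)
```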
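Finally, a sketch of the KPI computation and the Cosmos DB write. It assumes the Azure Cosmos DB Spark 3 connector (`cosmos.oltp` format) is installed on the cluster; the endpoint, key, database, and container names are placeholders:

```python
from pyspark.sql import functions as F

# Basic per-device, per-variable KPIs: mean, minimum, and maximum.
kpis = clean.groupBy("device_id", "variable").agg(
    F.mean("value").alias("mean"),
    F.min("value").alias("min"),
    F.max("value").alias("max"),
)

# The connector requires a string "id" column; derive one from the grouping key.
kpis = kpis.withColumn("id", F.concat_ws("-", "device_id", "variable"))

# Store the results in a Cosmos DB collection.
(
    kpis.write.format("cosmos.oltp")
        .option("spark.cosmos.accountEndpoint", "<endpoint>")
        .option("spark.cosmos.accountKey", "<key>")
        .option("spark.cosmos.database", "telemetry")
        .option("spark.cosmos.container", "kpis")
        .mode("append")
        .save()
)
```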