Arthur Andres - Unified batch and stream processing in Python | PyData Global 2023

Discover Beavers, a Python library for unified batch and stream processing, and learn how it simplifies data flow orchestration with reusable code, flexible data representation, and zero-copy concatenation.

Key takeaways
  • Batch and stream processing in Python can be cumbersome to manage.
  • Existing libraries for streaming are not suitable for many industries.
  • Beavers is a Python library that allows for unified batch and stream processing.
  • Beavers lets the same code be reused across batch and stream jobs.
  • Beavers uses Kafka as a message broker and Arrow for in-memory data representation.
  • Beavers can be used to create a DAG (directed acyclic graph) to orchestrate data flow.
  • Beavers includes stream nodes, computational nodes, and sink nodes.
  • Stream nodes hold ephemeral events, computational nodes execute Python functions, and sink nodes provide output.
  • Beavers can be used to replay messages from Kafka topics in the correct order.
  • Beavers allows for flexible data representation and can be used with various serialization formats.
  • Beavers provides a type-safe environment and fast zero-copy concatenation of Arrow tables.
  • Beavers can be used to integrate with existing data sources and systems.
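
To make the DAG and replay ideas above more concrete, here is a minimal sketch in plain Python. It is not the Beavers API: the names `Message`, `Node`, `Dag`, and `replay`, and the "orders" and "refunds" topics, are all hypothetical and exist only for illustration. It assumes each topic is already ordered by timestamp, merges the topics into one time-ordered stream, and pushes each message through source, stream, and sink nodes one cycle at a time.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Message:
    """Hypothetical message type; a real Kafka record carries more fields."""
    timestamp: int  # event time, e.g. epoch milliseconds
    topic: str
    value: float


def replay(*topics: Iterable[Message]) -> Iterator[Message]:
    """Merge per-topic streams (each already time-ordered) into one ordered stream."""
    return heapq.merge(*topics, key=lambda m: m.timestamp)


@dataclass
class Node:
    """A DAG node that applies `func` to the current values of its parents."""
    func: Callable[..., object]
    parents: list["Node"] = field(default_factory=list)
    value: object = None

    def update(self) -> None:
        self.value = self.func(*(p.value for p in self.parents))


class Dag:
    def __init__(self) -> None:
        self.sources: dict[str, Node] = {}
        self.nodes: list[Node] = []  # creation order is a valid topological order

    def source(self, topic: str) -> Node:
        node = Node(func=lambda: [], value=[])
        self.sources[topic] = node
        return node

    def stream(self, func: Callable[..., list], *parents: Node) -> Node:
        node = Node(func, list(parents), value=[])
        self.nodes.append(node)
        return node

    def sink(self, parent: Node) -> list:
        collected: list = []

        def collect(events: list) -> None:
            collected.extend(events)

        self.stream(collect, parent)
        return collected

    def run(self, messages: Iterable[Message]) -> None:
        for msg in messages:
            # Stream values are ephemeral: each cycle sees only the new event.
            for topic, node in self.sources.items():
                node.value = [msg] if msg.topic == topic else []
            for node in self.nodes:
                node.update()


dag = Dag()
orders = dag.source("orders")
refunds = dag.source("refunds")
amounts = dag.stream(
    lambda o, r: [m.value for m in o] + [-m.value for m in r], orders, refunds
)
output = dag.sink(amounts)

orders_topic = [Message(1, "orders", 10.0), Message(4, "orders", 5.0)]
refunds_topic = [Message(2, "refunds", 3.0)]
dag.run(replay(orders_topic, refunds_topic))
print(output)  # [10.0, -3.0, 5.0]
```

Because `Dag.run` accepts any iterable of messages, the same graph definition could be fed a recorded batch, as here, or a live generator, which is the kind of reuse the takeaways describe.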