Data Streaming? I don't even know her
by Julien Contarin

Learn how data streaming powers modern applications, from Apache Kafka basics to emerging standards. Discover best practices for real-time data processing & architecture.

Key takeaways
  • Apache Kafka remains the de facto open standard for event streaming, powering most modern event-driven applications

  • Key components of modern data streaming architecture:

    • Stream: Real-time messaging and data transport (see the producer sketch after this list)
    • Connect: Integration with databases, SaaS solutions and other systems
    • Govern: Schema management, security, lineage tracking
    • Process: Data transformation and enrichment
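
To make the Stream layer concrete, here is a minimal Java producer sketch using Kafka's client API. The broker address, the `orders` topic, and the payload are illustrative assumptions, not details from the original talk.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker address and topic name are placeholders for illustration.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event; the record key drives partition assignment,
            // so events with the same key stay ordered.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"status\":\"created\"}"));
        }
    }
}
```
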
  • Storage costs have dropped significantly in cloud environments, but compute remains expensive; optimization efforts should therefore focus on compute usage

  • A shift-left approach is recommended for data processing: handle transformations upstream, close to where data is produced, rather than downstream (see the Kafka Streams sketch below)
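
A minimal sketch of shift-left processing with Kafka Streams, assuming hypothetical `orders.raw` and `orders.curated` topics: the cleanup happens once, next to the producer, instead of being repeated in every downstream pipeline.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class ShiftLeftCleanser {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "orders-cleanser");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Clean and normalize records once, near production, so every
        // downstream consumer reads the already-curated topic.
        KStream<String, String> raw = builder.stream("orders.raw");
        raw.filter((key, value) -> value != null && !value.isBlank())
           .mapValues(value -> value.trim().toLowerCase())
           .to("orders.curated");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```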

  • Data products should be:

    • Discoverable through catalogs
    • Schema-governed (see the Avro sketch after this list)
    • Producer-owned
    • Available to consumers in real time
    • Secured and properly governed
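
One way to make a topic schema-governed in practice is to serialize every record against a registered contract. The sketch below assumes Confluent's Schema Registry and a hypothetical `Order` schema; the registry URL, topic name, and field layout are placeholders.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GovernedProducer {
    // A producer-owned contract: consumers discover this schema through
    // the registry instead of guessing at payload shapes.
    private static final String SCHEMA_JSON = """
        {"type":"record","name":"Order","fields":[
          {"name":"id","type":"string"},
          {"name":"amount","type":"double"}]}""";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // KafkaAvroSerializer registers and validates the schema on produce.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "order-42");
        order.put("amount", 19.99);

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders.curated", "order-42", order));
        }
    }
}
```
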
  • Modern data architecture considerations:

    • Multi-tenancy support
    • Quota management (see the AdminClient sketch after this list)
    • Cost optimization through elastic scaling
    • Integration with analytical and operational systems
    • Support for both real-time and batch processing
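
Quota management is a first-class Kafka feature. Here is a sketch using the AdminClient quota API, with a hypothetical `tenant-a` client ID and an assumed 1 MB/s cap in each direction:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class TenantQuotas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Cap a hypothetical tenant's client at ~1 MB/s in and out
            // so one noisy workload cannot starve its neighbors.
            ClientQuotaEntity tenant = new ClientQuotaEntity(
                Map.of(ClientQuotaEntity.CLIENT_ID, "tenant-a"));
            ClientQuotaAlteration alteration = new ClientQuotaAlteration(tenant, List.of(
                new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0),
                new ClientQuotaAlteration.Op("consumer_byte_rate", 1_048_576.0)));
            admin.alterClientQuotas(List.of(alteration)).all().get();
        }
    }
}
```
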
  • Emerging standards and technologies:

    • Apache Iceberg for table formats
    • Apache Flink for stream processing (see the sketch after this list)
    • Kafka Connect for standardized integrations
    • KRaft replacing ZooKeeper for cluster metadata
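
As a taste of Flink's role in that stack, here is a minimal DataStream job in Java. The inline elements stand in for a real Kafka-backed source, and the enrichment logic is purely illustrative.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkEnrichJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A stand-in source; a real job would read from Kafka instead.
        DataStream<String> orders = env.fromElements("order-1", "order-2", "order-3");

        // Stateless enrichment: tag each event before it is written out.
        orders.map(order -> order + ":enriched")
              .print();

        env.execute("flink-enrich-sketch");
    }
}
```
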
  • Focus shifting from just analytical data products to universal data products that serve both operational and analytical needs

  • Cloud-native services should provide:

    • Automatic scaling
    • Cost-effective resource utilization
    • Managed infrastructure
    • Built-in high availability
  • Data streaming is becoming foundational for modern use cases including real-time analytics, AI/ML, and operational applications