Guillaume-Bert & Spektor - Safe, fast, and easy time series preprocessing with Temporian | SciPy 2024

Learn about Temporian, a Python library for time series preprocessing that prevents future data leakage and achieves high performance through a C++ core. See examples and best practices.

Key takeaways
  • Temporian is a Python library for safe, simple, and efficient preprocessing of temporal data, developed collaboratively by Google and Tryolabs

  • Key features include:

    • Prevention of future data leakage: operators do not look at future data unless leakage is requested explicitly
    • High-performance C++ core implementation
    • Support for different temporal data types (time series, sequences, multivariate data)
    • Native handling of hierarchical/indexed data
    • Integration with common ML/data science tools
  • Data is handled through “event sets”, the core data structure that unifies different temporal data types

    • Supports various timestamp formats and value types (int, float, boolean, string)
    • Preserves hierarchical structure of data
    • Enables efficient operations through memory optimizations
  • Operations are chainable with a functional API:

    • No side effects or modifications to original data
    • Each operation returns new event sets
    • Includes moving windows, resampling, aggregations
    • Supports arbitrary Python functions through the map operator
  • Current limitations and status:

    • Version 0.6 (pre-1.0)
    • Single-threaded execution (multithreading planned)
    • No native C++ interface yet
    • Focused on local execution, with Apache Beam integration possible for large-scale processing
  • Design philosophy emphasizes:

    • Safety over speed for preprocessing operations
    • Familiar API similar to pandas
    • Minimal development and maintenance costs
    • Unix philosophy of doing one thing well
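
The ideas above (indexed event sets, operations that return new data rather than mutating it, and windows that only look backward in time) can be illustrated with a small pure-Python sketch. This is not Temporian's API: `MiniEventSet`, `trailing_sum`, and the sample store data are hypothetical names invented for illustration, and the real library implements these operations in C++ with a much richer interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical toy type, NOT Temporian's EventSet: maps an index key
# (e.g. a store id) to a tuple of (timestamp, value) events.
Events = Tuple[Tuple[float, float], ...]


@dataclass(frozen=True)
class MiniEventSet:
    data: Dict[str, Events]

    def trailing_sum(self, window: float) -> "MiniEventSet":
        """For each event at time t, sum values with timestamps in (t - window, t].

        Only past and present events contribute, so no future data leaks in.
        Each index key is processed independently.
        """
        out = {}
        for key, events in self.data.items():
            out[key] = tuple(
                (t, sum(v for (s, v) in events if t - window < s <= t))
                for (t, _) in events
            )
        return MiniEventSet(out)

    def map(self, fn: Callable[[float], float]) -> "MiniEventSet":
        """Apply an arbitrary Python function to every value, returning a new set."""
        return MiniEventSet(
            {key: tuple((t, fn(v)) for (t, v) in events)
             for key, events in self.data.items()}
        )


# Made-up sample data: two stores with irregularly spaced events.
sales = MiniEventSet({
    "store_a": ((1.0, 10.0), (2.0, 20.0), (4.0, 5.0)),
    "store_b": ((1.0, 1.0), (3.0, 2.0)),
})

# Operations chain, and `sales` itself is left unchanged.
result = sales.trailing_sum(window=2.0).map(lambda v: v * 2)

print(result.data["store_a"])  # ((1.0, 20.0), (2.0, 60.0), (4.0, 10.0))
print(result.data["store_b"])  # ((1.0, 2.0), (3.0, 4.0))
```

The half-open window `(t - window, t]` is a design choice of this sketch. Whatever the boundary convention, the point is that the value computed at time `t` depends only on events at or before `t`.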