Merve Noyan - Keynote: Open-source Multimodal AI in the Wild | PyData Amsterdam 2024


Explore open-source multimodal AI models rivaling proprietary solutions, with insights on deployment, PEFT techniques, and practical applications in vision-language tasks.

Key takeaways
  • Open-source multimodal AI models are catching up with closed-source alternatives, offering comparable performance, especially on vision-language tasks

  • Vision-language models can handle multiple tasks, including object detection, segmentation, image classification, and document AI, without being restricted to a fixed set of classes
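
For example, zero-shot object detectors take free-form text labels at inference time instead of a fixed class list. A minimal sketch, assuming the Hugging Face Transformers zero-shot-object-detection pipeline and the public OWLv2 checkpoint (the image path and labels are placeholders):

```python
from transformers import pipeline

# Zero-shot detection: candidate labels are free-form text, not a fixed class list.
detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlv2-base-patch16-ensemble",
)
results = detector(
    "street.jpg",                                   # placeholder image path
    candidate_labels=["bicycle", "traffic light", "person with a backpack"],
)
for r in results:
    print(r["label"], round(r["score"], 2), r["box"])
```

Because the labels are plain text, the same model can be pointed at new object categories without any retraining.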

  • Key open-source models discussed:

    • PaliGemma (Google) - Supports multilingual understanding
    • OWL-ViT (Google) - Zero-shot object detection
    • Qwen2-VL (Alibaba) - Strong performance in multimodal tasks
    • LLaVA - Changed the paradigm for multimodal retrieval
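
As a sketch of how one checkpoint covers several tasks, PaliGemma switches between captioning, question answering, and detection purely by prompt. This assumes the public google/paligemma-3b-mix-224 checkpoint and the Transformers API; the image path and prompts are placeholder choices:

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"            # "mix" checkpoint trained on several tasks
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("photo.jpg")                     # placeholder image path
# The task is selected by the text prompt, not by a task-specific head.
for prompt in ["caption en", "detect dog", "answer en how many dogs are there?"]:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=50)
    prompt_len = inputs["input_ids"].shape[-1]
    print(prompt, "->", processor.decode(out[0][prompt_len:], skip_special_tokens=True))
```

The language code in the prompt (here "en") can be swapped for other supported languages, which is what the multilingual point above refers to.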
  • Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA and QLoRA enable model customization with fewer resources:

    • Add adapters on top of base models
    • Support 4-bit and 8-bit quantization
    • Reduce memory requirements
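
A minimal QLoRA-style sketch using the peft and bitsandbytes libraries; the checkpoint name, rank, and target modules below are placeholder choices, not values from the talk:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit (bitsandbytes), then attach a small LoRA adapter on top.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "your-org/your-base-model",                # placeholder checkpoint id
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora = LoraConfig(
    r=8,                                       # adapter rank: small -> few trainable parameters
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],       # which attention projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()             # typically well under 1% of the base model
```

Only the adapter weights are trained and saved; the quantized base model stays frozen, which is where the memory savings come from.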
  • Document AI applications benefit from multimodal models by:

    • Processing text and images together
    • Understanding layout and spatial relationships
    • Avoiding brittle OCR-only approaches
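
One concrete pattern is OCR-free document question answering, where the model reads the page image directly and keeps layout information. A sketch using the Transformers document-question-answering pipeline with the public Donut DocVQA checkpoint (the image path and question are placeholders):

```python
from transformers import pipeline

# Donut is an OCR-free document model: it consumes the page image directly,
# so layout and visual cues are preserved instead of depending on a separate OCR step.
doc_qa = pipeline(
    "document-question-answering",
    model="naver-clova-ix/donut-base-finetuned-docvqa",
)
answer = doc_qa(image="invoice.png", question="What is the total amount?")
print(answer)
```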
  • Deployment and optimization options include:

    • Local deployment for privacy
    • Browser-based inference
    • Quantization for reduced resource usage
    • Token streaming support
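
Token streaming, for instance, is available directly in Transformers for locally deployed models. A sketch assuming a small instruction-tuned checkpoint (the model id and prompt are placeholder choices):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "HuggingFaceTB/SmolLM-1.7B-Instruct"     # placeholder small model for local use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# TextStreamer prints tokens to stdout as soon as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True)
inputs = tokenizer("Why does on-device inference help with privacy?", return_tensors="pt").to(model.device)
model.generate(**inputs, streamer=streamer, max_new_tokens=80)
```

Browser-based inference follows the same local-first idea via transformers.js, and quantized weights (as in the PEFT sketch above) further reduce the resource footprint.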
  • Important considerations for model selection:

    • Check licenses (MIT/Apache 2.0 preferred)
    • Consider hardware restrictions
    • Evaluate model size vs. performance needs
    • Match models to specific use cases
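
Some of these checks can be scripted against the Hugging Face Hub before committing to a model. A sketch using huggingface_hub (the repo id is just an example):

```python
from huggingface_hub import model_info

info = model_info("google/owlv2-base-patch16-ensemble")     # example repo id
print([t for t in info.tags if t.startswith("license:")])   # e.g. ['license:apache-2.0']
print(info.pipeline_tag)                                    # intended task for the checkpoint
```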
  • Development tools and frameworks highlighted:

    • Hugging Face Transformers
    • Text Generation Inference
    • smol-vision library
    • bitsandbytes for quantization
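
For serving, a common pattern is to run a model behind Text Generation Inference and stream tokens from Python with huggingface_hub's InferenceClient. This assumes a TGI server is already running locally; the URL and prompt are placeholders:

```python
from huggingface_hub import InferenceClient

# Assumes a Text Generation Inference server is already running locally,
# e.g. via the official docker image (ghcr.io/huggingface/text-generation-inference).
client = InferenceClient("http://localhost:8080")

for token in client.text_generation(
    "Explain LoRA in one sentence.",
    max_new_tokens=60,
    stream=True,          # TGI streams tokens back as they are produced
):
    print(token, end="", flush=True)
```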