Merve Noyan - Keynote: Open-source Multimodal AI in the Wild | PyData Amsterdam 2024
Explore open-source multimodal AI models rivaling proprietary solutions, with insights on deployment, PEFT techniques, and practical applications in vision-language tasks.
- Open-source multimodal AI models are catching up with closed-source alternatives, offering comparable performance, especially in vision-language tasks
- Visual language models can handle multiple tasks, including object detection, segmentation, image classification, and document AI, without being restricted to a predefined set of classes
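Because the task is expressed in the prompt rather than in a fixed classification head, one model can be steered toward several of these tasks. Below is a minimal sketch using the Transformers `image-text-to-text` pipeline (requires a recent transformers release); the checkpoint, image URL, and prompts are illustrative assumptions, not taken from the talk:

```python
# Sketch: one vision-language model, several tasks, all selected via the prompt.
# The checkpoint and image URL below are assumptions; any chat-style VLM works.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

image_url = "https://example.com/street_scene.jpg"  # placeholder image URL

for task_prompt in [
    "Describe this image in one sentence.",                 # captioning
    "Is there a bicycle in this image? Answer yes or no.",  # open-vocabulary classification
    "What does the shop sign say?",                         # text reading / document-style QA
]:
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": task_prompt},
        ],
    }]
    out = pipe(text=messages, max_new_tokens=64, return_full_text=False)
    print(task_prompt, "->", out[0]["generated_text"])
```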
- Key open-source models discussed (a zero-shot detection sketch follows this list):
  - PaliGemma (Google) - supports multilingual understanding
  - OWL-ViT / OWLv2 (Google) - zero-shot object detection
  - Qwen2-VL (Alibaba) - strong performance across multimodal tasks
  - LLaVA - changed the paradigm for multimodal retrieval
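As a concrete example of the zero-shot detection point, here is a sketch using the Transformers `zero-shot-object-detection` pipeline with an OWL-ViT checkpoint; the image URL and candidate labels are placeholders:

```python
# Sketch: zero-shot object detection. The candidate labels are free-form text,
# so no fixed class list is baked into the detector.
from transformers import pipeline

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")

results = detector(
    "https://example.com/desk_photo.jpg",  # placeholder image URL
    candidate_labels=["laptop", "coffee cup", "notebook", "cat"],
)
for det in results:
    print(f"{det['label']}: score={det['score']:.2f}, box={det['box']}")
```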
- Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA and QLoRA enable model customization with fewer resources, as sketched after this list:
  - Add adapters on top of base models
  - Support 4-bit and 8-bit quantization
  - Reduce memory requirements
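A minimal sketch of that recipe with the `peft` and `bitsandbytes` libraries: load the base model in 4-bit, then train only small LoRA adapters on top. The base checkpoint, target modules, and hyperparameters are illustrative assumptions; the same pattern applies to vision-language backbones.

```python
# Sketch: QLoRA-style fine-tuning setup -- 4-bit base weights plus LoRA adapters.
# Checkpoint and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-360M-Instruct",  # assumed small base model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # assumed attention projection names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable
```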
- Document AI applications benefit from multimodal models (see the OCR-free sketch after this list) by:
  - Processing text and images together
  - Understanding layout and spatial relationships
  - Avoiding brittle OCR-only pipelines
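To illustrate the OCR-free angle, here is a sketch with the Transformers document-question-answering pipeline and a Donut checkpoint, which reads the page image directly rather than running a separate OCR step; the image path and question are placeholders:

```python
# Sketch: OCR-free document question answering. The model consumes the page
# image directly, so layout and text are interpreted jointly.
from transformers import pipeline

doc_qa = pipeline(
    "document-question-answering",
    model="naver-clova-ix/donut-base-finetuned-docvqa",
)

answer = doc_qa(
    image="invoice.png",                  # placeholder: path or URL to a document image
    question="What is the invoice total?",
)
print(answer)                             # e.g. [{'answer': '...'}]
```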
- Deployment and optimization options include (a token-streaming sketch follows this list):
  - Local deployment for privacy
  - Browser-based inference
  - Quantization for reduced resource usage
  - Token streaming support
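Token streaming in particular is straightforward with the Transformers `TextStreamer`, which prints tokens as they are generated; the checkpoint below is an assumed small model, and the same pattern works with quantized weights:

```python
# Sketch: stream tokens as they are generated instead of waiting for the
# full completion.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "HuggingFaceTB/SmolLM2-360M-Instruct"   # assumed small checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

streamer = TextStreamer(tokenizer, skip_prompt=True)
inputs = tokenizer("Explain what a vision-language model is.", return_tensors="pt")
model.generate(**inputs, streamer=streamer, max_new_tokens=64)
```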
- Important considerations for model selection (a metadata-check sketch follows this list):
  - Check licenses (MIT/Apache 2.0 preferred)
  - Consider hardware restrictions
  - Evaluate model size vs. performance needs
  - Match models to specific use cases
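The license check can be scripted before pulling any weights; a small sketch with `huggingface_hub` (the repo id is only an example):

```python
# Sketch: inspect a model's Hub metadata (license tag, task) before adopting it.
from huggingface_hub import model_info

info = model_info("google/owlvit-base-patch32")   # example repo id
license_tags = [tag for tag in info.tags if tag.startswith("license:")]
print(license_tags)        # e.g. ['license:apache-2.0']
print(info.pipeline_tag)   # e.g. 'zero-shot-object-detection'
```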
- Development tools and frameworks highlighted (a TGI client sketch follows this list):
  - Hugging Face Transformers
  - Text Generation Inference (TGI)
  - smol-vision
  - bitsandbytes for quantization
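As one way these pieces fit together, a Text Generation Inference server can be queried with `huggingface_hub`'s `InferenceClient`; the endpoint below assumes a TGI instance already running locally and is only an example:

```python
# Sketch: stream a response from a locally running Text Generation Inference
# (TGI) server. Assumes TGI was launched separately on localhost:8080.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumed local TGI endpoint

for token in client.text_generation(
    "What can open-source vision-language models do?",
    max_new_tokens=64,
    stream=True,
):
    print(token, end="", flush=True)
```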