Marzieh Fadaee - Keynote: The Art of Language: Mastering Multilingual Challenges in LLMs

Explore the complexities of building multilingual LLMs with Marzieh Fadaee. Learn about data challenges, cultural nuances, and successful strategies from the AYA project.

Key takeaways
  • Building multilingual language models raises distinct challenges around cultural context, translation complexity, and uneven data quality and availability across languages

  • The AYA project created one of the largest open multilingual instruction datasets, covering 65 languages through community-driven data collection involving over 3,000 people from 119 countries

  • Language models often perform worse on low-resource languages and can exhibit catastrophic forgetting when new languages are added, so language coverage has to be balanced carefully against model quality

  • Evaluating multilingual models is particularly challenging because of:

    • The need for culturally appropriate benchmarks in each language
    • The difficulty of comparing performance across languages
    • The limited availability of human evaluators for many languages
    • Translation artifacts that distort results
  • Cross-lingual transfer can be positive (languages help each other improve) or negative (performance in one language degrades when another is added)

  • Critical challenges in multilingual LLM development:

    • Data collection and quality for low-resource languages
    • Handling cultural nuances and biases
    • Balancing general vs. language-specific knowledge
    • Privacy and ethical considerations around data usage
    • Model transparency and accountability
  • Community involvement of native speakers is essential for:

    • Creating high-quality training data
    • Designing appropriate evaluation benchmarks
    • Understanding cultural context
    • Ensuring proper representation of languages
  • Open science and transparency in multilingual model development help advance the field by letting others identify issues and build improvements
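
To make the takeaways on cross-lingual transfer and cross-language comparison more concrete, here is a minimal Python sketch of one way to check which languages gained or lost after new languages were added to a model. It is an illustration only, not the method used in the talk or in the AYA project. The function names, the tolerance value, the language codes, and every score below are made-up placeholders.

```python
from typing import Dict


def transfer_labels(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 0.5,  # assumed noise margin, in score points
) -> Dict[str, str]:
    """Label each language by how its score moved after new languages were added."""
    labels = {}
    for lang, old_score in before.items():
        if lang not in after:
            continue  # language was not re-evaluated; skip it
        delta = after[lang] - old_score
        if delta > tolerance:
            labels[lang] = "positive"   # the language benefited
        elif delta < -tolerance:
            labels[lang] = "negative"   # the language degraded
        else:
            labels[lang] = "neutral"    # change is within the noise margin
    return labels


def macro_average(scores: Dict[str, float]) -> float:
    """Average per-language scores so every language counts equally."""
    return sum(scores.values()) / len(scores)


# Placeholder scores for illustration only; these are not real results.
scores_before = {"sw": 40.0, "fa": 55.0, "en": 80.0}
scores_after = {"sw": 46.0, "fa": 51.0, "en": 80.2}

print(transfer_labels(scores_before, scores_after))
print(macro_average(scores_after))
```

A macro-average weights each language equally, so a drop in a low-resource language is not hidden behind strong scores in high-resource ones. It does not make the scores themselves comparable across languages; that still depends on benchmarks that are appropriate for each language.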