Ivan Moshkov & Daria Gitman - How to Build an LLM for Math Reasoning without Proprietary Data?

Discover how to build large language models for mathematical reasoning using open-source data, synthetic datasets, and training techniques that approach GPT-4-level results.

Key takeaways
  • Building LLMs for math reasoning without proprietary data requires generating synthetic datasets with open-source models and then fine-tuning on them

  • Key datasets used were GSM-8K (grade-school math) with 7.5K training samples and MATH (university-level math) with 7.5K training samples across different math topics

  • Three main solution approaches were explored:

    • Text-based solutions (human-readable)
    • Code-based solutions (using Python)
    • Code interpreter style (combining text reasoning with executable code)
  • The code interpreter approach worked best by allowing models to:

    • Write natural text explanations
    • Execute Python code for calculations
    • Return execution results to the model so it can continue reasoning (a sketch of this loop appears after this list)
  • Model development pipeline involved:

    • Pre-training on a large general corpus
    • Supervised fine-tuning on math problems
    • Chat fine-tuning for assistant-like behavior
  • Techniques for improving results:

    • Using few-shot demonstrations to guide solution format
    • Handling arithmetic errors through code execution
    • Filtering out “cheating” solutions that just copy answers
    • Generating multiple solutions per problem (128-256) for diversity (see the generation-and-filtering sketch after this list)
  • Achieved competitive results without using proprietary OpenAI data:

    • Comparable performance to leading models on GSM-8K
    • Within reach of GPT-4 on the MATH dataset
  • A custom Data Explorer tool was developed for:

    • Visualizing and analyzing model outputs
    • Identifying common error patterns
    • Streamlining inference and evaluation
    • Supporting LLM-specific data analysis needs
  • Key challenges included:

    • Getting models to show reasoning vs just outputting answers
    • Handling arithmetic mistakes in pure text solutions
    • Ensuring diversity in synthetic training data
    • Creating isolated sandboxes for safe code execution (a sandbox sketch appears after this list)
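
To make the code-interpreter approach concrete, here is a minimal sketch of the text-code-execute loop it describes. The delimiter tokens and the `generate`/`run_code` callables are illustrative assumptions, not the exact format or API from the talk.

```python
CODE_START, CODE_END = "<llm-code>", "</llm-code>"
OUT_START, OUT_END = "<llm-code-output>", "</llm-code-output>"

def solve_with_code_interpreter(problem, generate, run_code, max_rounds=4):
    """Alternate between model generation and Python execution.

    `generate(prompt, stop)` and `run_code(code)` are assumed interfaces:
    the first calls the model and stops generating at `stop`, the second
    executes Python in a sandbox and returns its stdout.
    """
    prompt = problem
    for _ in range(max_rounds):
        # Let the model write natural-language reasoning and, optionally,
        # open a code block.
        completion = generate(prompt, stop=CODE_END)
        prompt += completion
        if CODE_START not in completion:
            return prompt  # the model finished with a text-only answer
        # Execute the code block and feed the output back for further reasoning.
        code = completion.split(CODE_START, 1)[1]
        output = run_code(code)
        prompt += f"{CODE_END}\n{OUT_START}\n{output}\n{OUT_END}\n"
    return prompt
```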
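The data-generation bullets (few-shot prompting, 128-256 samples per problem, filtering out answer-copying) can be pictured as a simple generate-then-filter loop. The helper names, dataset keys, and the cheating heuristic below are illustrative assumptions, not the authors' actual pipeline.

```python
def build_synthetic_dataset(problems, sample_solutions, extract_answer,
                            num_samples=128):
    """Generate many candidate solutions per problem and keep the good ones.

    `sample_solutions(question, n)` is an assumed helper that prompts an
    open-source model with few-shot demonstrations and returns `n` candidate
    solutions; `extract_answer(solution)` pulls out the final (code-executed)
    answer. Both are placeholders rather than a real API.
    """
    dataset = []
    for problem in problems:
        reference = problem["answer"]
        for solution in sample_solutions(problem["question"], num_samples):
            # Keep only solutions whose executed answer matches the ground truth.
            if extract_answer(solution) != reference:
                continue
            # Drop "cheating" solutions that just print the reference answer
            # instead of computing it (a deliberately crude heuristic).
            if f"print({reference})" in solution.replace(" ", ""):
                continue
            dataset.append({"question": problem["question"],
                            "solution": solution})
    return dataset
```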
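Finally, a rough sketch of the sandboxed-execution challenge: running model-generated Python in a separate process with a timeout. This is one simple isolation strategy; the talk does not spell out the exact mechanism, and a production setup would add stricter limits (filesystem, network, memory).

```python
import subprocess
import sys

def run_in_sandbox(code: str, timeout_s: float = 10.0) -> str:
    """Execute untrusted Python in a separate process and capture stdout.

    Minimal illustration only: real isolation would also restrict the
    filesystem, network, and memory (e.g. containers or seccomp).
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "Execution timed out"
    return result.stdout if result.returncode == 0 else result.stderr

# Example: evaluate a generated snippet safely.
print(run_in_sandbox("total = 3 * 4\nprint(20 - total)"))  # -> 8
```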