Ivan Moshkov & Daria Gitman - How to Build an LLM for Math Reasoning without Proprietary Data?
Discover how to build large language models for mathematical reasoning using open-source data, synthetic datasets, and innovative training approaches to achieve GPT-4 level results.
- Building LLMs for math reasoning without proprietary data requires generating synthetic datasets with open-source models and applying standard fine-tuning techniques
- Key datasets were GSM8K (grade-school math) with 7.5K training samples and MATH (university-level math) with 7.5K samples across different math topics; a minimal loading sketch follows
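
Both benchmarks are publicly distributed; as a point of reference, here is a minimal loading sketch using the Hugging Face `datasets` library. The hub IDs are the commonly used ones and are assumptions here, not confirmed by the talk.

```python
# Minimal sketch: pull the two benchmarks from the Hugging Face Hub.
# The dataset IDs are assumptions (commonly used mirrors); availability may vary.
from datasets import load_dataset

gsm8k = load_dataset("openai/gsm8k", "main")          # ~7.5K grade-school word problems in the train split
math_ds = load_dataset("hendrycks/competition_math")  # 7.5K train problems split by topic and difficulty level

print(gsm8k["train"][0]["question"])   # natural-language problem statement
print(math_ds["train"][0]["problem"])  # LaTeX-formatted problem statement
```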
- Three main solution approaches were explored (each illustrated below):
  - Text-based solutions (human-readable)
  - Code-based solutions (using Python)
  - Code-interpreter style (combining text reasoning with executable code)
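
To make the three formats concrete, here is a toy problem rendered in each style; the problem, the wording, and the `<llm-code>` / `<llm-code-output>` delimiters are illustrative assumptions, not the exact templates from the talk.

```python
# Illustrative only: one toy problem solved in each of the three formats.
problem = "A shelf holds 3 boxes with 12 pencils each. 5 pencils are removed. How many remain?"

# 1. Text-based: human-readable reasoning, prone to arithmetic slips.
text_solution = (
    "There are 3 * 12 = 36 pencils in total. "
    "After removing 5, 36 - 5 = 31 pencils remain. The answer is 31."
)

# 2. Code-based: the whole solution is a Python program.
code_solution = """\
total = 3 * 12
remaining = total - 5
print(remaining)  # 31
"""

# 3. Code-interpreter style: text reasoning with an embedded, executable block
#    whose output is fed back into the solution (delimiters are hypothetical).
interpreter_solution = """\
First find how many pencils the shelf holds, then subtract the removed ones.
<llm-code>
total = 3 * 12
print(total - 5)
</llm-code>
<llm-code-output>
31
</llm-code-output>
So 31 pencils remain. The answer is 31.
"""
```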
 
- The code-interpreter approach worked best by allowing models to:
  - Write natural text explanations
  - Execute Python code for calculations
  - Feed results back to continue reasoning (see the loop sketch after this list)
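
A minimal sketch of how such a loop can be wired up, assuming the model wraps code in `<llm-code>` tags and that `generate(prompt, stop)` stands in for any LLM call that halts at a given stop sequence; both the tags and the helper are assumptions for illustration.

```python
import subprocess

CODE_START, CODE_END = "<llm-code>", "</llm-code>"
OUT_START, OUT_END = "<llm-code-output>", "</llm-code-output>"

def run_python(code: str, timeout: int = 10) -> str:
    """Execute a generated snippet in a separate process and capture its output."""
    result = subprocess.run(["python", "-c", code],
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout if result.returncode == 0 else result.stderr

def solve(problem: str, generate, max_rounds: int = 3) -> str:
    """Alternate between model generation and code execution.

    `generate(prompt, stop)` is a placeholder for an LLM call that returns the
    newly generated text, stopping when it emits the stop sequence.
    """
    transcript = problem + "\n"
    for _ in range(max_rounds):
        chunk = generate(transcript, stop=CODE_END)
        transcript += chunk
        if CODE_START not in chunk:
            break  # the model finished with a plain-text answer
        code = chunk.split(CODE_START, 1)[1]
        output = run_python(code)
        # Feed the execution result back so the model can continue reasoning.
        transcript += f"{CODE_END}\n{OUT_START}\n{output}{OUT_END}\n"
    return transcript
```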
 
- The model development pipeline involved:
  - Pre-training on a large general corpus
  - Supervised fine-tuning on math problems (a loss-masking sketch follows this list)
  - Chat fine-tuning for assistant-like behavior
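
For the supervised fine-tuning step, one standard recipe is to concatenate problem and solution and mask the problem tokens out of the loss, so the model only learns to produce solutions. A minimal sketch under that assumption; the prompt template and the `gpt2` tokenizer are placeholders, not the talk's actual setup.

```python
# Sketch: build one SFT example with prompt tokens masked out of the loss.
# -100 is the label value ignored by PyTorch's cross-entropy loss.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder; a larger base model in practice

def build_sft_example(problem: str, solution: str, max_len: int = 1024) -> dict:
    prompt = f"Question: {problem}\nSolution:"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    solution_ids = tokenizer(" " + solution + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + solution_ids)[:max_len]
    # Only solution tokens contribute to the training loss.
    labels = ([-100] * len(prompt_ids) + solution_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```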
 
- Techniques for improving results:
  - Using few-shot demonstrations to guide the solution format
  - Handling arithmetic errors through code execution
  - Filtering out “cheating” solutions that just copy the reference answer
  - Generating multiple solutions per problem (128-256) for diversity (a sampling-and-filtering sketch follows this list)
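
A rough sketch of that sampling-and-filtering step: draw many candidates per problem, drop exact duplicates and solutions that hard-code the reference answer without computing anything, and keep only those that reach the correct result. `sample_solutions` stands in for sampling from an open-source model at non-zero temperature, and the cheating heuristic and answer extractor are simplified assumptions, not the talk's actual filters.

```python
import re

def extract_answer(solution: str) -> str:
    """Toy answer extractor: take the last number mentioned in the solution."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else ""

def is_cheating(solution: str, reference_answer: str) -> bool:
    """Flag solutions that state the reference answer without any computation."""
    has_arithmetic = re.search(r"[-+*/]", solution) is not None
    return reference_answer in solution and not has_arithmetic

def build_synthetic_dataset(problems, sample_solutions, n_samples=128):
    """Keep correct, non-cheating, deduplicated solutions for each problem.

    `sample_solutions(question, n)` is a placeholder for drawing n completions
    from an open-source model with few-shot demonstrations in the prompt.
    """
    dataset = []
    for problem in problems:
        seen = set()
        for sol in sample_solutions(problem["question"], n_samples):
            if sol in seen:
                continue  # drop exact duplicates to preserve diversity
            seen.add(sol)
            if is_cheating(sol, problem["answer"]):
                continue
            if extract_answer(sol) == problem["answer"]:
                dataset.append({"question": problem["question"], "solution": sol})
    return dataset
```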
 
- Achieved competitive results without using proprietary OpenAI data:
  - Comparable performance to leading models on GSM8K
  - Within reach of GPT-4 on the MATH dataset
 
- A custom Data Explorer tool was developed for:
  - Visualizing and analyzing model outputs
  - Identifying common error patterns
  - Streamlining inference and evaluation
  - Supporting LLM-specific data-analysis needs
 
- Key challenges included:
  - Getting models to show their reasoning rather than just outputting answers
  - Handling arithmetic mistakes in pure-text solutions
  - Ensuring diversity in the synthetic training data
  - Creating isolated sandboxes for safe code execution (a minimal sketch follows this list)
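
For the sandboxing challenge, one simple baseline is to run each generated program in a short-lived subprocess with a hard timeout; a real deployment would add OS-level isolation (containers, resource limits, no network access) on top. A minimal sketch along those lines:

```python
import os
import subprocess
import tempfile

def execute_sandboxed(code: str, timeout_s: float = 10.0) -> dict:
    """Run untrusted generated code in a separate Python process.

    This only provides process isolation plus a timeout; it is a baseline,
    not a substitute for proper container- or VM-level sandboxing.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(["python", path],
                              capture_output=True, text=True, timeout=timeout_s)
        return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "stderr": "execution timed out", "returncode": -1}
    finally:
        os.unlink(path)

# Example: a runaway generated solution is cut off by the timeout.
print(execute_sandboxed("while True: pass", timeout_s=1.0))
```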