Ivan Moshkov & Daria Gitman - How to Build an LLM for Math Reasoning without Proprietary Data?
Discover how to build large language models for mathematical reasoning using open-source data, synthetic datasets, and innovative training approaches to achieve GPT-4 level results.
- Building LLMs for math reasoning without proprietary data comes down to generating synthetic datasets with open-source models and fine-tuning on them
- Key datasets were GSM8K (grade-school math) with 7.5K training samples and MATH (university-level math) with 7.5K samples across different math topics
- Three main solution approaches were explored (contrasted in the sketch after this list):
- Text-based solutions (human-readable)
- Code-based solutions (using Python)
- Code interpreter style (combining text reasoning with executable code)
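To make the contrast concrete, here is a hedged sketch of the same toy problem in the text-based and code-based styles (the problem itself is illustrative, not from the talk):

```python
# Illustrative problem (not from the talk): "A bakery sells 24 muffins a day
# at $3 each; daily ingredient cost is $30. What is the weekly profit?"

# Text-based style: the model writes prose like
#   "Daily revenue is 24 * 3 = 72 dollars, so daily profit is 72 - 30 = 42,
#    and weekly profit is 42 * 7 = 294 dollars."
# Every arithmetic step is done by the model itself and can silently go wrong.

# Code-based style: the model emits only executable Python, so the
# interpreter does the arithmetic.
daily_revenue = 24 * 3           # $72
daily_profit = daily_revenue - 30
weekly_profit = daily_profit * 7
print(weekly_profit)             # 294
```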
- The code interpreter approach worked best (a minimal executor sketch follows this list) by allowing models to:
- Write natural text explanations
- Execute Python code for calculations
- Feed execution results back into the context to continue reasoning
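A minimal sketch of the loop this implies, assuming `<llm-code>`/`<llm-code-output>` tags mark code and its output (the tag names are an assumption about the prompt format): when the model closes a code block, the harness executes it and splices the stdout back into the context so generation can resume.

```python
import contextlib
import io
import re

CODE_RE = re.compile(r"<llm-code>(.*?)</llm-code>", re.DOTALL)

def append_code_output(generation_so_far: str) -> str:
    """Run the most recent code block the model emitted and splice its
    stdout back into the context so the model can keep reasoning."""
    blocks = CODE_RE.findall(generation_so_far)
    if not blocks:
        return generation_so_far  # no code block emitted yet
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(blocks[-1], {})  # a real system would run this in an isolated sandbox
    return (generation_so_far
            + f"<llm-code-output>\n{buf.getvalue()}</llm-code-output>\n")
```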
- The model development pipeline involved (a data-formatting sketch follows this list):
- Pre-training on large general corpus
- Supervised fine-tuning on math problems
- Chat fine-tuning for assistant-like behavior
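As a concrete piece of the supervised fine-tuning stage, a hedged sketch of how one (problem, solution) pair might be packed into a training example; the prompt template below is an assumption, not the talk's exact format.

```python
def format_sft_example(problem: str, solution: str) -> str:
    """Pack one (problem, solution) pair into a single training string.
    The template is an assumed format; during loss computation one would
    typically mask everything up to 'Assistant:' so the model is only
    trained to produce the solution."""
    return (
        "System: Solve the math problem step by step, "
        "using Python code for any calculation.\n"
        f"User: {problem}\n"
        f"Assistant: {solution}"
    )
```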
- Techniques for improving results (the generate-and-filter loop is sketched after this list):
- Using few-shot demonstrations to guide solution format
- Handling arithmetic errors through code execution
- Filtering out “cheating” solutions that just copy answers
- Generating multiple solutions per problem (128-256) for diversity
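These ideas combine into a generate-and-filter loop. A hedged sketch, assuming solutions end with a `\boxed{...}` answer; `sample_solution` stands in for whatever sampling-based LLM call is used, and the cheating heuristic is purely illustrative.

```python
import re

BOXED_RE = re.compile(r"\\boxed\{([^}]*)\}")

def extract_answer(solution: str) -> str | None:
    """Pull the final \\boxed{...} value out of a generated solution."""
    found = BOXED_RE.findall(solution)
    return found[-1] if found else None

def looks_like_cheating(solution: str) -> bool:
    """Heuristic: no code and almost no prose besides the boxed answer
    suggests the model copied the answer without reasoning."""
    body = BOXED_RE.sub("", solution)
    return "<llm-code>" not in solution and len(body.split()) < 10

def build_synthetic_dataset(problems, sample_solution, n_samples=128):
    """For each (problem, reference_answer) pair, sample many candidate
    solutions and keep only correct, non-trivial ones."""
    dataset = []
    for problem, reference_answer in problems:
        for _ in range(n_samples):
            sol = sample_solution(problem)  # high-temperature sampling for diversity
            if extract_answer(sol) != reference_answer:
                continue  # wrong final answer: discard
            if looks_like_cheating(sol):
                continue  # copied the answer without showing work
            dataset.append({"problem": problem, "solution": sol})
    return dataset
```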
- Achieved competitive results without using proprietary OpenAI data:
- Performance comparable to leading models on GSM8K
- Within reach of GPT-4 on the MATH dataset
- A custom Data Explorer tool was developed for:
- Visualizing and analyzing model outputs
- Identifying common error patterns
- Streamlining inference and evaluation
- Supporting LLM-specific data analysis needs
- Key challenges included (a sandboxing sketch follows this list):
- Getting models to show reasoning vs just outputting answers
- Handling arithmetic mistakes in pure text solutions
- Ensuring diversity in synthetic training data
- Creating isolated sandboxes for safe code execution
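On the last point, a minimal sketch of process-level isolation using a subprocess with a timeout; a production sandbox would go further (containers, seccomp, memory caps, no network access).

```python
import os
import subprocess
import sys
import tempfile

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    """Run model-generated Python in a separate interpreter process with a
    time limit. Minimal isolation only: a real sandbox would also restrict
    the filesystem, network, and memory."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, "-I", path],  # -I: isolated mode, ignores user site/env
            capture_output=True, text=True, timeout=timeout,
        )
        return proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        return "Execution timed out"
    finally:
        os.unlink(path)
```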