Running open large language models in production with Ollama and serverless GPUs, by Wietse Venema
Learn how to deploy open LLMs in production using Ollama and serverless GPUs. Covers model quantization, Cloud Run deployment, security, and building agentic AI systems.
- Ollama provides a user-friendly and performant LLM inference server that works well with open models like Google’s Gemma, supporting both local and cloud deployment
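  As a rough illustration, here is a minimal sketch of calling a locally running Ollama server over its HTTP API; the `gemma2:9b` model tag is an assumption and should match whatever model you have pulled:

  ```python
  import requests

  # Minimal sketch: call a locally running Ollama server's generate endpoint.
  # Assumes Ollama is listening on its default port (11434) and that a Gemma
  # model tag such as "gemma2:9b" has already been pulled with `ollama pull`.
  OLLAMA_URL = "http://localhost:11434/api/generate"

  payload = {
      "model": "gemma2:9b",    # assumed model tag; substitute the one you pulled
      "prompt": "Explain serverless GPUs in one sentence.",
      "stream": False,         # return one JSON object instead of a stream
  }

  resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
  resp.raise_for_status()
  print(resp.json()["response"])
  ```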
- GPU acceleration is crucial for production LLM inference speed: NVIDIA L4 GPUs with 24 GB of VRAM can handle 9B-parameter models effectively
- Model quantization (reducing weight precision from 32-bit down to 4-bit or 8-bit) shrinks a model's memory footprint while maintaining acceptable output quality
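  A quick back-of-the-envelope calculation shows why this matters for a 9B-parameter model on a 24 GB GPU (weights only; KV cache and runtime overhead are ignored):

  ```python
  # Rough weight-memory estimate for a ~9B-parameter model at different precisions.
  # Ignores the KV cache, activations, and runtime overhead, so treat these
  # numbers as lower bounds.
  PARAMS = 9e9  # ~9 billion parameters

  for bits in (32, 16, 8, 4):
      gib = PARAMS * bits / 8 / 1024**3
      print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")

  # Roughly: 32-bit ~33.5 GiB (does not fit in 24 GB), 16-bit ~16.8 GiB,
  # 8-bit ~8.4 GiB, 4-bit ~4.2 GiB -- which is why quantized 9B models
  # run comfortably on a single NVIDIA L4.
  ```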
- Cloud Run provides key benefits for LLM deployment:
  - Auto-scaling based on request load
  - Pay-per-use pricing
  - Fast cold starts through optimized container image handling
  - Built-in request concurrency management
  - Zero configuration needed for HTTPS endpoints
- Request concurrency limits are essential for LLM inference workloads, as per-request latency degrades quickly when many requests are processed in parallel
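  One way to enforce such a limit on the client side is a simple semaphore; the sketch below assumes an Ollama endpoint and an arbitrary cap of four in-flight requests (on Cloud Run, the equivalent server-side knob is the service's concurrency setting):

  ```python
  import asyncio
  import httpx

  # Sketch: cap how many prompts are sent to the inference server in parallel.
  # LLM inference slows down sharply under heavy parallelism, so a small
  # client-side semaphore keeps latency predictable; the cap of 4, the model
  # tag, and the endpoint are assumptions.
  OLLAMA_URL = "http://localhost:11434/api/generate"
  MAX_CONCURRENT = 4

  async def generate(client, semaphore, prompt):
      async with semaphore:  # at most MAX_CONCURRENT requests in flight
          resp = await client.post(
              OLLAMA_URL,
              json={"model": "gemma2:9b", "prompt": prompt, "stream": False},
              timeout=120,
          )
          resp.raise_for_status()
          return resp.json()["response"]

  async def main():
      semaphore = asyncio.Semaphore(MAX_CONCURRENT)
      prompts = [f"Summarize topic {i} in one line." for i in range(10)]
      async with httpx.AsyncClient() as client:
          answers = await asyncio.gather(
              *(generate(client, semaphore, p) for p in prompts)
          )
      for answer in answers:
          print(answer)

  asyncio.run(main())
  ```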
- LangChain helps orchestrate complex LLM workflows through a graph-based approach, enabling tool use and autonomous decision-making
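  As a minimal sketch of the building block such a workflow orchestrates, here is the chat model wrapped via the `langchain-ollama` integration; the model tag and endpoint are assumptions, and a full graph-based workflow would wrap this model in nodes and tools rather than calling it directly:

  ```python
  from langchain_ollama import ChatOllama
  from langchain_core.messages import HumanMessage, SystemMessage

  # Sketch using the langchain-ollama integration (`pip install langchain-ollama`).
  # A graph-based workflow would invoke this same chat model from its nodes.
  llm = ChatOllama(
      model="gemma2:9b",                  # assumed model tag
      base_url="http://localhost:11434",  # Ollama's default local endpoint
      temperature=0.2,
  )

  messages = [
      SystemMessage(content="You are a concise assistant."),
      HumanMessage(content="What does a serverless GPU buy me?"),
  ]
  print(llm.invoke(messages).content)
  ```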
- For production systems, consider:
  - Authentication and access controls
  - Network isolation
  - Resource limits
  - Proper error handling
  - Request timeout management (see the sketch after this list)
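  For the last two points, a hedged sketch of a defensive client call with an explicit timeout and basic error handling; the timeout values, model tag, and endpoint are assumptions to tune for your own model:

  ```python
  import requests

  # Sketch: defensive call to the inference endpoint with an explicit timeout
  # and basic error handling, so one slow or failed generation cannot hang the
  # calling service indefinitely.
  OLLAMA_URL = "http://localhost:11434/api/generate"

  def safe_generate(prompt: str) -> str | None:
      try:
          resp = requests.post(
              OLLAMA_URL,
              json={"model": "gemma2:9b", "prompt": prompt, "stream": False},
              timeout=(5, 120),  # 5 s to connect, 120 s to finish generating
          )
          resp.raise_for_status()
          return resp.json()["response"]
      except requests.Timeout:
          return None  # caller decides whether to retry or fail fast
      except requests.RequestException as exc:
          print(f"inference request failed: {exc}")
          return None

  print(safe_generate("Give me one sentence about request timeouts."))
  ```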
- Open models provide advantages like:
  - Full control over deployment
  - No vendor lock-in
  - Ability to run offline
  - Customization potential
  - Lower operational costs
- Gradio enables rapid prototyping of LLM demo applications with minimal code
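  For instance, a minimal chat demo in front of a local Ollama server might look like this (model tag and endpoint are assumptions):

  ```python
  import gradio as gr
  import requests

  # Minimal Gradio chat demo that forwards each message to a local Ollama
  # server. The point is how little code a working prototype needs.
  OLLAMA_URL = "http://localhost:11434/api/generate"

  def answer(message, history):
      resp = requests.post(
          OLLAMA_URL,
          json={"model": "gemma2:9b", "prompt": message, "stream": False},
          timeout=120,
      )
      resp.raise_for_status()
      return resp.json()["response"]

  gr.ChatInterface(answer).launch()
  ```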
- Building agentic systems requires careful prompt engineering and proper tooling to give LLMs autonomy while maintaining control and safety
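  To make this concrete, here is a hedged sketch of a bounded agent loop: the model may either call one whitelisted tool or give a final answer, and a hard step cap keeps its autonomy under the application's control. The JSON prompt format and the tool are illustrative assumptions, not a standard protocol:

  ```python
  import json
  import requests

  # Sketch of a bounded agent loop against Ollama's chat endpoint. The model is
  # prompted to reply with JSON naming either a tool call or a final answer,
  # and the loop is hard-capped at MAX_STEPS iterations.
  OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"
  MAX_STEPS = 3

  def word_count(text: str) -> str:
      """The only tool the agent is allowed to use."""
      return str(len(text.split()))

  SYSTEM_PROMPT = (
      "You can either answer directly or use a tool.\n"
      'To use a tool, reply with JSON: {"tool": "word_count", "input": "<text>"}.\n'
      'To answer, reply with JSON: {"answer": "<your answer>"}.'
  )

  def chat(messages):
      resp = requests.post(
          OLLAMA_CHAT_URL,
          json={"model": "gemma2:9b", "messages": messages, "stream": False},
          timeout=120,
      )
      resp.raise_for_status()
      return resp.json()["message"]["content"]

  messages = [
      {"role": "system", "content": SYSTEM_PROMPT},
      {"role": "user", "content": "How many words are in 'serverless GPUs are neat'?"},
  ]

  for _ in range(MAX_STEPS):  # hard cap keeps the agent bounded
      reply = chat(messages)
      messages.append({"role": "assistant", "content": reply})
      try:
          action = json.loads(reply)
      except json.JSONDecodeError:
          break  # model ignored the format; stop instead of looping forever
      if "answer" in action:
          print("Final answer:", action["answer"])
          break
      if action.get("tool") == "word_count":
          result = word_count(action.get("input", ""))
          messages.append({"role": "user", "content": f"Tool result: {result}"})
  ```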