Running open large language models in production with Ollama and serverless GPUs by Wietse Venema

Learn how to deploy open LLMs in production using Ollama and serverless GPUs. Covers model quantization, Cloud Run deployment, security, and building agentic AI systems.

Key takeaways
  • Ollama provides a user-friendly, performant LLM inference server that works well with open models such as Google’s Gemma and supports both local and cloud deployment (see the client sketch after this list)

  • GPU acceleration is crucial for production LLM inference speed; an NVIDIA L4 GPU with 24 GB of VRAM can serve a 9B-parameter model effectively

  • Model quantization (reducing weight precision, for example from 16- or 32-bit floats down to 4- or 8-bit values) cuts memory usage while keeping output quality acceptable (see the memory estimate after this list)

  • Cloud Run provides key benefits for LLM deployment:

    • Auto-scaling based on request load
    • Pay-per-use pricing
    • Fast cold starts through optimized container image handling
    • Built-in request concurrency management
    • Zero configuration needed for HTTPS endpoints

  • Request concurrency limits are essential for LLM inference workloads, since latency degrades quickly once requests run in parallel on a single instance (see the probe after this list)

  • LangChain’s graph-based companion library, LangGraph, helps orchestrate complex LLM workflows, enabling tool use and autonomous decision-making (see the agent sketch after this list)

  • For production systems, consider:

    • Authentication and access controls (see the identity-token sketch after this list)
    • Network isolation
    • Resource limits
    • Proper error handling
    • Request timeout management

  • Open models provide advantages like:

    • Full control over deployment
    • No vendor lock-in
    • Ability to run offline
    • Customization potential
    • Lower operational costs

  • Gradio enables rapid prototyping of LLM demo applications with minimal code (see the chat UI sketch after this list)

  • Building agentic systems requires careful prompt engineering and proper tooling to give LLMs autonomy while maintaining control and safety
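
Code sketches

A minimal sketch of talking to an Ollama-served Gemma model from Python with the official `ollama` package. The host and model tag are assumptions, so point them at whatever you actually pulled and deployed.

```python
# Minimal sketch: chat with a Gemma model behind Ollama's HTTP API using the
# official `ollama` Python package. Host and model tag are assumptions.
import ollama

client = ollama.Client(host="http://localhost:11434")  # Ollama's default port

response = client.chat(
    model="gemma2:9b",  # assumes you ran `ollama pull gemma2:9b` beforehand
    messages=[{"role": "user", "content": "Explain serverless GPUs in one sentence."}],
)
print(response["message"]["content"])
```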
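
The memory math behind the L4 and quantization takeaways is easy to approximate: weight memory is roughly parameter count times bits per weight, ignoring the KV cache and runtime overhead. A rough sketch:

```python
# Back-of-the-envelope sketch: approximate weight memory for a 9B-parameter
# model at different precisions, ignoring KV cache and runtime overhead.
PARAMS = 9e9          # Gemma-class 9B model
L4_VRAM_GB = 24       # NVIDIA L4

for bits in (32, 16, 8, 4):
    weight_gb = PARAMS * bits / 8 / 1e9
    fits = "fits" if weight_gb < L4_VRAM_GB else "does not fit"
    print(f"{bits:>2}-bit weights: ~{weight_gb:5.1f} GB -> {fits} in {L4_VRAM_GB} GB of VRAM")
```

At 4-bit, a 9B model needs roughly 4.5 GB for weights, leaving most of the 24 GB for the KV cache and concurrent requests; at 32-bit it would not fit at all.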
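
To see why concurrency limits matter, a quick load probe against a single Ollama instance shows per-request latency climbing as parallelism grows. The endpoint, model tag, and prompt below are assumptions, and `httpx` is just one possible HTTP client; point the sketch at your own deployment.

```python
# Rough sketch: measure how per-request latency grows as more parallel
# requests hit one Ollama instance. Endpoint and model tag are assumptions.
import asyncio
import time

import httpx

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "gemma2:9b"

async def one_request(client: httpx.AsyncClient) -> float:
    start = time.perf_counter()
    await client.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": "Say hello.", "stream": False},
        timeout=120,
    )
    return time.perf_counter() - start

async def probe(parallel: int) -> None:
    async with httpx.AsyncClient() as client:
        latencies = await asyncio.gather(*(one_request(client) for _ in range(parallel)))
    print(f"{parallel} parallel requests: avg {sum(latencies) / len(latencies):.1f}s per request")

for n in (1, 2, 4, 8):
    asyncio.run(probe(n))
```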
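
A minimal agentic sketch with LangGraph's prebuilt ReAct agent and a model served through Ollama via the `langchain-ollama` integration. The tool is a stub invented for illustration and the model tag is an assumption; native tool calling only works with models that Ollama exposes tool support for, so swap in a tool-capable model if needed.

```python
# Minimal agentic sketch: a LangGraph ReAct agent over an Ollama-served model.
# The tool is a stub for illustration; the model tag is an assumption and must
# be one that supports tool calling in Ollama.
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.prebuilt import create_react_agent

@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order (stub for illustration)."""
    return f"Order {order_id} is out for delivery."

llm = ChatOllama(model="gemma2:9b", base_url="http://localhost:11434")
agent = create_react_agent(llm, tools=[get_order_status])

result = agent.invoke({"messages": [("user", "Where is order 42?")]})
print(result["messages"][-1].content)
```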
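
For the authentication takeaway, one common pattern is to keep the Cloud Run service IAM-protected and have callers attach a Google-signed identity token. A hedged sketch with the `google-auth` library, assuming Application Default Credentials are available and using a hypothetical service URL:

```python
# Hedged sketch: call an IAM-protected Cloud Run service with a Google-signed
# identity token. The service URL is a hypothetical placeholder.
import google.auth.transport.requests
import google.oauth2.id_token
import requests

SERVICE_URL = "https://ollama-gateway-xxxxx-uc.a.run.app"  # hypothetical URL

auth_request = google.auth.transport.requests.Request()
token = google.oauth2.id_token.fetch_id_token(auth_request, SERVICE_URL)

response = requests.post(
    f"{SERVICE_URL}/api/generate",
    headers={"Authorization": f"Bearer {token}"},
    json={"model": "gemma2:9b", "prompt": "Hello", "stream": False},
    timeout=120,
)
print(response.json()["response"])
```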
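
Finally, a quick Gradio prototype: a chat UI wired to the same Ollama backend in a handful of lines. The host and model tag are again assumptions.

```python
# Quick prototype sketch: a Gradio chat UI in front of an Ollama backend.
import gradio as gr
import ollama

client = ollama.Client(host="http://localhost:11434")

def respond(message, history):
    # With type="messages", Gradio passes history as OpenAI-style dicts;
    # keep only role/content and append the new user turn for Ollama.
    messages = [{"role": m["role"], "content": m["content"]} for m in history]
    messages.append({"role": "user", "content": message})
    reply = client.chat(model="gemma2:9b", messages=messages)
    return reply["message"]["content"]

gr.ChatInterface(respond, type="messages").launch()
```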