From ML to LLM: on-device AI in the browser by Nico Martin

Explore how to run machine learning models and LLMs directly in the browser using WebGPU, WebAssembly, and TensorFlow.js. Learn about RAG, quantization, and the privacy benefits of on-device AI.

Key takeaways
  • Modern browsers now support on-device AI/ML through the WebGPU API, WebAssembly, and TensorFlow.js backends for accelerated neural-network inference (see the backend-selection sketch after this list)

  • Running LLMs in the browser means handling large model downloads (1.4+ GB), but it enables privacy-preserving, offline-capable AI features with no server costs (see the in-browser generation sketch below)

  • The WebNN API proposal aims to provide standardized access to AI-optimized hardware (TPUs, NPUs) across different devices and browsers

  • RAG (Retrieval-Augmented Generation) can be implemented entirely client-side to ground LLM responses in local documents and reduce hallucination (a minimal sketch follows this list)

  • Real-time tasks such as speech recognition and image detection reach 30+ FPS with WebGPU acceleration, versus roughly 5 FPS on the CPU (see the detection-loop sketch below)

  • Quantization reduces model size by storing weights at 4-bit instead of 32-bit precision, making browser deployment far more feasible (see the size arithmetic below)

  • A progressive enhancement approach is recommended: AI features should enhance core functionality rather than be required for it (see the feature-detection sketch below)

  • Models and weights can be cached locally after the initial download, so repeat visits skip the large fetch (see the Cache API sketch below)

  • Open-source tools like Transformers.js, ONNX Runtime Web, and TensorFlow.js enable browser-based ML development (an ONNX Runtime Web sketch appears below)

  • On-device AI allows building privacy-preserving applications that work offline without sending sensitive data to cloud providers
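
Illustrative sketches

A minimal sketch of backend selection in TensorFlow.js. The package names are the official TensorFlow.js backend packages; the WebGPU-first fallback order is an assumption, not something the talk prescribes:

```typescript
import * as tf from '@tensorflow/tfjs-core';
import '@tensorflow/tfjs-backend-webgpu'; // registers the 'webgpu' backend
import '@tensorflow/tfjs-backend-wasm';   // registers the 'wasm' backend

async function initBackend(): Promise<string> {
  // Prefer WebGPU where the browser exposes it, fall back to WebAssembly.
  const candidates = 'gpu' in navigator ? ['webgpu', 'wasm'] : ['wasm'];
  for (const name of candidates) {
    if (await tf.setBackend(name)) {
      await tf.ready();
      return tf.getBackend();
    }
  }
  throw new Error('No suitable TensorFlow.js backend available');
}

initBackend().then((backend) => console.log(`Running on: ${backend}`));
```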
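For the LLM takeaway, one concrete route (an assumption here; the talk's own demo stack may differ) is Transformers.js, which downloads the weights once and then generates fully client-side:

```typescript
import { pipeline } from '@huggingface/transformers';

// The first call downloads the weights (for larger models this is where the
// 1.4+ GB figure bites); later loads are served from the browser cache.
// 'Xenova/gpt2' is just a small example model id.
const generator = await pipeline('text-generation', 'Xenova/gpt2', {
  device: 'gpu' in navigator ? 'webgpu' : 'wasm',
});

const output = await generator('On-device AI matters because', {
  max_new_tokens: 60,
});
console.log(output);
```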
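A minimal client-side RAG sketch, assuming Transformers.js embeddings and any local `generate()` function such as the pipeline above; the model id, chunking, and top-2 cutoff are all illustrative choices:

```typescript
import { pipeline } from '@huggingface/transformers';

// Embedding model runs locally; 'Xenova/all-MiniLM-L6-v2' is an example id.
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embedText(text: string): Promise<number[]> {
  // Mean-pooled, L2-normalized sentence embedding.
  const tensor = await embed(text, { pooling: 'mean', normalize: true });
  return Array.from(tensor.data as Float32Array);
}

// Cosine similarity reduces to a dot product on normalized vectors.
const dot = (a: number[], b: number[]) =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);

// Index local documents once, entirely in the browser.
const docs = ['Chunked text of document A', 'Chunked text of document B'];
const index = await Promise.all(
  docs.map(async (text) => ({ text, vec: await embedText(text) })),
);

async function answer(
  question: string,
  generate: (prompt: string) => Promise<string>,
): Promise<string> {
  const qVec = await embedText(question);
  // Retrieve the two most similar chunks.
  const context = index
    .map((d) => ({ ...d, score: dot(d.vec, qVec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 2)
    .map((d) => d.text)
    .join('\n');
  // Ground the LLM in the retrieved chunks to curb hallucination.
  return generate(`Answer using only this context:\n${context}\n\nQ: ${question}`);
}
```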
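The 30+ FPS claim comes down to running inference inside a requestAnimationFrame loop on a GPU backend; this sketch uses the off-the-shelf coco-ssd detector from TensorFlow.js as a stand-in for whichever model was demoed:

```typescript
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgpu';
import * as cocoSsd from '@tensorflow-models/coco-ssd';

async function runDetection(video: HTMLVideoElement): Promise<void> {
  await tf.setBackend('webgpu'); // this switch is the ~5 vs 30+ FPS difference
  await tf.ready();
  const model = await cocoSsd.load();

  const tick = async () => {
    // Each prediction: { bbox: [x, y, width, height], class, score }.
    const predictions = await model.detect(video);
    console.log(predictions);
    requestAnimationFrame(tick); // schedule the next frame after this one
  };
  requestAnimationFrame(tick);
}
```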
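To put the quantization takeaway in numbers (illustrative arithmetic, not figures from the talk): a 7-billion-parameter model stored as 32-bit floats needs 7 × 10⁹ × 4 B ≈ 28 GB, while the same weights at 4-bit precision take 7 × 10⁹ × 0.5 B ≈ 3.5 GB, an 8× reduction. The 1.4+ GB figure above is in the right range for a model of roughly 2 billion parameters at 4 bits plus runtime overhead.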
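Progressive enhancement in this context usually means feature detection before switching the AI path on. A sketch, assuming the @webgpu/types package for the navigator.gpu typings and a hypothetical enableSmartSuggestions hook:

```typescript
// The page must work without AI; the smart feature is switched on only when
// the browser and hardware can actually carry it.
async function maybeEnableAiFeature(): Promise<void> {
  if (!('gpu' in navigator)) return;             // no WebGPU: keep the core experience
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return;                          // API present, but no usable GPU
  enableSmartSuggestions();                      // hypothetical enhancement hook
}

// Hypothetical placeholder for the actual feature wiring.
function enableSmartSuggestions(): void {
  document.querySelector('#suggestions')?.removeAttribute('hidden');
}

maybeEnableAiFeature();
```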
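A minimal caching sketch using the standard Cache API; the cache name and weights URL are placeholders:

```typescript
// Serve model weights from the Cache API after the first download, so repeat
// visits skip the multi-hundred-megabyte fetch.
async function fetchWeightsCached(url: string): Promise<ArrayBuffer> {
  const cache = await caches.open('model-weights-v1');
  let response = await cache.match(url);
  if (!response) {
    response = await fetch(url);
    // Store a copy; the original body is consumed by arrayBuffer() below.
    if (response.ok) await cache.put(url, response.clone());
  }
  return response.arrayBuffer();
}

const weights = await fetchWeightsCached('/models/example-weights.bin');
```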
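And a minimal ONNX Runtime Web session, to round out the tool list; the model path, input name, and tensor shape are placeholders that depend entirely on the exported model:

```typescript
import * as ort from 'onnxruntime-web';

// Create a session from an ONNX file and run a single inference.
const session = await ort.InferenceSession.create('/models/example.onnx');

// Feeds are keyed by the model's input names ('input' here is a placeholder).
const input = new ort.Tensor('float32', new Float32Array([1, 2, 3, 4]), [1, 4]);
const results = await session.run({ input });
console.log(results);
```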