From ML to LLM: on-device AI in the browser by Nico Martin

Explore how to run machine learning models and LLMs directly in the browser using WebGPU, WebAssembly, and TensorFlow.js. Learn about RAG, quantization, and the privacy benefits of on-device AI.

Key takeaways
  • Modern browsers now support on-device AI/ML through the WebGPU API, WebAssembly, and TensorFlow.js backends for accelerated neural-network inference (see the backend-selection sketch after this list)

  • Running LLMs in the browser means handling large model downloads (1.4+ GB), but it enables privacy-preserving, offline-capable AI features with no server costs (see the in-browser generation sketch below)

  • The WebNN API proposal aims to provide standardized access to AI-optimized hardware (TPUs, NPUs) across different devices and browsers

  • RAG (Retrieval-Augmented Generation) can be implemented entirely client-side to ground LLM responses in local documents and reduce hallucination (a minimal sketch follows this list)

  • Real-time tasks such as speech recognition and image detection reach 30+ FPS with WebGPU acceleration, versus roughly 5 FPS on the CPU (see the detection-loop sketch below)

  • Quantization reduces model size by storing weights at 4-bit instead of 32-bit precision, making browser deployment far more feasible (see the size arithmetic below)

  • A progressive enhancement approach is recommended: AI features should enhance core functionality rather than be required for it (see the feature-detection sketch below)

  • Models and weights can be cached locally after the initial download, so repeat visits skip the large fetch (see the Cache API sketch below)

  • Open-source tools like Transformers.js, ONNX Runtime Web, and TensorFlow.js enable browser-based ML development (an ONNX Runtime Web sketch appears below)

  • On-device AI allows building privacy-preserving applications that work offline without sending sensitive data to cloud providers
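
Illustrative sketches

A minimal sketch of backend selection in TensorFlow.js. The package names are the official TensorFlow.js backend packages; the WebGPU-first fallback order is an assumption, not something the talk prescribes:

```typescript
import * as tf from '@tensorflow/tfjs-core';
import '@tensorflow/tfjs-backend-webgpu'; // registers the 'webgpu' backend
import '@tensorflow/tfjs-backend-wasm';   // registers the 'wasm' backend

async function initBackend(): Promise<string> {
  // Prefer WebGPU where the browser exposes it, fall back to WebAssembly.
  const candidates = 'gpu' in navigator ? ['webgpu', 'wasm'] : ['wasm'];
  for (const name of candidates) {
    if (await tf.setBackend(name)) {
      await tf.ready();
      return tf.getBackend();
    }
  }
  throw new Error('No suitable TensorFlow.js backend available');
}

initBackend().then((backend) => console.log(`Running on: ${backend}`));
```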
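For the LLM takeaway, one concrete route (an assumption here; the talk's own demo stack may differ) is Transformers.js, which downloads the weights once and then generates fully client-side:

```typescript
import { pipeline } from '@huggingface/transformers';

// The first call downloads the weights (for larger models this is where the
// 1.4+ GB figure bites); later loads are served from the browser cache.
// 'Xenova/gpt2' is just a small example model id.
const generator = await pipeline('text-generation', 'Xenova/gpt2', {
  device: 'gpu' in navigator ? 'webgpu' : 'wasm',
});

const output = await generator('On-device AI matters because', {
  max_new_tokens: 60,
});
console.log(output);
```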
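A minimal client-side RAG sketch, assuming Transformers.js embeddings and any local `generate()` function such as the pipeline above; the model id, chunking, and top-2 cutoff are all illustrative choices:

```typescript
import { pipeline } from '@huggingface/transformers';

// Embedding model runs locally; 'Xenova/all-MiniLM-L6-v2' is an example id.
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embedText(text: string): Promise<number[]> {
  // Mean-pooled, L2-normalized sentence embedding.
  const tensor = await embed(text, { pooling: 'mean', normalize: true });
  return Array.from(tensor.data as Float32Array);
}

// Cosine similarity reduces to a dot product on normalized vectors.
const dot = (a: number[], b: number[]) =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);

// Index local documents once, entirely in the browser.
const docs = ['Chunked text of document A', 'Chunked text of document B'];
const index = await Promise.all(
  docs.map(async (text) => ({ text, vec: await embedText(text) })),
);

async function answer(
  question: string,
  generate: (prompt: string) => Promise<string>,
): Promise<string> {
  const qVec = await embedText(question);
  // Retrieve the two most similar chunks.
  const context = index
    .map((d) => ({ ...d, score: dot(d.vec, qVec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 2)
    .map((d) => d.text)
    .join('\n');
  // Ground the LLM in the retrieved chunks to curb hallucination.
  return generate(`Answer using only this context:\n${context}\n\nQ: ${question}`);
}
```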
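The 30+ FPS claim comes down to running inference inside a requestAnimationFrame loop on a GPU backend; this sketch uses the off-the-shelf coco-ssd detector from TensorFlow.js as a stand-in for whichever model was demoed:

```typescript
import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-webgpu';
import * as cocoSsd from '@tensorflow-models/coco-ssd';

async function runDetection(video: HTMLVideoElement): Promise<void> {
  await tf.setBackend('webgpu'); // this switch is the ~5 vs 30+ FPS difference
  await tf.ready();
  const model = await cocoSsd.load();

  const tick = async () => {
    // Each prediction: { bbox: [x, y, width, height], class, score }.
    const predictions = await model.detect(video);
    console.log(predictions);
    requestAnimationFrame(tick); // schedule the next frame after this one
  };
  requestAnimationFrame(tick);
}
```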
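To put the quantization takeaway in numbers (illustrative arithmetic, not figures from the talk): a 7-billion-parameter model stored as 32-bit floats needs 7 × 10⁹ × 4 B ≈ 28 GB, while the same weights at 4-bit precision take 7 × 10⁹ × 0.5 B ≈ 3.5 GB, an 8× reduction. The 1.4+ GB figure above is in the right range for a model of roughly 2 billion parameters at 4 bits plus runtime overhead.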
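Progressive enhancement in this context usually means feature detection before switching the AI path on. A sketch, assuming the @webgpu/types package for the navigator.gpu typings and a hypothetical enableSmartSuggestions hook:

```typescript
// The page must work without AI; the smart feature is switched on only when
// the browser and hardware can actually carry it.
async function maybeEnableAiFeature(): Promise<void> {
  if (!('gpu' in navigator)) return;             // no WebGPU: keep the core experience
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return;                          // API present, but no usable GPU
  enableSmartSuggestions();                      // hypothetical enhancement hook
}

// Hypothetical placeholder for the actual feature wiring.
function enableSmartSuggestions(): void {
  document.querySelector('#suggestions')?.removeAttribute('hidden');
}

maybeEnableAiFeature();
```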
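A minimal caching sketch using the standard Cache API; the cache name and weights URL are placeholders:

```typescript
// Serve model weights from the Cache API after the first download, so repeat
// visits skip the multi-hundred-megabyte fetch.
async function fetchWeightsCached(url: string): Promise<ArrayBuffer> {
  const cache = await caches.open('model-weights-v1');
  let response = await cache.match(url);
  if (!response) {
    response = await fetch(url);
    // Store a copy; the original body is consumed by arrayBuffer() below.
    if (response.ok) await cache.put(url, response.clone());
  }
  return response.arrayBuffer();
}

const weights = await fetchWeightsCached('/models/example-weights.bin');
```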
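And a minimal ONNX Runtime Web session, to round out the tool list; the model path, input name, and tensor shape are placeholders that depend entirely on the exported model:

```typescript
import * as ort from 'onnxruntime-web';

// Create a session from an ONNX file and run a single inference.
const session = await ort.InferenceSession.create('/models/example.onnx');

// Feeds are keyed by the model's input names ('input' here is a placeholder).
const input = new ort.Tensor('float32', new Float32Array([1, 2, 3, 4]), [1, 4]);
const results = await session.run({ input });
console.log(results);
```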