Fast Inference for Language Models | PyData Yerevan May 2022


This talk from PyData Yerevan 2022 covers how fast inference for language models can be achieved through techniques such as quantization, binarization, and pruning, along with the benefits of compiler optimization and how to measure model quality.

Key takeaways

Understanding Fast Inference for Language Models

Inference for language models can be slow because of the sheer number of floating-point operations involved, so researchers have developed techniques to speed it up, such as quantization, binarization, and pruning.
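
As a small illustration of pruning, the sketch below uses PyTorch's built-in magnitude pruning on a stand-in linear layer; the layer size and 30% sparsity level are arbitrary choices for the example, not values from the talk.

```python
import torch
import torch.nn.utils.prune as prune

# Stand-in for one linear layer of a transformer model.
layer = torch.nn.Linear(768, 768)

# Zero out the 30% of weights with the smallest magnitude (unstructured L1 pruning).
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Fold the pruning mask into the weight tensor so the sparsity is permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity after pruning: {sparsity:.0%}")
```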

Quantization itself comes in different flavors, such as static and dynamic quantization, and knowledge distillation is another common compression technique. Kernel fusion and custom-compiled models can also improve performance.
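
As a sketch of the dynamic variant, PyTorch can quantize the linear layers of an already-trained model in a single call; the toy feed-forward block below is only a placeholder for a real language model.

```python
import torch

# Placeholder for a language model's feed-forward layers.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.ReLU(),
    torch.nn.Linear(3072, 768),
)

# Dynamic quantization stores weights as int8 and quantizes activations
# on the fly at inference time; unlike static quantization, it needs no
# calibration data.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    print(quantized(torch.randn(1, 768)).shape)
```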

Benefits of Quantization

Quantizing a model from float32 to float16, or to int8 with 8-bit integer arithmetic, significantly reduces the number of bits per weight, the cost of each calculation, and overall memory usage, resulting in 3-4x faster inference in some cases.
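
A back-of-the-envelope sketch of the memory side of that claim: casting a weight matrix from float32 to float16 halves its footprint, and int8 halves it again. The 4096x4096 matrix and the int8 scale below are arbitrary example values.

```python
import torch

w = torch.randn(4096, 4096)                          # float32: 4 bytes per value
print(w.element_size() * w.nelement() / 2**20)       # ~64 MiB

w16 = w.half()                                       # float16: 2 bytes per value
print(w16.element_size() * w16.nelement() / 2**20)   # ~32 MiB

w8 = torch.quantize_per_tensor(w, scale=0.1, zero_point=0, dtype=torch.qint8)
print(w8.element_size() * w8.nelement() / 2**20)     # ~16 MiB: 1 byte per value
```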

Compiler Optimization

Compilers can optimize GPU code for machine learning, but not all software packages support custom optimizations. Some libraries, such as AMP and cuDNN, are built to optimize performance.
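
A minimal sketch of how AMP is typically enabled for inference in PyTorch, assuming a CUDA GPU is available; the model here is a placeholder, not the one used in the talk.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()       # placeholder model
x = torch.randn(8, 1024, device="cuda")

# Automatic mixed precision: eligible ops run in float16 on the GPU,
# letting cuDNN/cuBLAS pick faster tensor-core kernels.
with torch.no_grad(), torch.cuda.amp.autocast():
    y = model(x)

print(y.dtype)  # torch.float16
```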

Tensor cores on modern GPUs can already perform matrix multiplications while maintaining high precision, but they do not reach the same level of performance as dedicated neural network accelerators.
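
One way to reach tensor cores from PyTorch without changing model code is TF32, which keeps float32 storage but uses a reduced-precision mantissa inside the matrix multiply. This sketch assumes an Ampere-or-newer GPU and is not taken from the talk.

```python
import torch

# Allow tensor cores to run float32 matmuls in TF32 (reduced 10-bit mantissa).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(2048, 2048, device="cuda")
b = torch.randn(2048, 2048, device="cuda")
c = a @ b  # routed to tensor cores when TF32 is enabled
print(c.dtype)  # still torch.float32
```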

Measuring Quality

Semantic Textual Similarity (STS) from the GLUE benchmark is used to assess the quality of models.
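
A sketch of how such an evaluation is commonly wired up with the Hugging Face datasets library: load the STS-B split of GLUE, score each sentence pair, and report Spearman correlation against the human labels. The predict_similarity function is a hypothetical placeholder for the model under test.

```python
from datasets import load_dataset
from scipy.stats import spearmanr


def predict_similarity(sentence1: str, sentence2: str) -> float:
    """Hypothetical placeholder; swap in the quantized/compiled model under test."""
    return 0.0


# STS-B: sentence pairs annotated with a 0-5 human similarity score.
stsb = load_dataset("glue", "stsb", split="validation")

predictions = [predict_similarity(ex["sentence1"], ex["sentence2"]) for ex in stsb]
labels = [ex["label"] for ex in stsb]

# Quality on STS-B is usually reported as Spearman correlation.
print(spearmanr(predictions, labels).correlation)
```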

Comparison of Performance

Different libraries perform similarly in terms of speed, but only a few work out of the box without manual configuration. Some configurations are more effective than others even without writing custom PyTorch code.
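
A generic way to run such a comparison yourself is to time the forward pass directly; the harness below is a simple sketch, not the benchmark code used in the talk.

```python
import time
import torch


def benchmark(model, example, n_iters=100):
    """Average wall-clock latency of a forward pass, after a short warm-up."""
    with torch.no_grad():
        for _ in range(10):                      # warm-up
            model(example)
        start = time.perf_counter()
        for _ in range(n_iters):
            model(example)
        return (time.perf_counter() - start) / n_iters


# Example: baseline linear layer vs. its dynamically quantized copy.
baseline = torch.nn.Linear(768, 768)
quantized = torch.quantization.quantize_dynamic(
    baseline, {torch.nn.Linear}, dtype=torch.qint8
)
x = torch.randn(1, 768)
print(benchmark(baseline, x), benchmark(quantized, x))
```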

Demo and Future Plans

The talk closes with a demo and future plans, including further improving performance, reducing memory requirements, and exploring other optimization ideas.