Fast Inference for Language Models | PyData Yerevan May 2022
Understand how fast inference for language models can be achieved through techniques like quantization, binarization, and pruning, and learn about the benefits of compiler optimization and measuring quality in this talk from PyData Yerevan 2022.
Understanding Fast Inference for Language Models
Inference for language models can take a long time due to the floating-point arithmetic operations involved. Therefore, researchers have developed techniques to speed it up, such as quantization, binarization, and pruning.
Quantization comes in static and dynamic variants; other approaches include knowledge distillation, kernel fusion, and custom-compiled models, which can further improve performance.
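As a concrete illustration of the pruning mentioned above, here is a minimal sketch using PyTorch's pruning utilities; the small model and the 30% sparsity level are arbitrary choices for the example, not values from the talk.

    import torch
    import torch.nn.utils.prune as prune

    # A small example network; the architecture is arbitrary.
    model = torch.nn.Sequential(
        torch.nn.Linear(768, 768),
        torch.nn.ReLU(),
        torch.nn.Linear(768, 2),
    )

    # Zero out the 30% of weights with the smallest L1 magnitude in each Linear layer.
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.3)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor

    # Check the resulting overall sparsity (biases are not pruned, so it is slightly below 30%).
    total = sum(p.numel() for p in model.parameters())
    zeros = sum((p == 0).sum().item() for p in model.parameters())
    print(f"sparsity: {zeros / total:.1%}")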
Benefits of Quantization
Quantizing a model from float32 to float16, or to int8 with 8-bit integer arithmetic, greatly reduces the number of bits per parameter, the cost of each calculation, and memory usage, resulting in 3-4 times faster inference in some cases.
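Below is a minimal sketch of dynamic int8 quantization with PyTorch; the toy model and layer sizes are placeholders for illustration, not the exact setup used in the talk.

    import torch

    # Example model; any module containing Linear layers works for dynamic quantization.
    model = torch.nn.Sequential(
        torch.nn.Linear(768, 768),
        torch.nn.ReLU(),
        torch.nn.Linear(768, 2),
    )
    model.eval()

    # Replace Linear layers with versions that store int8 weights and use
    # 8-bit integer arithmetic at inference time.
    quantized = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )

    x = torch.randn(1, 768)
    with torch.no_grad():
        print(quantized(x))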
Compiler Optimization
Compilers can optimize GPU code for machine learning, but not all software packages support custom optimizations. Tools such as automatic mixed precision (AMP) and libraries like cuDNN are built to improve performance.
Tensor cores on modern GPUs can already perform matrix multiplications while maintaining high precision, though they do not reach the performance of dedicated neural-network accelerators.
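For example, PyTorch's AMP runs eligible matrix multiplications in float16 so they can use tensor cores; the following is a generic sketch that assumes a CUDA GPU is available, not code from the talk.

    import torch

    model = torch.nn.Linear(1024, 1024).cuda()
    x = torch.randn(8, 1024, device="cuda")

    # autocast runs eligible ops (e.g. matmuls) in float16 on tensor cores,
    # while keeping precision-sensitive ops in float32.
    with torch.no_grad(), torch.cuda.amp.autocast():
        y = model(x)

    print(y.dtype)  # torch.float16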
Measuring Quality
Model quality is assessed with the Semantic Textual Similarity (STS-B) task from the GLUE benchmark.
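A sketch of what such an evaluation might look like, assuming model predictions are compared against the gold similarity scores of GLUE's STS-B validation split with Spearman correlation (the usual STS-B metric); the random predictions below are placeholders for the output of the model under test.

    import numpy as np
    from datasets import load_dataset          # Hugging Face datasets
    from scipy.stats import spearmanr

    # Gold similarity scores (0-5) from the STS-B validation split of GLUE.
    stsb = load_dataset("glue", "stsb", split="validation")
    gold = np.array(stsb["label"])

    # Placeholder predictions; in practice these come from the model being evaluated.
    predictions = np.random.uniform(0, 5, size=len(gold))

    correlation, _ = spearmanr(predictions, gold)
    print(f"Spearman correlation: {correlation:.3f}")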
Comparison of Performance
Different libraries perform similarly in terms of speed, but only a few work by default without manual configuration. Some configurations are more effective than others without requiring any custom code in PyTorch.
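A minimal latency-measurement sketch of the kind used for such comparisons; the model, batch size, and iteration counts are arbitrary choices for illustration.

    import time
    import torch

    def measure_latency(model, example, warmup=10, iters=100):
        """Return mean latency in milliseconds over `iters` timed runs."""
        model.eval()
        with torch.no_grad():
            for _ in range(warmup):           # warm-up runs are not timed
                model(example)
            start = time.perf_counter()
            for _ in range(iters):
                model(example)
            elapsed = time.perf_counter() - start
        return elapsed / iters * 1000

    model = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.ReLU())
    example = torch.randn(1, 768)
    print(f"{measure_latency(model, example):.2f} ms per inference")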
Demo and Future Plans
The talk closes with a demo and future plans, including further performance improvements, reduced memory requirements, and other optimization ideas to explore.