Multi-Modal Compression and Inference with vLLM: Exploring FlagScale's Application Practices and Technical Aspects
Large models have attracted significant attention for their strong performance across a wide range of tasks. In resource-constrained settings, however, the substantial compute and memory demands of inference pose real challenges, and the industry is actively developing techniques to improve the inference efficiency of large models. This report shares FlagScale's practical experience in compressing and serving multi-modal large models on the vLLM framework: 1) an analysis of the modules, strategies, and performance behind the newly added CFG (classifier-free guidance) sampling feature in vLLM, with the core logit combination sketched below; and 2) using the llm-compressor tool to quantize multi-modal models at different granularities for different deployment scenarios, examining how multi-modal models differ from language models and how far multi-modal compression can be pushed (a recipe sketch follows the CFG example).
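As background for point 1), the sketch below shows the generic classifier-free guidance combination of conditional and unconditional next-token scores, which is the mathematical core of CFG sampling. The function name `cfg_combine` and the per-step driver in the comments are illustrative assumptions, not the actual FlagScale/vLLM module.

```python
import torch

def cfg_combine(cond_logits: torch.Tensor,
                uncond_logits: torch.Tensor,
                guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance on next-token scores.

    Combines logits from a conditional pass (full prompt) and an
    unconditional/negative-prompt pass. A guidance_scale of 1.0
    reduces to ordinary sampling; larger values push the output
    distribution toward the conditional prompt.
    """
    # Work in log-probability space so the two passes are comparable.
    cond = torch.log_softmax(cond_logits, dim=-1)
    uncond = torch.log_softmax(uncond_logits, dim=-1)
    return uncond + guidance_scale * (cond - uncond)

# Schematic decoding step (hypothetical driver): both passes share the
# tokens generated so far, so CFG roughly doubles per-step compute.
#   cond_logits   = model(prompt + generated)
#   uncond_logits = model(negative_prompt + generated)
#   next_token    = sample(cfg_combine(cond_logits, uncond_logits, 1.5))
```

The main engineering cost, which motivates dedicated framework support, is that every decoding step requires two forward passes with separate KV caches for the conditional and unconditional sequences.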
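For point 2), here is a minimal one-shot quantization sketch in the style of llm-compressor's documented API. The model id, calibration dataset, output path, and the ignore pattern for the vision tower are placeholder assumptions; real multi-modal recipes typically also need a model-specific data collator for the calibration samples.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# One-shot W4A16 weight quantization of the language-model Linear layers.
# lm_head and the vision tower are left unquantized, a common choice
# when compressing multi-modal checkpoints.
oneshot(
    model="Qwen/Qwen2-VL-7B-Instruct",        # placeholder model id
    dataset="open_platypus",                  # placeholder calibration set
    recipe=GPTQModifier(
        scheme="W4A16",
        targets="Linear",
        ignore=["lm_head", "re:visual.*"],    # skip non-language modules
    ),
    output_dir="Qwen2-VL-7B-Instruct-W4A16",  # placeholder output path
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The saved checkpoint can then be loaded directly by vLLM for serving, which is what makes this tool chain attractive for the deployment scenarios discussed in the report.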