After the paper presentation, both the host and the audience expressed great interest in the paper and asked thoughtful questions. Our guest of honor answered each query in detail, leading to a lively and insightful discussion. The following is a partial Q&A record:
Q1 (Host): What would be the main difference between cache compression and video streaming encoding? And what is the overhead for the compression process?
A1: KV cache compression is analogous to video compression, but there are key differences. For example, we exploit LLM-specific characteristics, such as applying more aggressive quantization to the KV of later layers. Since the compression is implemented on the GPU, its overhead is very small; it is much faster than the prefill process itself.
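To make the layer-wise idea concrete, here is a minimal sketch of per-layer uniform quantization in which later layers get fewer bits. The function name, bit widths, and the 50/50 layer split are illustrative assumptions, not CacheGen's actual codec.

```python
import torch

def quantize_kv_per_layer(kv_layers, early_bits=8, late_bits=4, split=0.5):
    """Quantize each layer's (K, V) tensors, using fewer bits for later layers.

    kv_layers: list of (key, value) float tensors, one pair per layer.
    The bit widths and the 50/50 split are illustrative defaults, not
    the configuration used in the paper.
    """
    n = len(kv_layers)
    compressed = []
    for i, (k, v) in enumerate(kv_layers):
        bits = early_bits if i < n * split else late_bits
        levels = 2 ** bits - 1
        layer = []
        for t in (k, v):
            lo, hi = t.min(), t.max()
            scale = torch.clamp(hi - lo, min=1e-8) / levels
            codes = torch.round((t - lo) / scale).to(torch.int32)
            layer.append((codes, lo, scale))  # keep params for dequantization
        compressed.append(layer)
    return compressed
```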
Q2 (Host): What happens if the KV cache is swapped or updated? Is the compression handled incrementally, or is only the swapped part compressed separately?
A2: When the KV cache is updated, the new KV cache is appended to the existing one, and the new part is compressed separately. The previous KV cache remains unchanged. This design allows for efficient handling of cache updates without recompressing the entire cache.
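As a rough illustration of that append-and-compress-only-the-delta behavior, the sketch below keeps a list of independently compressed chunks. The class, its interface, and the placeholder `compress_fn`/`decompress_fn` hooks are hypothetical, not CacheGen's API.

```python
class IncrementalKVStore:
    """Append-only KV cache where only newly added tokens are compressed."""

    def __init__(self, compress_fn, decompress_fn):
        self.compress_fn = compress_fn      # e.g. quantization + entropy coding
        self.decompress_fn = decompress_fn
        self.chunks = []                    # independently compressed chunks

    def append(self, new_kv):
        # Compress only the delta; earlier chunks are never re-encoded.
        self.chunks.append(self.compress_fn(new_kv))

    def materialize(self):
        # Decompress every chunk and concatenate to rebuild the full cache.
        return [item for chunk in self.chunks for item in self.decompress_fn(chunk)]
```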
Q3 (Audience 1): Can CacheGen's decompression and decoding run on an AI phone or AI PC?
A3: We haven’t tested this on low-power devices yet. We currently run it on GPUs for LLM inference. In principle, though, it should be possible to apply it to any device.
Q4 (Audience 2): Do some traditional compression methods (e.g. Huffman encoding) work for KV cache? If not, what are the specific challenges KV cache poses compared to other kinds of data?
A4: We didn’t specifically try Huffman encoding, but we experimented with feeding the original floating-point tensors to traditional compressors. Methods like gzip don’t work well on the KV cache because it is not sparse data.
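A quick way to see this is to run a generic byte-level compressor over dense float data; the tensor shape below is an arbitrary stand-in for real KV activations.

```python
import gzip
import numpy as np

# Dense (non-sparse) floating-point data barely compresses with gzip.
kv_like = np.random.randn(1024, 128).astype(np.float16)   # arbitrary stand-in shape
raw = kv_like.tobytes()
packed = gzip.compress(raw)
print(f"gzip compression ratio: {len(raw) / len(packed):.2f}x")  # typically close to 1x
```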
Q5 (Audience 3): How does CacheGen handle the challenges posed by sparse attention mechanisms (like DeepSeek)?
A5: Sparse attention is somewhat similar to H2O [1] compression, where certain tokens are dropped from the KV cache. These cache compression methods should still work with sparse attention, but they may need fine-tuning to optimize the compression ratio.
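For intuition, an H2O-style step drops low-importance tokens before any further compression; the interface and the 20% keep ratio below are illustrative assumptions, not the actual H2O or CacheGen code.

```python
import torch

def keep_heavy_hitters(kv, accumulated_attn, keep_ratio=0.2):
    """Keep only the tokens that received the most accumulated attention.

    kv:               (seq_len, dim) keys or values for one layer/head
    accumulated_attn: (seq_len,) attention mass each token has received
    """
    k = max(1, int(kv.shape[0] * keep_ratio))
    idx = torch.topk(accumulated_attn, k).indices.sort().values  # preserve token order
    return kv[idx]  # the pruned cache can then be quantized and encoded as usual
```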
Q6 (Host): Is there any potential for future improvements or applications, and what’s next for CacheGen?
A6: We are working on turning CacheGen into a production-ready tool and are involved in open-source projects aimed at speeding up large language model inference. We are also exploring how CacheGen can be applied in other settings, such as reducing the size of large language models.
[1] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. 2023. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. arXiv:2306.14048.