Paper: SCX: Stateless KV-Cache Encoding for Cloud-Scale Confidential Transformer Serving
Authors: Mu Yuan, Lan Zhang, Liekang Zeng, Siyang Jiang, Bufang Yang, Di Duan, Guoliang Xing
Presenter: Guorui Xu, Southeast University
Guest of Honor: Mu Yuan, The Chinese University of Hong Kong
Q: What do you think are the main challenges in achieving privacy-preserving LLM inference?
A: The core challenge is balancing efficiency and privacy protection. Cryptography-based approaches such as Homomorphic Encryption (HE) and MPC frameworks offer theoretically guaranteed security, but their communication and computational costs are so high that they apply only to models with several million parameters. When we scale to several billion or hundreds of billions of parameters, the efficiency is too low for real-world use. We also explored Trusted Execution Environments (TEEs) on CPUs and GPUs, but deploying the entire model within a TEE still yields insufficient efficiency at scale.
Q: Would it be possible to deploy only part of the large language model to the TEE while keeping the remainder in standard software?
A: That is precisely our design goal — to place the minimum necessary portion of the LLM within the TEE. In our evaluations, the efficiency bottleneck lies in memory movement between the CPU TEE and GPU memory. This bottleneck becomes significant as context length increases, such as to 10K or even 100K tokens, because the internal KV cache becomes too large to move efficiently.
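To make that scaling concrete, here is a rough back-of-the-envelope sketch of KV-cache size versus context length. The layer, head, and dimension figures are illustrative assumptions for a LLaMA-13B-class model, not SCX's exact configuration:

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, context_len, dtype_bytes=2):
    """Estimate KV-cache size: one key and one value vector per layer per token.

    The leading factor of 2 accounts for the separate key and value tensors;
    dtype_bytes=2 assumes fp16/bf16 storage.
    """
    return 2 * num_layers * num_heads * head_dim * context_len * dtype_bytes

# Illustrative figures assuming 40 layers, 40 heads, head_dim 128 (LLaMA-13B-class):
size_10k = kv_cache_bytes(40, 40, 128, 10_000)    # ~8.2 GB in fp16
size_100k = kv_cache_bytes(40, 40, 128, 100_000)  # ~82 GB in fp16
print(size_10k / 1e9, size_100k / 1e9)
```

The linear growth in context length is why moving the full cache between GPU memory and a CPU TEE becomes the dominant cost at 10K-100K tokens.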
Q: You mentioned reducing communication overhead via KV cache compression — is that correct?
A: Yes, KV cache compression is one key technique we use to address that bottleneck.
Q: Would it be possible to offload the confidential computation to a smart NIC with crypto accelerators?
A: One of our ongoing projects examines the use of a smart NIC for KV cache encoding and decoding, though it is currently focused on performance rather than privacy. The key technical challenge we have encountered is the memory bottleneck of the smart NIC, since its memory hierarchy is much more limited than that of a server.
Q: You mentioned a new insight into attacks. Could you elaborate?
A: We found that not only SCX but also most existing confidential LLM inference approaches are vulnerable to a new attack that we developed. The attack leverages statistical information from multiple rounds of user-cloud interactions within one session, because existing protection approaches use the same encoding keys throughout a single session. By collecting enough samples within a session, we can successfully recover some user inputs and outputs. Our ongoing work addresses this by making the internal hidden states use one-time keys: even within the same session, each forward inference uses a different key.
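The one-time-key idea can be sketched as a per-step key derivation from a session master key. HMAC-SHA256 with a step counter is a hypothetical stand-in here; the answer does not specify the actual key-derivation function used in the ongoing work:

```python
import hashlib
import hmac

def derive_step_key(session_key: bytes, step: int) -> bytes:
    """Derive a fresh one-time key for each forward pass from a session master key.

    Hypothetical sketch: any standard KDF would do; the point is that each
    inference step encodes hidden states under a distinct key, so statistics
    collected across rounds within a session no longer share a common key.
    """
    return hmac.new(session_key, step.to_bytes(8, "big"), hashlib.sha256).digest()

master = b"\x00" * 32  # placeholder session master key
k0 = derive_step_key(master, 0)
k1 = derive_step_key(master, 1)
assert k0 != k1  # each forward pass gets its own encoding key
```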
Q: Could local keys be secured by offloading to the TEE on the cloud side?
A: In our implementation, we use the cloud-side TEE. This design choice was largely driven by our collaboration with Huawei, whose CloudWheel infrastructure supports such a cloud-side TEE.
Q: Why is the cloud treated as the attacker in your threat model?
A: This is a classical threat model in the security community. The privacy concern primarily stems from business users. For example, at the start of this project, we encountered this requirement from government clients who did not fully trust any cloud service provider with sensitive data. Application developers also often do not want to share private datasets with the cloud.
Q: Based on the LLaMA 13B latency breakdown, SCX shifts the core challenge from secure computation to intra-host memory movement between CPU TEE and GPU. Would the most critical optimization lie in tweaking algorithms or upgrading data movement layers like PCIe and CXL?
A: I think the most important contribution of this paper is to show that this design direction is worth pursuing. We have already updated several versions of SCX in our ongoing work. In my opinion, the most critical optimization should lie in incremental algorithmic tuning, rather than relying on hardware upgrades.
Q: GPUs like the H100 already have confidential computing capability. What is your view on how SCX could influence future hardware design?
A: We argue that the hardware design should not make the entire GPU confidential or non-confidential; it should support a partial design, in which only part of the GPU memory has confidential computing capability, similar to how a CPU TEE works. We have implemented a simulation in which part of the H100 confidential computing memory serves as the user-side trusted region, and the remainder of the GPU memory is used for regular computation. The result is near-plaintext efficiency, and we tested models up to 700 billion parameters with negligible latency overhead.
Q: Does SCX support prefix caching across user sessions?
A: SCX supports KV cache reuse within the same user session, but not across different user sessions, because different sessions require different keys. If you are willing to relax your privacy protection requirements, you can reuse keys across sessions and enable prefix caching; this is a trade-off between privacy and efficiency.
Q: This paper targets transformer models that rely on the KV cache mechanism. Is SCX applicable to non-transformer architectures?
A: SCX applies only to transformer models. For other architectures, such as LSTM- or CNN-based models, traditional HE or MPC techniques remain applicable because these models are smaller in scale, which keeps latency acceptable. We focus on transformers because only transformer-scale models, those with hundreds of billions of parameters, face efficiency constraints that render traditional cryptographic approaches infeasible.
Q: What are the potential risks of adopting SCX at a larger scale?
A: The main limitation is the length of context. As context length increases, both the KV cache and the intermediate hidden states scale linearly, significantly increasing memory movement overhead between CPU TEE and GPU memory.
Q: If new attack techniques continue to emerge, does this imply that the cloud must continuously deploy new protection mechanisms? Is this overhead acceptable?
A: Each new protection technique must defend against all previous attack techniques, not just the newly discovered one, so there is no need to stack multiple mechanisms. In practice, these updates are infrequent—the last breakthrough in attacks in this space was roughly two years ago, so there is typically a one-year window to design and deploy updated protections.
Q: From the cloud side, does deploying SCX require significant effort?
A: Based on our collaboration with companies, the deployment effort is modest. We did not modify any hardware, and we implemented our algorithm within popular inference frameworks such as vLLM, so cloud providers can use their existing inference infrastructure directly.
Q: As agentic applications increasingly require users to upload sensitive context — such as emails, files, and long-term memory — to cloud models, does this introduce new challenges to the threat model?
A: Yes, that is a very good question. Most secure LLM inference work, including SCX, only considers the isolated inference process. However, agentic systems rely on numerous third-party APIs, such as search engines, RAG, and external tools. If our protocol encodes inference tokens, they cannot be directly interpreted by external APIs, such as Google Search. Currently, almost no secure protocol supports third-party API calls. The only theoretical solution is for all API providers to adopt a single encoding and decoding protocol, which is highly impractical. Securing the agentic loop would require securing not only LLM inference but also search, memory, and all external services within the loop.
Q: What was the most challenging part of this work, and what took the most time?
A: The most time-consuming part was the theory, specifically the iterative algorithm design. Every time we identified an algorithm that performed well with respect to privacy, we found its efficiency was too low, and we had to redesign. Then, when efficiency was high, we would identify a gap on the privacy side. This loop lasted about three to four months, during which we discussed algorithm design every day, then worked through proofs and efficiency evaluations for each candidate, until we arrived at an SCX design that performed well on both dimensions.
Q: How do you ensure mathematical equivalence between the SCX output and the plaintext-inference output?
A: Our key design principle is that encoding and decoding errors must not accumulate across operators. Once an error occurs in the transformed data space, we immediately transfer the affected result to the user side for decoding and recovery of the original data, then re-encode and resend it to the GPU. This prevents error accumulation. In the paper, we present a formal theorem with an appendix containing a proof, and we empirically show that the absolute difference between SCX output and plaintext output is negligible—attributable solely to hardware floating-point precision.
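The decode-and-re-encode principle can be illustrated with a toy numeric sketch. The additive mask below is a hypothetical stand-in for SCX's actual encoding (which the paper defines formally); the point is only that an operator that does not commute with the encoding must be computed on the trusted side and re-encoded, rather than applied in the encoded space:

```python
import numpy as np

def encode(x, key):
    # Toy stand-in for SCX's KV-cache encoding: an invertible additive mask.
    return x + key

def decode(y, key):
    return y - key

def nonlinear_op(x):
    # A nonlinearity (ReLU here) that does not commute with the encoding.
    return np.maximum(x, 0.0)

key = np.array([1.0, -2.0, 0.5, 3.0])
x = np.array([-1.0, 2.0, -0.5, 1.0])

# Wrong: applying the nonlinearity directly in the encoded space corrupts the
# result, and the error would compound if carried into later operators.
corrupted = decode(nonlinear_op(encode(x, key)), key)

# The principle the answer describes: decode on the trusted side, apply the
# operator in plaintext, then re-encode under a fresh key before returning
# the result to the GPU. No error ever accumulates across operators.
plain = decode(encode(x, key), key)
fresh_key = np.array([2.0, 1.0, -1.0, 0.25])
correct = decode(encode(nonlinear_op(plain), fresh_key), fresh_key)

assert np.allclose(correct, nonlinear_op(x))       # exact plaintext equivalence
assert not np.allclose(corrupted, nonlinear_op(x))
```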
Q: For new students who want to get started in this area, what is your advice?
A: This area requires both deep theory and strong system implementation skills, which makes it difficult to enter. In my experience, collaboration with industry is essential—not only for access to computing resources but also for understanding real application requirements. Publishing in top venues like SIGCOMM or OSDI today is increasingly resource-sensitive: you need hardware, real-world service experience, and data that often come from company partnerships. I would encourage new students to pursue industry internships during their PhD and maintain close contact with industry to remain grounded in real-world problems.