Advancing Security for Large Language Models with NVIDIA GPUs and Edgeless Systems

Edgeless Systems introduced Continuum AI, the first generative AI framework that keeps prompts encrypted at all times. It achieves this with confidential computing, combining confidential VMs with NVIDIA H100 GPUs and secure sandboxing.

The launch of this platform underscores a new era in AI deployment, where the benefits of powerful LLMs can be realized without compromising data privacy and security. Edgeless Systems, a Germany-based cybersecurity company that develops open-source software for confidential computing, is collaborating with NVIDIA to empower businesses across sectors to confidently integrate AI into their operations.

The confidential LLM platform isn’t just a technological advancement—it’s a pivotal step towards a future where organizations can securely utilize AI, even for the most sensitive data.

The Continuum technology has two main security goals: it protects user data, and it protects AI model weights, against the infrastructure, the service provider, and other parties. The infrastructure is the basic hardware and software stack that the AI app runs on, including the underlying cloud platform. In the case of ChatGPT, this would be Microsoft Azure. The service provider is the entity that provides and controls the actual AI app. In the case of ChatGPT, this would be OpenAI.

How Continuum works

Continuum relies on two core mechanisms: confidential computing and advanced sandboxing. Confidential computing is a hardware-based technology that keeps data encrypted even during processing. It also makes it possible to verify the integrity of workloads.

Confidential computing, powered by NVIDIA H100 Tensor Core GPUs and combined with advanced sandboxing technology, enables customers to protect user data and AI models. It does this by creating a secure environment that separates the infrastructure and service provider from the data and models. The technology also supports popular AI inference services, like NVIDIA Triton Inference Server.
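As a concrete illustration, the following Python sketch calls Triton's generate endpoint directly, the kind of inference API such a deployment fronts. The model name is a placeholder, and the text_input/text_output field names follow the common TensorRT-LLM convention; the actual field names vary by model.

```python
# Hedged example of a direct call to Triton's generate endpoint.
# The model name is a placeholder; text_input/text_output follow the
# common TensorRT-LLM convention and differ for other models.
import requests

resp = requests.post(
    "http://localhost:8000/v2/models/my_llm/generate",  # placeholder model
    json={"text_input": "What is confidential computing?", "max_tokens": 128},
    timeout=60,
)
print(resp.json().get("text_output"))
```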

Figure 1. A workflow showing how prompts and the corresponding responses remain encrypted as they travel to and from the AI model

Even with these security mechanisms in place, the AI code will likely come from a third party, which could accidentally or maliciously leak prompts, for example by writing them to disk or to the network in plaintext.

One solution is to review the AI code thoroughly. However, this is impractical due to the complexity of AI code and its frequent updates.

Continuum addresses this problem by running the AI code inside a sandbox on the confidential computing-protected AI worker. In general terms, a sandbox is an environment that prevents an application from interacting with the rest of the system. Continuum runs the AI code inside an adapted version of Google’s gVisor sandbox. This ensures that the AI code has no means to leak prompts and responses in plaintext. The only thing the AI code can do is receive encrypted prompts, query the accelerator, and return encrypted responses.

With this architecture in place, your prompts are even protected from the entity that provides the AI code. In simplified terms, in the case of the well-known ChatGPT, this means that you wouldn’t have to trust OpenAI (the company that provides the AI code) or Microsoft Azure (the company that runs the infrastructure).

Architecture

Continuum consists of two parts: the server side and the client side. The server side hosts the AI service and processes prompts securely. The client side verifies the server, encrypts prompts, and sends inference requests. Let’s dive deeper into the components, how they interact, and their respective roles.

The server side hosts the inference service. Its architecture includes two main components: the workers and the attestation service.

The worker node is central to the backend. It hosts an AI model and serves inference requests. The necessary inference code and model are provided externally by the inference and model owner. The containerized inference code, called AI code, runs in a secure environment.

Each worker is a confidential VM (CVM) running Continuum OS. This OS is minimal, immutable, and verifiable through remote attestation. Continuum OS hosts workloads in a sandbox and mediates network traffic through an encryption proxy.

The worker provides an HTTPS API to manage (start and stop) AI code containers. 
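The exact API surface isn't documented in this post, but a sketch conveys the idea. In the hypothetical Python snippet below, the endpoint paths, payload fields, port, and certificate file are illustrative assumptions, not Continuum's actual API.

```python
# Hypothetical sketch of driving a worker's container-management API.
# Endpoint paths, payload fields, and the pinned CA file are assumptions,
# not Continuum's documented interface.
import requests

WORKER_URL = "https://worker.example.com:8443"  # assumed worker address

def start_ai_code(image_ref: str) -> str:
    """Ask the worker to start an AI code container; returns a container ID."""
    resp = requests.post(
        f"{WORKER_URL}/containers",          # hypothetical endpoint
        json={"image": image_ref},
        verify="continuum-worker-ca.pem",    # pin the worker's TLS certificate
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["id"]

def stop_ai_code(container_id: str) -> None:
    """Ask the worker to stop and remove an AI code container."""
    resp = requests.delete(
        f"{WORKER_URL}/containers/{container_id}",  # hypothetical endpoint
        verify="continuum-worker-ca.pem",
        timeout=30,
    )
    resp.raise_for_status()
```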

AI code sandbox

The AI code, provided by the inference owner, runs in a gVisor sandbox. This sandbox isolates the AI code from the host, handling system calls in a userspace kernel and blocking network traffic to prevent data leaks.
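To make the isolation concrete, here is a minimal sketch of launching a container under gVisor, assuming Docker is configured with the runsc runtime and using a placeholder image name; Continuum's adapted gVisor setup will differ in its details.

```python
# Minimal sketch: run a container under gVisor's runsc runtime with
# networking disabled, approximating the isolation described above.
# Assumes Docker is set up with gVisor; the image name is a placeholder.
import docker

client = docker.from_env()

container = client.containers.run(
    "registry.example.com/ai-code:latest",  # placeholder image
    runtime="runsc",        # gVisor: syscalls handled by a userspace kernel
    network_disabled=True,  # no network: the code cannot leak plaintext
    detach=True,
)
print(container.id)
```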

Encryption proxy

Each AI code has an attached proxy container, which is its only connection to the outside world. The proxy manages prompt encryption on the client side. It decrypts incoming requests and sends them to the sandbox. In the opposite direction, it encrypts responses and sends them back to the user. The proxy supports various API adapters, such as OpenAI or Triton Generate.
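Conceptually, the proxy performs authenticated decryption on the way in and encryption on the way out. The sketch below uses AES-GCM to illustrate both directions; Continuum's actual protocol, key exchange, and message framing are not shown, so the shared key here is a stand-in for the key negotiated through the attestation service.

```python
# Conceptual sketch of the proxy's two directions using AES-GCM.
# The shared key is an assumption standing in for the negotiated key.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

shared_key = AESGCM.generate_key(bit_length=256)  # stand-in for negotiated key
aead = AESGCM(shared_key)

def decrypt_prompt(nonce: bytes, ciphertext: bytes) -> bytes:
    """Inbound: recover the plaintext prompt to hand to the sandboxed AI code."""
    return aead.decrypt(nonce, ciphertext, None)

def encrypt_response(plaintext: bytes) -> tuple[bytes, bytes]:
    """Outbound: encrypt the model's response before it leaves the worker."""
    nonce = os.urandom(12)  # AES-GCM needs a unique 96-bit nonce per message
    return nonce, aead.encrypt(nonce, plaintext, None)

# Round trip: the same key decrypts what it encrypted.
nonce, ct = encrypt_response(b"model response")
assert decrypt_prompt(nonce, ct) == b"model response"
```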

Attestation service

The attestation feature of CVMs ensures the integrity and authenticity of workers. This enables both the service provider and clients to verify the workers’ integrity and confirm that they are interacting with a benign deployment.

The attestation service (AS) is centrally managed. On the server side, the AS verifies each worker based on its attestation statement. On the client side, the AS provides a system-wide attestation endpoint and handles key exchanges for prompt encryption.

The AS runs in a CVM. During initialization, the service provider uses the Continuum CLI to establish trust by verifying the AS attestation report.
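At its core, attestation verification boils down to checking a signed hardware report against expected reference values. The sketch below shows only the measurement comparison; real verification, such as what the Continuum CLI performs, also validates the vendor certificate chain and the report signature. The field name and reference value are assumptions.

```python
# Conceptual core of attestation verification: compare the reported launch
# measurement to a known-good reference. Field name and value are placeholders;
# real verification also checks the report's signature and certificate chain.
import hmac

EXPECTED_MEASUREMENT = "c0ffee..."  # placeholder digest of the known-good OS image

def verify_report(report: dict) -> bool:
    """Accept a worker only if its launch measurement matches the reference."""
    measurement = report.get("launch_measurement", "")
    # Constant-time comparison avoids leaking how much of the digest matched.
    return hmac.compare_digest(measurement, EXPECTED_MEASUREMENT)
```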

Workflow

Figure 2 details how admins verify the attestation service’s integrity through the CLI. Upon successful verification, the admin sets the manifest using the CLI. Admins then interact directly with the workers, configuring the AI code through the worker API.

Workers register with the AS, which verifies their attestation reports. Verified workers receive inference secrets and can then serve inference requests.

Users interact directly with the AS and the workers or through a trusted web service. Users verify the deployment using the AS and set their inference secrets. Then they can send encrypted prompts to the service. The encryption proxy decrypts these prompts, forwards them to the sandbox, re-encrypts the responses, and sends them back to the user.
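Put together, a client's interaction could look like the following hedged sketch: verify the deployment, set up a key, then send an encrypted prompt. The URLs, JSON fields, and the elided key exchange are illustrative assumptions, not Continuum's actual wire protocol.

```python
# End-to-end client sketch of the workflow above. All URLs and JSON fields
# are assumptions; the real key exchange with the attestation service is elided.
import os
import requests
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

AS_URL = "https://attestation.example.com"    # hypothetical attestation endpoint
WORKER_URL = "https://inference.example.com"  # hypothetical inference endpoint

# 1. Verify the deployment (placeholder for real attestation verification,
#    which would validate the report's signature and measurements).
report = requests.get(f"{AS_URL}/attestation", timeout=30).json()
assert report.get("status") == "verified", "deployment failed attestation"

# 2. Encrypt the prompt with a key set up through the attestation service
#    (the exchange itself is elided; a random key stands in here).
key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(key).encrypt(nonce, b"Summarize this contract.", None)

# 3. Send the encrypted prompt; only the proxy inside the worker can decrypt it.
resp = requests.post(
    f"{WORKER_URL}/v1/infer",
    json={"nonce": nonce.hex(), "ciphertext": ciphertext.hex()},
    timeout=60,
)
print(resp.status_code)
```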

Figure 2. A workflow displaying interactions between the different components of Continuum and the end user

For more details, check out Continuum to stay ahead in the realm of enterprise-grade confidential AI.
