
El Reg's essential guide to deploying LLMs in production

(2025/04/22)


Hands On You can spin up a chatbot with Llama.cpp or Ollama in minutes, but scaling large language models to handle real workloads – think multiple users, uptime guarantees, and not blowing your GPU budget – is a very different beast.

While a model might run efficiently on your PC with less than 4 GB of memory, the resources required to serve a model at scale are quite different. Deploying it in a production environment to handle numerous concurrent requests can require 40GB of GPU memory or more.

Just looking to get your hands dirty with LLMs? Our hands-on guide will get you [1]up and running in less than 10 minutes.

In this hands-on guide, we'll take a closer look at the avenues for scaling your AI workloads from local proofs of concept to production-ready deployments, and walk you through the process of deploying models like Gemma 3 or Llama 3.1 at scale.

Building with APIs

There are plenty of ways to integrate LLMs into your code base, but when it comes to deploying them in production, we strongly recommend using an OpenAI-compatible API.

This gives you flexibility to keep up with the rapidly changing model landscape. Models considered state of the art just six months ago are already ancient history. And since ChatGPT kicked off the AI boom in 2022, OpenAI's API interface has become the de facto standard for connecting apps to LLMs.

This approach also means you can start building applications that leverage LLMs using whatever's available. For example, you can start building with Mistral 7B in Llama.cpp on your notebook and then swap it out for Mistral AI's API servers when you're ready to deploy it in production. You're not stuck with one model, inference engine, or API provider.
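To illustrate the point, here's roughly what that portability looks like in practice. The endpoint, key, and model name below are placeholders for whatever your local server or API provider actually exposes; in principle, only those values need to change when you swap backends.

# Placeholder endpoint, key, and model name -- swap in your server or provider's values
export OPENAI_BASE_URL="http://localhost:8000/v1"   # e.g. a local OpenAI-compatible server
export OPENAI_API_KEY="sk-local-dev-key"

curl "$OPENAI_BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model": "mistral-7b-instruct", "messages": [{"role": "user", "content": "Hello"}]}'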


Speaking of cloud-based inference services, this is usually the most capex-friendly way to scale up your AI deployments. There's no hardware to manage and no models to configure, just an API to point your app at.


In addition to the major model builders' API offerings, a growing pack of AI infra startups now offer inference-as-a-service for open-weight models.

However, not all of these providers are created equal. Some, like SambaNova, Cerebras, and Groq, use boutique hardware or techniques like [5]speculative decoding to speed up inference, but offer a smaller catalog of models.


Others, like Fireworks AI, support deploying custom fine-tuned models using Low Rank Adaptation (LoRA) adapters — a concept we've [7]explored in depth previously. Given how diverse the AI ecosystem has become over the past two years, you'll want to do your research before committing to one provider over another.

So you want (or have) to do it yourself

If cloud-based approaches are off the table — whether it's for privacy or regulatory reasons or because your company already ordered a bunch of GPU servers — then you'll be looking at deploying on-prem. In that case, things can get tricky in a hurry. Let's start with some of the more common questions you're bound to run into.

What model should I use? This is going to depend on your use case. A model used primarily for a customer service chatbot is going to have different requirements than one used for retrieval augmented generation, or as a code assistant.

If you haven't already settled on a model, it's worth spending some time with API providers until you're confident you've found a model that will meet your needs.

How much hardware do I really need? This is a big question. GPUs are expensive and still in short supply. Thankfully, your model can tell you an awful lot about what it's going to take to run it.


Bigger models obviously need more hardware. But how much more? Well, you can get a rough estimate of the minimum GPU memory required by multiplying the parameter count (in billions) by 2GB.

For example, if you want to run a model like Llama 4 Maverick, you'll need a rough minimum of 800GB of GPU memory just to hold the weights. So you're likely looking at Nvidia HGX H200, B200, or AMD MI300X boxes from the get go. But if you're already getting good results with something like Phi 4 14B, then you may be able to get away with a pair of Nvidia L40S cards.

Note that this only applies to models trained at 16-bit precision. For 8-bit models, like DeepSeek-V3 or R1, you'll need just 1GB per billion parameters. And if you're willing to use model compression techniques like quantization, which trade quality for size, you can get this down to 512MB per billion parameters.

However, this is only a lower limit. In most cases, you'll need a fair bit more memory if you plan to serve that model to more than one person at a time. That's because in addition to the model's weights, you also have to account for the key-value cache. You can think of it like the model's short-term memory, and the more users and details it needs to keep track of, the more HBM or GDDR you'll need to budget for.
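If you want to sanity check those numbers yourself, a back-of-the-envelope calculation is all it takes. The snippet below just expresses the rule of thumb above as a one-liner; the parameter count and bytes-per-parameter values are examples, not a sizing tool, and it ignores the KV cache and runtime overhead entirely.

# Rough rule of thumb for weight memory only -- ignores KV cache and runtime overhead
PARAMS_B=14        # e.g. Phi 4 at 14 billion parameters
BYTES_PER_PARAM=2  # 2 for BF16/FP16, 1 for FP8, 0.5 for 4-bit quantization
awk -v p="$PARAMS_B" -v b="$BYTES_PER_PARAM" 'BEGIN { printf "~%.0f GB for the weights alone\n", p * b }'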

Nvidia's [9]support matrix is a good place to start, as it offers some general guidance on which and how many GPUs you'll need to run many of the more popular open models.

Once you've sized your hardware to your model, you'll also need to think about redundancy. While you can certainly configure a single GPU node to run LLMs for business, what happens if it fails? So when sizing your deployment, you'll likely be looking at two or more systems for failover and load balancing.

How should I deploy my model? There are several ways to deploy and serve LLMs in production. They can run on bare metal with old-fashioned load balancers distributing the load across them. You can deploy them in virtual machines, or you can spin up containers in Docker or Kubernetes and manage everything that way.

In this guide, we'll be looking at deploying LLMs using Kubernetes, as it neatly abstracts away much of the complexity associated with large scale deployments by automating container creation, networking, and load balancing.

[10]

How to spread an inference workload across three GPU nodes, to distribute the load and mitigate outages or downtime across the cluster

That's not to say Kubernetes is easy. It's a whole can of worms unto itself, but it's one that many enterprises have already adopted and understand fairly well.

This is one of the reasons we've seen Nvidia, Hugging Face, and others gravitate toward containerized environments with their respective Nvidia [11]Inference Microservices (NIMs) and adorably named [12]HUGS (Hugging Face Generative AI Services), which have been preconfigured for common workloads and deployments. We'll take a look at how to use these a little later in this piece.

Which inference engine should I use? There's no shortage of inference engines for running models. We frequently use Ollama and Llama.cpp in our coverage as they'll work on just about anything, even a Raspberry Pi.

However, if your goal is to serve models at scale, then we tend to see libraries like vLLM, TensorRT LLM, SGLang and even PyTorch used more often.

For the purposes of this tutorial, we'll be looking at deploying models using [13]vLLM, as it supports a wide selection of popular models and offers broad compatibility across Nvidia, AMD, and other hardware. However, if you prefer to use something like SGLang or TensorRT LLM, most of the principles described here still apply.

Preparing our Kubernetes environment

Compared to your typical Kubernetes environment, getting pods to talk to your GPUs requires some additional drivers and dependencies. The process for setting this up is going to look different for AMD hardware than it would for Nvidia.

To keep things simple, we'll be deploying K3S in a single-node configuration. The basic steps are largely the same for multi-node environments; you'll just need to ensure the dependencies are satisfied on each GPU worker node, and you may need some tweaks to the storage configuration.

[14]

In this guide, we'll be deploying vLLM on a simplified single-node Kubernetes cluster

Note: This is not intended to be an exhaustive guide on container orchestration and assumes some familiarity with Kubernetes concepts. Our goal is to provide a solid foundation for deploying inference workloads in a production-friendly manner. So while we will be walking through the steps to configure a GPU-accelerated Kubernetes node, your environment may look slightly different.

Prerequisites

For this guide you'll need:

A server or workstation with at least one supported AMD or Nvidia GPU board.

A fresh install of Ubuntu 24.04 LTS in order to keep dependencies manageable.

Nvidia dependencies

Setting up an Nvidia-accelerated K3S environment is relatively straightforward, but does involve a handful of dependencies.

We'll start by installing the CUDA drivers fabric manager and headless server drivers. In our case, the latest release available in the LTS repos is 570, so we'll go with that. We'll also install Nvidia's server utils as they are helpful for debugging any driver issues that crop up.

sudo apt install -y cuda-drivers-fabricmanager-570 nvidia-headless-570-server nvidia-utils-570-server

Next we'll add Nvidia's container runtime repo and pull down the latest version by running the following commands and then rebooting.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-runtime
sudo reboot

Spinning Up K3S

With most of our dependencies set, we can move on to deploying K3S. For a single-node deployment, this can be achieved by running:

curl -sfL https://get.k3s.io | sh -

Once the process completes, we can check that our node is up and ready by running:

k3s kubectl get nodes

If everything worked correctly, you should see something like this:

NAME           STATUS   ROLES                  AGE     VERSION
kube-nv-prod   Ready    control-plane,master   3h45m   v1.32.3+k3s1

If you're running an AMD Instinct box, you can move on to the next section, but if you're using Nvidia there are a few more tweaks we need to make.

By default, K3S will detect that the Nvidia container runtime is installed and configure itself accordingly. However, it's best to double check by grepping K3S's containerd config. If everything worked properly, you should see the nvidia-container-runtime listed in the output.

sudo grep nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml

If it's missing (for instance, because the container runtime was installed after K3S), restart the service so it regenerates the config:

sudo systemctl restart k3s

Deploying the GPU device plugin

With K3S set up, we just need to give Kubernetes permission to allocate GPU resources to our pods. To do this, we need to deploy the respective AMD or Nvidia device plugin daemonset to the cluster.

For Nvidia, this can be achieved by running:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml

AMD users, meanwhile, can install the device plugin by running the following two commands:

kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-dp.yaml
kubectl create -f https://raw.githubusercontent.com/ROCm/k8s-device-plugin/master/k8s-ds-amdgpu-labeller.yaml

Note: All the commands we're showing here can be run directly on the server, but if you prefer to access your cluster remotely via kubectl, you can find the K3S kubeconfig file at /etc/rancher/k3s/k3s.yaml

After about a minute, we can check that our GPU has been detected by running:

kubectl describe node

Having trouble getting your GPUs to show up? Check out AMD and Nvidia's respective support docs at the links below:

[15]NVIDIA device plugin for Kubernetes

[16]AMD GPU Device Plugin for Kubernetes

If everything worked correctly, you should now see amd.com/gpu: 1 or nvidia.com/gpu: 1 under the Allocatable header.
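If you'd rather not scroll through the full node description, a quick filter does the job. This is just a convenience wrapper around the same kubectl describe output; it catches either vendor's resource name.

# Quick check that the device plugin registered your GPUs (matches both vendors)
kubectl describe node | grep -E 'nvidia.com/gpu|amd.com/gpu'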

With that, your GPU nodes should be ready for deployment and you can move on to Deploying Gemma 3 4B in vLLM (Nvidia or AMD) or Getting started with NIMs (Nvidia).

Deploying Gemma 3 4B in vLLM

To get started we'll be looking at how to deploy Gemma 3 4B using vLLM on an Nvidia L40S or AMD MI210-class part, as well as how you'd tweak your deployment to run on a larger eight-GPU system.

To get started, let's take a look at the manifest file.

vllm-openai-gemma.yaml (Nvidia)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai
  labels:
    app: vllm-openai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
    spec:
      runtimeClassName: nvidia
      hostIPC: true
      containers:
      - name: vllm-openai
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "google/gemma-3-4b-it"
        - "--served-model-name"
        - "Gemma 3 4B"
        - "--disable-log-requests"
        - "-tp"
        - "1"
        - "--max-num-seqs"
        - "8"
        #- "--max_num_batched_tokens"
        #- "16000"
        - "--max_model_len"
        - "16000"
        - "--api-key"
        - "$(API_KEY)"
        ports:
        - containerPort: 8000
        env:
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-api-key
              key: API_KEY
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: HUGGING_FACE_HUB_TOKEN
        volumeMounts:
        - name: huggingface-cache
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
        resources:
          limits:
            nvidia.com/gpu: 1
            cpu: "12"
            memory: 20G
          requests:
            nvidia.com/gpu: 1
            cpu: "8"
            memory: 10G
      volumes:
      - name: huggingface-cache
        hostPath:
          path: /root/.cache/huggingface
          type: Directory
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "10Gi"

As manifests go, this one is pretty standard. It spins up a single pod running the Gemma 3 4B model in vLLM on a single Nvidia GPU — we've included an AMD-specific manifest below, but the same basic concepts apply to both.

With that said, there are a few elements — other than the vLLM arguments, which we'll get to in a bit — worth highlighting.

Replicas: The number of replicas dictates how many instances of the model we want to deploy. If you've got enough GPUs, you could have several instances of vLLM load balanced across a single server. For smaller models, like Gemma 3 4B, this is actually what AMD recommends.

Resources: Defines how much CPU, GPU, and system memory should be allocated to each pod. These need to be adjusted based on the size of your model.

If your LLM can't fit on a single GPU, you'll want to increase the number of GPUs it should request. If, for example, you wanted to run DeepSeek R1 on an H200 or MI300X box with eight GPUs, you'd set both the GPU limit and request to 8.

vLLM configuration

Now, let's move on to the vLLM config itself. As you can see from the manifest, we've passed quite a few arguments to vLLM describing how we want to run the model. Some of these are required, like setting the model location — in this case it's pulling from Hugging Face — as well as the advertised model name, flags to disable log-requests, and the API key which we will pass along via a Kubernetes secret in a bit.

The rest will depend heavily on your specific hardware and use case. So let's dig into each.

Check out vLLM's support docs [17]here for more information on tuning your inference deployments.

tensor-parallel-size: Tensor parallelism enables us to distribute both the model weights and computational load across multiple GPUs.

This number should generally match the number of GPUs under the resources section discussed above. In our case, since we're demonstrating on a single-GPU node, this will be set to 1, which means it's effectively disabled.
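As a concrete illustration of how that scales up, here's roughly what the serve command might look like with a model spread across eight GPUs. Treat it as a sketch: the model shown (Llama 3.1 70B) is just an example, and in the Kubernetes manifest you'd also bump the nvidia.com/gpu (or amd.com/gpu) limit and request to 8 to match.

# Illustrative only: eight-way tensor parallelism for a model too big for one GPU
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --served-model-name 'Llama 3.1 70B' \
  --tensor-parallel-size 8 \
  --max-num-seqs 8 \
  --max-model-len 16000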

max-num-seqs: Sets the upper limit for how many prompts or requests vLLM should process in a single batch. So if you set this to 8, vLLM will process at most 8 concurrent requests at once.

It can be more computationally efficient to set this to a higher number, but it comes with the tradeoffs of higher memory consumption and longer wait times for responses to start streaming in.

So, if you know your LLM will be mostly handling a handful of requests at any given moment, it can actually be better to set this lower to reduce latency and memory consumption.

max_num_batched_tokens: Behaves similarly to --max-num-seqs but rather than concurrent requests, caps the maximum number of tokens vLLM should process at any one time.

The proper setting here will depend on whether you're optimizing for latency or throughput.

A smaller batch size puts more computational load on the system but also reduces memory consumption and latency. A bigger batch size, meanwhile, lets more tokens be processed at once, allowing the system to serve more users, but potentially increasing their wait time.

If you're unsure how to set this, you can actually omit it from the config and vLLM will automatically adjust the batch size based on the available resources.

max_model_len: This defines the maximum number of tokens each sequence should keep track of, and goes hand in hand with the --max_num_seqs we set earlier.

You can think of this a bit like a workbench. The --max_model_len describes how big each workbench is — or rather can be — and the --max_num_seqs describe how many workbenches are available for folks to use at any given moment. For a given space (memory) you can either have fewer bigger workbenches or loads of tiny ones.

vLLM will, by default, set this to the maximum context window supported by the model. On older models this was usually small, but these days context windows routinely exceed 128,000 tokens, and some are now pushing a million-plus. Those tokens can take up a lot of memory, so it's often necessary, or even desirable, to set the --max_model_len to a smaller value.
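To get a feel for why trimming --max_model_len matters, it helps to see roughly how the KV cache grows. The arithmetic below uses illustrative layer, head, and dimension counts rather than Gemma 3's actual config, and it computes a worst-case upper bound; in practice, vLLM pre-allocates a KV cache pool sized to the GPU memory it's allowed to use rather than reserving this full amount.

# Back-of-envelope KV cache sizing with made-up (but plausible) model dimensions
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; BYTES_PER_ELEM=2   # BF16
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM))  # x2 for keys and values
echo "KV cache per token: ${PER_TOKEN} bytes"
# Worst case with --max_model_len 16000 and --max-num-seqs 8
echo "Upper bound: roughly $((PER_TOKEN * 16000 * 8 / 1024 / 1024 / 1024)) GiB"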

The graphic below details how these parameters impact memory requirements:

[18]

How max_num_batched_tokens, max_model_len, and max_num_seqs impact GPU memory

vllm-openai-gemma.yaml (AMD)

As we mentioned earlier, if you're deploying workloads on AMD Instinct accelerators, the manifest file is going to look a little different.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-openai
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-openai
  template:
    metadata:
      labels:
        app: vllm-openai
    spec:
      hostIPC: true
      containers:
      - name: vllm-openai
        image: rocm/vllm:instinct_main
        securityContext:
          seccompProfile:
            type: Unconfined
          capabilities:
            add:
            - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args:
        - |
          vllm serve google/gemma-3-4b-it \
            --served-model-name 'Gemma 3 4B' \
            --disable-log-requests \
            -tp 1 \
            --max-num-seqs 8 \
            --max_model_len 16000 \
            --api-key $(API_KEY)
        ports:
        - containerPort: 8000
        env:
        - name: VLLM_USE_TRITON_FLASH_ATTN
          value: "0"
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: vllm-api-key
              key: API_KEY
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: HUGGING_FACE_HUB_TOKEN
        volumeMounts:
        - name: huggingface-cache
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
        resources:
          limits:
            amd.com/gpu: "1"
            cpu: "12"
            memory: "20Gi"
          requests:
            amd.com/gpu: "1"
            cpu: "8"
            memory: "10Gi"
      volumes:
      - name: huggingface-cache
        hostPath:
          path: /home/tobiasmann/.cache/huggingface
          type: Directory
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "10Gi"

Note: This manifest is specific to running vLLM on Instinct accelerators. If you'd hoped to test things out on AMD workstation cards like the W7900, you'll need to build a compatible vLLM container from source by following the guide [19]here, and then push it to your container registry of choice.

Tuning vLLM

While Gemma 3 4B is a pretty small model, requiring a little over 8GB of vRAM to hold its weights at their native BF16 precision, trying to use the manifest we just looked at on a 24GB Nvidia L4 will likely result in an out-of-memory (OOM) error.

If that's the case for you, you'll need to either lower --max_num_seqs, --max_model_len, or --max_num_batched_tokens, or some combination of the three.

These parameters can have a big impact on performance and user experience, so let's take a look at two different scenarios to see how you tune them differently.

Scenario 1: Corporate summarization assistant

[20]

How to set up a summarization assistant

Let's say you're building a genAI application to help search and summarize large documents for a team of 12. The likelihood that you'll have more than two people running a summarization task simultaneously will be pretty low, so we can set --max_num_seqs to something like two.

However, because these documents are so big, you'll want to set --max_model_len and --max_num_batched_tokens large enough that the document fits within the context window without getting cut off. If your largest document is around 15,000 words, then you might want to set your --max_model_len to 32,000 to give yourself a buffer — remember that tokens represent not only whole words but punctuation marks and word fragments too.

Intuitively, you might think that if you have two 32,000 token sequences, you'd want to set your --max_num_batched_tokens to 64,000. However, that's a worst-case scenario, since not every document is going to be 15,000-plus words. For example, if one user summarized a 10,000-word doc and the other a 3,000-word doc, you wouldn't get anywhere close to the cap. Thankfully, vLLM takes a lot of the guesswork out of setting this parameter. If --max_num_batched_tokens is left unset, it'll automatically rightsize itself based on the available memory.
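Expressed as vLLM flags, the summarization setup might look something like the sketch below, shown in the vllm serve CLI form used in the AMD manifest; in the Nvidia manifest, the same flags go in the args list. The numbers are examples from the scenario above, not a tuned config.

# Scenario 1 sketch: few users, long documents
vllm serve google/gemma-3-4b-it \
  --served-model-name 'Gemma 3 4B' \
  --max-num-seqs 2 \
  --max-model-len 32000
# --max-num-batched-tokens left unset so vLLM can size it to the available memory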

Scenario 2: Customer service chatbot

[21]

For a customer service chatbot, prioritize a larger number of smaller sequences to maximize the number of concurrent users

Now, let's imagine you're building a customer service chatbot to help users learn about your product or overcome common challenges. Your parameters are going to look quite a bit different from the summarization scenario.

For one, the prompts and responses are going to be much shorter, but you'll likely be serving a larger number of concurrent users. In this case it might make sense to have a large --max_num_seqs like 16, 32 or more but a smaller --max_model_len of 1,024 or 2,048.

And again here, we can let vLLM figure out how to set --max_num_batched_tokens for us.
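The equivalent sketch for the chatbot scenario flips those priorities: lots of short sequences rather than a couple of long ones. As before, the numbers are illustrative.

# Scenario 2 sketch: many users, short exchanges
vllm serve google/gemma-3-4b-it \
  --served-model-name 'Gemma 3 4B' \
  --max-num-seqs 32 \
  --max-model-len 2048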

Check out the vLLM docs for more information on benchmarking [22]here.

Benchmarking

Depending on your specific use case, it'll likely be prudent to benchmark a couple of different configurations until you find one that achieves the desired combination of overall throughput, time-to-first-token (how long folks have to wait for the chatbot to start responding), and inter-token latency (how quickly the rest of the answer generates).
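vLLM's own benchmark scripts (linked above) are the right tool for proper measurements, but once your endpoint is up and reachable (we'll get there shortly), a crude concurrency smoke test can be as simple as a shell loop. The domain and key below are the same placeholders used later in this guide.

# Crude smoke test, not a real benchmark: fire 8 concurrent requests and time them
export VLLM_API_KEY="TOP_SECRET_KEY_HERE"
time (
  for i in $(seq 1 8); do
    curl -s -o /dev/null -w "request ${i}: %{time_total}s\n" \
      http://YOUR_DOMAIN_NAME_HERE/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $VLLM_API_KEY" \
      -d '{"model": "Gemma 3 4B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 64}' &
  done
  wait
)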

Final preparations: With our vLLM manifest tuned to our liking, we can move on to spinning up vLLM. But first, we'll need to generate two secrets. One is for our Hugging Face token — Gemma 3 is a gated model, so don't forget to [23]request access to the repo page first — and the second you'll use to access the vLLM API server later.

This can be achieved by running the following two commands, swapping out HUGGING_FACE_TOKEN_HERE and TOP_SECRET_KEY_HERE for your token and key.

kubectl create secret generic hf-token --from-literal=HUGGING_FACE_HUB_TOKEN=HUGGING_FACE_TOKEN_HERE
kubectl create secret generic vllm-api-key --from-literal=API_KEY=TOP_SECRET_KEY_HERE

Finally, you'll want to make sure the Hugging Face cache directory exists:

mkdir -p /root/.cache/huggingface

Note: To keep things simple, we're using bind mounts, but there's nothing stopping you from using NFS, a persistent volume store, or some other storage interface to maintain the cached Safetensor files from Hugging Face. You could also just remove that particular bind mount entirely. Just remember that if you do, vLLM will need to re-download them each time a new pod is spun up.

Spinning up vLLM

With that out of the way, we can spin up Gemma 3 4B in vLLM by running:

kubectl apply -f vllm-openai-gemma.yaml

We can then check on the deployment by running a get pods command in kubectl.

kubectl get pods

You should see a vllm-openai-... pod with a status of "ContainerCreating." Once it shows as "Running," we can check the vLLM logs to see whether there were any problems.

kubectl logs -l app=vllm-openai -f

After a few minutes, you should see:

INFO: Application startup complete.

Ingress and load balancing

With the server up and running, you can now configure your Kubernetes ingress and load balancer as you would with any other container deployment. How this is done is going to depend on your specific Kubernetes environment, which ingress controller and load balancer you're using, and your security policies.

The basic idea here is that Kubernetes exposes a single API address for your application to point at, and automatically balances the load across all available GPU nodes.

If you've been following along and just want to make sure the vLLM server is working as intended, you can spin up a quick-and-dirty ingress service by running the commands below. Note that if you are going to replicate this, your front-end application needs to be running on the same subnet behind your network's DMZ, as it's served over standard HTTP. With those precautions out of the way, let's get into it.

We'll start by creating a new ClusterIP service. Create another YAML file called vllm-openai-svc.yaml containing the following:

apiVersion: v1
kind: Service
metadata:
  name: vllm-openai-svc
  namespace: default
spec:
  type: ClusterIP
  selector:
    app: vllm-openai
  ports:
  - port: 8000
    targetPort: 8000

Then, to create the service, we'll run:

kubectl apply -f vllm-openai-svc.yaml

Next we'll load our ingress configuration, creating a separate YAML file called vllm-openai-ingressroute.yaml containing the following, replacing YOUR_DOMAIN_NAME_HERE with the domain you plan to use.

apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: vllm-openai
  namespace: default
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`YOUR_DOMAIN_NAME_HERE`)
    kind: Rule
    services:
    - name: vllm-openai-svc
      port: 8000
      sticky:
        cookie:
          name: VLLMSESSION
          secure: true
          httpOnly: true

We can then apply it by running:

kubectl apply -f vllm-openai-ingressroute.yaml

You can now update your local DNS server so that the domain you set earlier points at your Kubernetes node.

Modifying /etc/hosts

If you don't have access to your local DNS server (usually handled as part of the router config or as a standalone server), you can modify your local machine's /etc/hosts file in Linux to point the domain at your Kubernetes node's IP.

sudo nano /etc/hosts

Append the following, then save and exit.

NODE_IP YOUR_DOMAIN_NAME_HERE

Testing it out:

With everything configured, we can now test that the server is running by running the following, swapping the domain shown for your own:

export VLLM_API_KEY="TOP_SECRET_KEY_HERE"

curl -i http://api.local.rambler.ink/v1/models \
  -H "Authorization: Bearer $VLLM_API_KEY"

If everything worked, you should see something along the lines of:

HTTP/1.1 200 OK
Content-Length: 481
Content-Type: application/json
Date: Wed, 16 Apr 2025 20:06:56 GMT
Server: uvicorn
Set-Cookie: VLLMSESSION=48de0c44ee42ce61; Path=/; HttpOnly; Secure

{"object":"list","data":[{"id":"Gemma 3 4B","object":"model","created":1744834016,"owned_by":"vllm","root":"google/gemma-3-4b-it","parent":null,"max_model_len":16000,"permission":[{"id":"modelperm-e17493c83c8047c1b8ce3b082e4c4a61","object":"model_permission","created":1744834016,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Getting started with NIMs

The sheer number of levers and knobs you need to pull and turn to achieve optimal throughput or latency is one of the reasons why Nvidia, Hugging Face, and others have gravitated toward pre-baked model containers that require little to no configuration to get up and running.

Nvidia calls these Nvidia Inference Microservices, or NIMs for short, while Hugging Face calls its version of these containers HUGS. These microservices aren't free. You can play with NIMs in a dev environment, but if you want to deploy them in production, you'll need an AI Enterprise license that'll set you back $4,500 a year per GPU, or $1 per hour per GPU in the cloud.

However, if you're already paying for said license, they're a no-brainer.

We'll go over the basics of deploying a NIM here, but you'll definitely want to check out Nvidia's [24]docs for specifics on how to tune your configuration to best suit your Kubernetes environment.

Grabbing the dependencies

Deploying NIMs on our Kubernetes cluster requires a few additional dependencies, namely Helm and the Nvidia GPU Operator. To install Helm, we can simply run:

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
  && chmod 700 get_helm.sh \
  && ./get_helm.sh

We can then add the Helm repo for the GPU Operator and install it:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
  && helm repo update

helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator \
  --version=v25.3.0

Adding the NGC Repo and API key

Before you can download any NIMs, you'll need to generate an NGC API key by following the instructions [25]here, and add it to your .bashrc or .zshrc file.

export NGC_API_KEY=NGC_API_KEY_HERE
echo "export NGC_API_KEY=NGC_API_KEY_HERE" >> ~/.bashrc

or

echo "export NGC_API_KEY=NGC_API_KEY_HERE" >> ~/.zshrc

Grab the Helm chart and add your customizations

Next we'll download the NIM LLM Helm chart. We're using version 1.7.0, but you can find the latest version in Nvidia's NGC catalog.

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.7.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY

Next we'll add our NGC registry credentials and API key as secrets on our Kubernetes cluster.

kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_API_KEY

Finally we'll create a configuration file for our deployment, which in this case is Llama 3.1 8B, and save it as custom-values.yaml.

image:
  repository: "nvcr.io/nim/meta/llama3-8b-instruct"
  tag: 1.0.3
model:
  ngcAPISecret: ngc-api
persistence:
  enabled: false
imagePullSecrets:
  - name: ngc-secret

Spin up the NIM:

helm install my-nim nim-llm-1.7.0.tgz -f custom-values.yaml

After a few minutes, your NIM pod should show as running. We can then test it by forwarding the container port to our machine.

kubectl port-forward service/my-nim-nim-llm 8000:http-openai

Then, in a separate shell, we can test it by running:

curl -i http://localhost:8000/v1/models

If everything works, you should see something along the lines of:

HTTP/1.1 200 OK
date: Thu, 17 Apr 2025 21:40:52 GMT
server: uvicorn
content-length: 477
content-type: application/json

{"object":"list","data":[{"id":"meta/llama3-8b-instruct","object":"model","created":1744926052,"owned_by":"system","root":"meta/llama3-8b-instruct","parent":null,"permission":[{"id":"modelperm-47170d15fee9430eb42deda48f0f17b0","object":"model_permission","created":1744926052,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Of course, you'll want to configure a proper ingress service like we did earlier for vLLM, but this should at least give you some idea of how NIMs can be deployed in production.

Summing up

Regardless of which route you take to scaling your AI workloads in production, it's important to prioritize flexibility without compromising on resiliency and security.

If you enjoyed this story, you might like our other hands-on AI coverage:

[26]From RAGs to riches: A practical guide to making your local AI chatbot smarter

[27]Everything you need to know to start fine-tuning LLMs in the privacy of your home

[28]Cheat codes for LLM performance: An introduction to speculative decoding

[29]A quick guide to tool-calling in large language models

[30]A friendly guide to containerization for AI work

[31]Everything you need to get up and running with MCP – Anthropic's USB-C for AI

It's also worth noting that inference, while an essential piece of the AI puzzle, is one of many, and LLMs on their own are only so useful.

As we've previously [32]discussed , building effective AI tools may require multiple technologies and approaches, including fine-tuning, retrieval augmented generation, and, usually, a good bit of data prep.

The Register aims to bring you more on using LLMs and other AI technologies – without the hype – soon. We want to pull back the curtain and show how this stuff really fits together. If you have any burning questions on AI infrastructure, software, or models, we'd love to hear about them in the comments section below. ®

Editor's Note: The Register was provided an RTX 6000 Ada Generation graphics card by Nvidia, an Arc A770 GPU by Intel, and a Radeon Pro W7900 DS by AMD to support stories like this. None of these vendors had any input as to the content of this or other articles.




[1] https://www.theregister.com/2024/03/17/ai_pc_local_llm/?td=rt-3a


[5] https://www.theregister.com/2024/12/15/speculative_decoding/


[7] https://www.theregister.com/2024/11/10/llm_finetuning_guide/


[9] https://docs.nvidia.com/nim/large-language-models/latest/supported-models.html

[10] https://regmedia.co.uk/2025/04/18/kubernetes_gpu_example.png

[11] https://www.theregister.com/2024/03/19/nvidia_why_write_code_when/

[12] https://www.theregister.com/2024/10/24/huggingface_hugs_nvidia/

[13] https://github.com/vllm-project/vllm

[14] https://regmedia.co.uk/2025/04/18/kubernetes_gpu_demo.png

[15] https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file

[16] https://github.com/ROCm/k8s-device-plugin

[17] https://docs.vllm.ai/en/stable/

[18] https://regmedia.co.uk/2025/04/18/model_memory.png

[19] https://docs.vllm.ai/en/stable/getting_started/installation/gpu.html?device=rocm#set-up-using-docker

[20] https://regmedia.co.uk/2025/04/18/vllm-summary-example.png

[21] https://regmedia.co.uk/2025/04/18/vllm_customer_service_example.png

[22] https://github.com/vllm-project/vllm/blob/main/benchmarks/README.md

[23] https://huggingface.co/google/gemma-3-4b-it

[24] https://docs.nvidia.com/nim/large-language-models/1.8.0/deploy-helm.html

[25] https://docs.nvidia.com/nim/large-language-models/1.8.0/getting-started.html

[26] https://www.theregister.com/2024/06/15/ai_rag_guide/

[27] https://www.theregister.com/2024/11/10/llm_finetuning_guide/

[28] https://www.theregister.com/2024/12/15/speculative_decoding/

[29] https://www.theregister.com/2024/08/26/ai_llm_tool_calling/

[30] https://www.theregister.com/2024/07/07/containerize_ai_apps/

[31] https://www.theregister.com/2025/04/21/mcp_guide/

[32] https://www.theregister.com/2025/02/23/aleph_alpha_sovereign_ai/



