LLM Inference & Serving Lab

Overview

Training an LLM is only half the job — serving it efficiently in production is the other half. The LLM Inference Lab gives you a real deployment environment with an NVIDIA L4 GPU (the cost-efficient inference GPU from Google Cloud) and a GKE cluster to deploy models as scalable API endpoints. You'll learn to deploy models with vLLM (the fastest open-source LLM serving engine), apply quantization to reduce memory and cost, build FastAPI wrappers, and run load tests to understand throughput and latency tradeoffs. This is the exact workflow used by production AI teams to serve language models at scale.

What You'll Do in This Lab

Deploy fine-tuned LLMs as REST APIs using vLLM and FastAPI
Serve models on Vertex AI Endpoints with auto-scaling configuration
Apply GPTQ, AWQ, and GGUF quantization and measure quality vs speed tradeoff
Run load tests with Locust — measure tokens/second, latency P50/P95/P99
Implement streaming responses for real-time chat applications
Configure model caching, batching, and prefix caching for cost optimization

Lab Workflow

1

Launch

Start the serving lab. A GKE cluster with an NVIDIA L4 GPU node and a Vertex AI Workbench control plane spin up for you.

2

Load Model

Pull a model from Cloud Storage or Hugging Face Hub. Apply quantization if needed (GPTQ, AWQ, or GGUF).

3

Deploy

Deploy with vLLM or TGI to your GKE cluster. Configure batch size, max concurrent requests, and tensor parallelism.

4

Test

Send sample requests via curl or the provided test client. Verify correct outputs, streaming behavior, and structured output formatting.

5

Benchmark

Run Locust load tests. Generate throughput charts, latency distributions, and cost-per-token estimates.

6

Optimize

Tune serving parameters, enable prefix caching, adjust batch sizes. Re-benchmark and compare results.

Hardware & Environment

Serving GPU	NVIDIA L4 (24 GB VRAM, Ada Lovelace architecture)
Machine Type	g2-standard-12 (12 vCPU, 48 GB RAM)
GKE Cluster	Autopilot with GPU node pool
Serving Frameworks	vLLM 0.4+, TGI 2.0+
Max Model Size	7B FP16, 13B GPTQ-4bit, 70B AWQ-4bit
Session Length	2-3 hour sessions

Frequently asked questions about this lab

What is the LLM Inference & Serving Lab? +

Environment for deploying, serving, and benchmarking LLM inference. Students learn to optimize serving throughput, configure quantized model deployment, and build production API endpoints.

Which courses use this lab? +

This lab is included in: NLP & Large Language Model Engineering, MLOps & AI Infrastructure.

What hardware does this lab run on? +

Vertex AI Endpoints / GKE with GPU node pools. Serving GPU: NVIDIA L4 (24 GB VRAM, Ada Lovelace architecture); Machine Type: g2-standard-12 (12 vCPU, 48 GB RAM); GKE Cluster: Autopilot with GPU node pool; Serving Frameworks: vLLM 0.4+, TGI 2.0+.

What software comes pre-installed? +

Comes pre-loaded with vLLM, Text Generation Inference (TGI), Vertex AI Model Registry, GGUF / GPTQ / AWQ quantization, FastAPI, Locust (load testing). No local installs or dependency setup required — open your browser and start working.

Can I bring my own datasets and code into this lab? +

Yes. Datasets can be uploaded directly or synced from Google Cloud Storage. Notebooks and source files have built-in Git integration so you can push work to your own GitHub or GitLab repos.

Do I need to enroll in a course to use this lab? +

Yes. Lab environments are provisioned per-student as part of an AI Labs course enrollment. Browse the courses linked above to find programs that include this lab.

Related labs

Other AI Labs environments students typically use alongside this one.

Ready to Try This Lab?

Enroll in a course that uses this lab, or visit our Houston center for a hands-on demo.

Browse Courses View All Labs