🚀

LLM Inference & Serving Lab

Vertex AI Endpoints / GKE with GPU node pools

Overview

Training an LLM is only half the job — serving it efficiently in production is the other half. The LLM Inference Lab gives you a real deployment environment with an NVIDIA L4 GPU (the cost-efficient inference GPU from Google Cloud) and a GKE cluster to deploy models as scalable API endpoints. You'll learn to deploy models with vLLM (the fastest open-source LLM serving engine), apply quantization to reduce memory and cost, build FastAPI wrappers, and run load tests to understand throughput and latency tradeoffs. This is the exact workflow used by production AI teams to serve language models at scale.

What You'll Do in This Lab

  • Deploy fine-tuned LLMs as REST APIs using vLLM and FastAPI
  • Serve models on Vertex AI Endpoints with auto-scaling configuration
  • Apply GPTQ, AWQ, and GGUF quantization and measure quality vs speed tradeoff
  • Run load tests with Locust — measure tokens/second, latency P50/P95/P99
  • Implement streaming responses for real-time chat applications
  • Configure model caching, batching, and prefix caching for cost optimization

Lab Workflow

1

Launch

Start the serving lab. A GKE cluster with an NVIDIA L4 GPU node and a Vertex AI Workbench control plane spin up for you.

2

Load Model

Pull a model from Cloud Storage or Hugging Face Hub. Apply quantization if needed (GPTQ, AWQ, or GGUF).

3

Deploy

Deploy with vLLM or TGI to your GKE cluster. Configure batch size, max concurrent requests, and tensor parallelism.

4

Test

Send sample requests via curl or the provided test client. Verify correct outputs, streaming behavior, and structured output formatting.

5

Benchmark

Run Locust load tests. Generate throughput charts, latency distributions, and cost-per-token estimates.

6

Optimize

Tune serving parameters, enable prefix caching, adjust batch sizes. Re-benchmark and compare results.

Hardware & Environment

Serving GPU NVIDIA L4 (24 GB VRAM, Ada Lovelace architecture)
Machine Type g2-standard-12 (12 vCPU, 48 GB RAM)
GKE Cluster Autopilot with GPU node pool
Serving Frameworks vLLM 0.4+, TGI 2.0+
Max Model Size 7B FP16, 13B GPTQ-4bit, 70B AWQ-4bit
Session Length 2-3 hour sessions

Pre-installed Tools

vLLM Text Generation Inference (TGI) Vertex AI Model Registry GGUF / GPTQ / AWQ quantization FastAPI Locust (load testing)

Frequently asked questions about this lab

What is the LLM Inference & Serving Lab? +
Environment for deploying, serving, and benchmarking LLM inference. Students learn to optimize serving throughput, configure quantized model deployment, and build production API endpoints.
Which courses use this lab? +
This lab is included in: NLP & Large Language Model Engineering, MLOps & AI Infrastructure.
What hardware does this lab run on? +
Vertex AI Endpoints / GKE with GPU node pools. Serving GPU: NVIDIA L4 (24 GB VRAM, Ada Lovelace architecture); Machine Type: g2-standard-12 (12 vCPU, 48 GB RAM); GKE Cluster: Autopilot with GPU node pool; Serving Frameworks: vLLM 0.4+, TGI 2.0+.
What software comes pre-installed? +
Comes pre-loaded with vLLM, Text Generation Inference (TGI), Vertex AI Model Registry, GGUF / GPTQ / AWQ quantization, FastAPI, Locust (load testing). No local installs or dependency setup required — open your browser and start working.
Can I bring my own datasets and code into this lab? +
Yes. Datasets can be uploaded directly or synced from Google Cloud Storage. Notebooks and source files have built-in Git integration so you can push work to your own GitHub or GitLab repos.
Do I need to enroll in a course to use this lab? +
Yes. Lab environments are provisioned per-student as part of an AI Labs course enrollment. Browse the courses linked above to find programs that include this lab.

Related labs

Other AI Labs environments students typically use alongside this one.

Ready to Try This Lab?

Enroll in a course that uses this lab, or visit our Houston center for a hands-on demo.