Vertex AI Endpoints / GKE with GPU node pools
Training an LLM is only half the job; serving it efficiently in production is the other half. The LLM Inference Lab gives you a real deployment environment with an NVIDIA L4 GPU (a cost-efficient inference GPU available on Google Cloud) and a GKE cluster for deploying models as scalable API endpoints. You'll learn to deploy models with vLLM (a high-throughput open-source LLM serving engine), apply quantization to reduce memory and cost, build FastAPI wrappers, and run load tests to understand throughput and latency tradeoffs. This mirrors the workflow production AI teams use to serve language models at scale.
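To see why quantization matters on this hardware, here is a rough back-of-the-envelope sketch in Python. It estimates only the memory taken by model weights at different precisions and compares it to the L4's 24 GB. The 7B parameter count is an illustrative assumption, and the helper name `weight_memory_gb` is made up for this example. Real deployments also need room for the KV cache, activations, and runtime overhead, and quantized formats such as GPTQ or AWQ store extra scaling data, so treat the numbers as a lower bound.

```python
# Rough weight-memory estimate. Ignores KV cache, activations,
# quantization metadata, and runtime overhead.
L4_MEMORY_GB = 24  # an NVIDIA L4 has 24 GB of GPU memory


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 7e9  # assumed model size, for illustration only
    for label, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
        gb = weight_memory_gb(params, bits)
        print(f"7B model, {label}: ~{gb:.1f} GB of {L4_MEMORY_GB} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which leaves more of the GPU for the KV cache and therefore for concurrent requests.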
Start the serving lab. A GKE cluster with an NVIDIA L4 GPU node and a Vertex AI Workbench control plane spin up for you.
Pull a model from Cloud Storage or Hugging Face Hub. Apply quantization if needed (GPTQ, AWQ, or GGUF).
Deploy with vLLM or Hugging Face Text Generation Inference (TGI) on your GKE cluster. Configure batch size, max concurrent requests, and tensor parallelism.
Send sample requests via curl or the provided test client. Verify correct outputs, streaming behavior, and structured output formatting.
Run Locust load tests. Generate throughput charts, latency distributions, and cost-per-token estimates.
Tune serving parameters: enable prefix caching and adjust batch sizes. Re-benchmark and compare the results against your baseline run.
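The benchmarking and tuning steps above come down to a few numbers per run: latency percentiles, token throughput, and cost per token. The following is a minimal Python sketch of that arithmetic, not the lab's actual tooling. The function name `summarize_run`, its parameters, and the hourly GPU price you pass in are all assumptions for illustration; in practice the latencies and token counts would come from your load-test output.

```python
import statistics


def summarize_run(latencies_s, tokens_generated, wall_time_s, gpu_hourly_usd):
    """Summarize one benchmark run (hypothetical helper).

    latencies_s:      per-request end-to-end latencies, in seconds
    tokens_generated: total output tokens across all requests
    wall_time_s:      duration of the run, in seconds
    gpu_hourly_usd:   assumed hourly price of the GPU node
    """
    # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    tokens_per_s = tokens_generated / wall_time_s
    # Cost of one second of GPU time, divided by tokens produced per second.
    usd_per_token = (gpu_hourly_usd / 3600) / tokens_per_s
    return {
        "p50_s": cuts[49],
        "p95_s": cuts[94],
        "tokens_per_s": tokens_per_s,
        "usd_per_1m_tokens": usd_per_token * 1_000_000,
    }


if __name__ == "__main__":
    # Made-up numbers, only to show the call shape.
    baseline = summarize_run([1.0, 2.0, 3.0, 4.0, 5.0], 36_000, 60.0, 1.0)
    for name, value in baseline.items():
        print(f"{name}: {value:.3f}")
```

Running the same summary on a baseline run and on a tuned run (for example, after enabling prefix caching) gives a like-for-like comparison, provided the workload and run duration are kept the same.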
Other AI Labs environments students typically use alongside this one.
Single-GPU environment for training deep learning models, running computer vision pipelines, and experimenting with neural network architect…
High-memory GPU environment purpose-built for fine-tuning large language models. Supports full fine-tuning of 7B+ parameter models and param…
Pre-configured environment for building retrieval-augmented generation systems. Includes vector databases, embedding model APIs, document pr…
Enroll in a course that uses this lab, or visit our Houston center for a hands-on demo.