Data Engineering Lab

Overview

AI models are only as good as the data that feeds them. The Data Engineering Lab gives you a full-stack GCP data platform: BigQuery for analytics, Cloud Composer for orchestration, Dataflow for batch and streaming processing, and Pub/Sub for event-driven pipelines. This isn't a toy environment. You'll work with datasets in the tens of millions of rows, build DAGs that run on a schedule, and create streaming pipelines that process events in real time. These are the same tools and patterns used by data teams at Google, Spotify, and Airbnb.

What You'll Do in This Lab

Build batch ETL pipelines with Apache Beam running on Dataflow
Create and schedule Airflow DAGs in Cloud Composer
Write dbt models for SQL-based data transformation in BigQuery
Build streaming pipelines with Pub/Sub and Dataflow Streaming
Implement feature stores for ML with Vertex AI Feature Store
Set up data quality checks with Great Expectations
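
To make the streaming item above concrete, here is a minimal sketch of fixed-window counting by event time, the kind of aggregation a streaming pipeline typically performs. It uses only the Python standard library and is not the Beam or Dataflow API; the 60-second window size, the event fields (`ts`, `type`), and the sample events are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 60  # assumed window size, for illustration only


def window_start(event_time: datetime, size_s: int = WINDOW_SECONDS) -> datetime:
    """Return the start of the fixed window that contains event_time."""
    epoch = int(event_time.timestamp())
    return datetime.fromtimestamp(epoch - epoch % size_s, tz=timezone.utc)


def count_per_window(events):
    """Count events per (window start, event type), keyed by event time."""
    counts = defaultdict(int)
    for event in events:
        counts[(window_start(event["ts"]), event["type"])] += 1
    return dict(counts)


# Hypothetical click events; the third one arrives out of order.
events = [
    {"ts": datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc), "type": "click"},
    {"ts": datetime(2024, 1, 1, 12, 1, 10, tzinfo=timezone.utc), "type": "click"},
    {"ts": datetime(2024, 1, 1, 12, 0, 40, tzinfo=timezone.utc), "type": "click"},
]
result = count_per_window(events)
```

Because the window is derived from each event's own timestamp, the late third event still lands in the 12:00 window. A real streaming job would also need a policy for how long to wait for late data, which this sketch leaves out.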

Lab Workflow

1. Launch

Start the lab. Your Cloud Composer environment, BigQuery datasets, Pub/Sub topics, and Cloud Storage buckets are provisioned.

2. Explore

Examine the source data in BigQuery and Cloud Storage. Understand schemas, data quality issues, and transformation requirements.

3. Build

Write your data pipeline: Beam for ETL, dbt for transformations, and Airflow DAGs for orchestration. Deploy to Cloud Composer.
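
As a rough idea of what the Beam part of this step can look like, here is a minimal batch ETL sketch. The CSV layout (`order_id,user_id,amount`), the function names, and the file paths are hypothetical, and the sketch writes JSON lines to text files as a local stand-in for a BigQuery sink.

```python
import json


def parse_order(line: str) -> dict:
    """Parse one CSV line 'order_id,user_id,amount' into a typed record."""
    order_id, user_id, amount = line.strip().split(",")
    return {"order_id": order_id, "user_id": user_id, "amount": float(amount)}


def is_valid(order: dict) -> bool:
    """Keep only orders with a positive amount."""
    return order["amount"] > 0


def run(input_path: str, output_prefix: str, argv=None) -> None:
    """Build and run the pipeline.

    Beam is imported here so the pure functions above can be unit tested
    without the SDK installed. With argv=None, Beam reads flags from sys.argv;
    with no runner flag it uses the local DirectRunner.
    """
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            | "Parse" >> beam.Map(parse_order)
            | "DropInvalid" >> beam.Filter(is_valid)
            | "ToJson" >> beam.Map(json.dumps)
            | "Write" >> beam.io.WriteToText(output_prefix, file_name_suffix=".jsonl")
        )
```

Calling `run("orders.csv", "out/orders")` would execute the pipeline locally. Running the same code on Dataflow is mostly a matter of passing the appropriate runner and project flags through `argv`.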

4. Run

Trigger your pipeline. Watch Dataflow jobs scale workers, monitor Airflow task execution, and verify BigQuery outputs.
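
When you watch task execution, it helps to remember that an orchestrator runs tasks in dependency order. The sketch below illustrates that idea with the standard library's `graphlib`; it is not Airflow's API, and the task names and graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "load_raw_orders": set(),
    "load_raw_users": set(),
    "beam_etl": {"load_raw_orders", "load_raw_users"},
    "dbt_run": {"beam_etl"},
    "validate_outputs": {"dbt_run"},
}


def execution_batches(deps):
    """Group tasks into batches whose dependencies have all completed."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependents
    return batches


batches = execution_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Tasks in the same batch have no dependencies on each other, so a scheduler is free to run them in parallel.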

5. Validate

Run Great Expectations validation suites against your output tables. Check for schema correctness, null rates, and value distributions.
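
The three kinds of checks named above can be sketched in plain Python. This is a simplified stand-in to show the logic, not the Great Expectations API; the column names, thresholds, and sample rows are assumptions.

```python
def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def check_table(rows, schema, max_null_rate, ranges):
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    # Schema check: every non-null value must have the expected Python type.
    for column, expected in schema.items():
        bad = [r for r in rows
               if r.get(column) is not None and not isinstance(r[column], expected)]
        if bad:
            failures.append(f"{column}: {len(bad)} value(s) not of type {expected.__name__}")
    # Null-rate check: the share of nulls must stay at or below the limit.
    for column, limit in max_null_rate.items():
        rate = null_rate(rows, column)
        if rate > limit:
            failures.append(f"{column}: null rate {rate:.2f} exceeds {limit:.2f}")
    # Range check: a coarse proxy for a value-distribution check.
    for column, (low, high) in ranges.items():
        out = [r for r in rows
               if r.get(column) is not None and not low <= r[column] <= high]
        if out:
            failures.append(f"{column}: {len(out)} value(s) outside [{low}, {high}]")
    return failures


# Hypothetical output rows with one null and one negative amount.
rows = [
    {"order_id": "o1", "amount": 12.5},
    {"order_id": "o2", "amount": None},
    {"order_id": "o3", "amount": -4.0},
    {"order_id": "o4", "amount": 30.0},
]
failures = check_table(
    rows,
    schema={"order_id": str, "amount": float},
    max_null_rate={"amount": 0.1},
    ranges={"amount": (0.0, 10000.0)},
)
```

Returning every failure instead of stopping at the first one mirrors how a validation suite reports results, so one run shows all the problems in a table.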

6. Monitor

Review Cloud Monitoring dashboards for pipeline health: processing times, error rates, and data freshness metrics.

Hardware & Environment

Orchestration	Cloud Composer 2 (managed Apache Airflow 2.x)
Batch Processing	Dataflow (managed Apache Beam) — auto-scaling workers
Data Warehouse	BigQuery sandbox (10 GB free, on-demand pricing for exercises)
Streaming	Pub/Sub topics + Dataflow Streaming jobs
Storage	Cloud Storage buckets with Parquet/Delta Lake datasets
Session Length	Persistent environment — pipelines continue between sessions

Frequently asked questions about this lab

What is the Data Engineering Lab?

The Data Engineering Lab is a full-stack data engineering environment on GCP. Students build batch and streaming data pipelines, work with data lakes, and create feature stores for ML systems.

Which courses use this lab?

This lab is included in: MLOps & AI Infrastructure, Data Engineering for AI.

What hardware does this lab run on?

The lab runs on managed GCP services: Cloud Composer 2 (managed Apache Airflow 2.x) for orchestration; Dataflow (managed Apache Beam) with auto-scaling workers for batch processing; a BigQuery sandbox (10 GB free, on-demand pricing for exercises) as the data warehouse; and Pub/Sub topics with Dataflow Streaming jobs for streaming.

What software comes pre-installed?

The lab comes pre-loaded with BigQuery, Apache Airflow (Cloud Composer), Apache Beam (Dataflow), Pub/Sub, dbt, and Cloud Storage (Delta Lake). No local installs or dependency setup are required: open your browser and start working.

Can I bring my own datasets and code into this lab?

Yes. Datasets can be uploaded directly or synced from Google Cloud Storage. Notebooks and source files have built-in Git integration so you can push work to your own GitHub or GitLab repos.

Do I need to enroll in a course to use this lab?

Yes. Lab environments are provisioned per-student as part of an AI Labs course enrollment. Browse the courses linked above to find programs that include this lab.


Ready to Try This Lab?

Enroll in a course that uses this lab, or visit our Houston center for a hands-on demo.
