# Lab: dbt BigQuery on Stack Exchange data
Extract, load, and transform Stack Exchange data using dbt and a Google BigQuery warehouse.
The data is the publicly available Stack Exchange Data Dump. `Users.xml` and `Posts.xml` were converted to `users.csv.gz` and `posts.csv.gz` and used as the source files for this project. For the ISO 3166-1 country codes, the CSV from DataHub is used (as `country_codes.csv`).
## Pre-requisites
For Google Cloud:

- Create a project
- Create a Cloud Storage bucket and upload the `posts.csv.gz`, `users.csv.gz`, and `country_codes.csv` files
For dbt:

- Python 3.x installed
- Install packages from `requirements.txt`
- Copy the ISO 3166-1 country codes CSV into `./seeds/country_codes.csv`
- Set up a dbt profile in `~/.dbt/profiles.yml` called `bigquery_dbt` for BigQuery (a sketch is shown below)
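A minimal sketch of such a profile, assuming OAuth (application-default) credentials; the project, dataset, and location values are placeholders to replace with your own:

```yaml
# ~/.dbt/profiles.yml
bigquery_dbt:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth            # uses gcloud application-default credentials
      project: my-gcp-project  # placeholder: your Google Cloud project ID
      dataset: stackexchange   # placeholder: the BigQuery dataset dbt writes to
      threads: 4
      location: US             # placeholder: match your dataset's location
```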
## Running
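The `bq` commands in this section assume the following shell variables are set; the values here are placeholders:

```sh
export PROJECT_ID=my-gcp-project   # placeholder: your Google Cloud project ID
export DATASET=stackexchange       # placeholder: the BigQuery dataset name
export BUCKET_NAME=my-bucket       # placeholder: the Cloud Storage bucket from the pre-requisites
```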
Make the BigQuery dataset:
```sh
bq mk --dataset ${PROJECT_ID}:${DATASET}
```
Load the files into BigQuery as tables (the two loads can run concurrently):
```sh
bq load \
  --autodetect \
  --source_format=CSV \
  ${DATASET}.posts \
  gs://${BUCKET_NAME}/posts.csv.gz

bq load \
  --autodetect \
  --source_format=CSV \
  ${DATASET}.users \
  gs://${BUCKET_NAME}/users.csv.gz
```
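To sanity-check what `--autodetect` inferred, the table schemas can be inspected, for example:

```sh
bq show --schema --format=prettyjson ${DATASET}.users
bq show --schema --format=prettyjson ${DATASET}.posts
```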
Ensure the Google Cloud project ID is specified in the `database` field in `schema.yml`.
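A sketch of the relevant `schema.yml` section; the source and table names below are illustrative assumptions, only the `database` field must carry your project ID:

```yaml
version: 2

sources:
  - name: stackexchange       # assumption: illustrative source name
    database: my-gcp-project  # set this to your Google Cloud project ID
    schema: stackexchange     # the BigQuery dataset created above
    tables:
      - name: users
      - name: posts
```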
Run dbt:
```sh
dbt build   # load the country codes CSV as a reference table (via seeds), run tests, etc.
dbt run     # build the models
```
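For illustration only (this is not the repository's actual model), a dbt model in this style could aggregate users per country by joining the `users` source against the `country_codes` seed; the source and column names are assumptions:

```sql
-- models/aggregated_users.sql (hypothetical sketch)
with users as (
    select * from {{ source('stackexchange', 'users') }}
),

countries as (
    select * from {{ ref('country_codes') }}
)

select
    countries.name        as country,
    count(*)              as user_count,
    avg(users.reputation) as avg_reputation  -- assumption: users has a reputation column
from users
join countries
    on users.location = countries.name       -- naive match of free-text location to country name
group by country
```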
Load the created table into GCS:
```sh
bq extract \
  --destination_format CSV \
  --compression GZIP \
  --field_delimiter ',' \
  ${PROJECT_ID}:${DATASET}.aggregated_users \
  gs://${BUCKET_NAME}/dbt_bigquery/agg_users.csv.gz
```
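The exported file can then be copied down from the bucket, for example:

```sh
gsutil cp gs://${BUCKET_NAME}/dbt_bigquery/agg_users.csv.gz .
```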
## Files
```text
├── [2.1K] README.md
├── [1.1K] dbt_project.yml
└── [1.8K] models
    ├── [1.2K] aggregated_users.sql
    └── [ 525] schema.yml

5.2K used in 1 directory, 4 files
```