Lab: Redshift Slowly Changing Dimension
Data loading into a SCD table involves a first-time bulk data loading, referred to as the initial data load. This is followed by continuous or regular data loading, referred to as an incremental data load, to keep the records up to date with changes in the source tables.
To demonstrate the solution, we walk through the following steps for initial data load (1–7) and incremental data load (8–12):
- Land the source data files in an Amazon S3 location, using one subfolder per source table.
- Use an AWS Glue crawler to parse the data files and register tables in the AWS Glue Data Catalog.
- Create an external schema in Amazon Redshift to point to the AWS Glue database containing these tables.
- In Amazon Redshift, create one view per source table to fetch the latest version of the record for each primary key (
customer_id
) value. - Create the
dim_customer
table in Amazon Redshift, which contains attributes from all relevant source tables. - Create a view in Amazon Redshift joining the source table views from Step 4 to project the attributes modeled in the dimension table.
- Populate the initial data from the view created in Step 6 into the
dim_customer
table, generatingcustomer_sk
. - Land the incremental data files for each source table in their respective Amazon S3 location.
- In Amazon Redshift, create a temporary table to accommodate the change-only records.
- Join the view from Step 6 and
dim_customer
and identify change records comparing the combined hash value of attributes. Populate the change records into the temporary table with anI
,U
, orD
indicator. - Update
rec_exp_dt
indim_customer
for allU
andD
records from the temporary table. - Insert records into
dim_customer
, querying allI
andU
records from the temporary table.