Building an End-to-End Data Pipeline in AWS

Architecture Diagram

Activity 1: Ingestion with DMS

In this activity, you will complete the following tasks using an AWS CloudFormation template:

  1. Create the source database environment.
  2. Hydrate the source database environment.
  3. Update the source database environment to demonstrate Change Data Capture (CDC) replication within DMS.
  4. Create a Lambda function that triggers replication of the CDC data from the DMS CDC endpoint to Amazon S3 (a minimal sketch follows the activity details below).

Relevant information about this activity:

  • Expected setup time: 15 minutes
  • Source database name: sportstickets
  • Source schema name: dms_sample
  • Database credentials: adminuser / admin123
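
The Lambda function in task 4 only needs to start (or resume) the CDC replication task. Below is a minimal sketch of such a handler using boto3; the task ARN is a hypothetical placeholder, and the actual function is provisioned by the CloudFormation template.

```python
import boto3

dms = boto3.client("dms")

# Hypothetical placeholder; the real ARN comes from the CloudFormation stack.
REPLICATION_TASK_ARN = "arn:aws:dms:us-east-1:123456789012:task:EXAMPLETASK"


def lambda_handler(event, context):
    # 'resume-processing' continues CDC from the last checkpoint;
    # use 'start-replication' for the very first run of the task.
    response = dms.start_replication_task(
        ReplicationTaskArn=REPLICATION_TASK_ARN,
        StartReplicationTaskType="resume-processing",
    )
    return response["ReplicationTask"]["Status"]
```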

Activity 2: Data Lake Hydration

In this activity, you will complete the following prerequisites using an AWS CloudFormation template:

  1. Create the VPC required for the AWS DMS instance.
  2. Create an Amazon S3 bucket for the destination endpoint configuration.
  3. Create Amazon S3 buckets for Amazon Athena query result storage.
  4. Create the Amazon S3 bucket policy required for the AWS DMS service to put data into the bucket (a hedged sketch follows this list).
  5. Create an AWS Glue service role to use in a later section of the project.
  6. Create Amazon Athena workgroup users to use in the Athena activity.
  7. Create AWS Lake Formation users to use in the Lake Formation activity.
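
For prerequisite 4, here is a hedged sketch of what attaching such a bucket policy could look like with boto3. DMS writes to S3 through a service access role, so the statement simply grants that role PutObject on the bucket. The bucket name and role ARN are hypothetical placeholders; the workshop's template (cfn/hydration.yml) creates the real resources.

```python
import json
import boto3

s3 = boto3.client("s3")

BUCKET = "dms-destination-bucket-example"                      # placeholder
DMS_ROLE_ARN = "arn:aws:iam::123456789012:role/dms-s3-access"  # placeholder

# Allow the DMS service access role to write replicated objects into the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowDmsPutObject",
            "Effect": "Allow",
            "Principal": {"AWS": DMS_ROLE_ARN},
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        }
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```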

Activity 3: DMS Migration

You will migrate data from an existing Amazon Relational Database Service (Amazon RDS) for PostgreSQL database to an Amazon Simple Storage Service (Amazon S3) bucket that you create.

In this activity, you will complete the following tasks:

  1. Create a subnet group within the DMS activity VPC.
  2. Create a DMS replication instance.
  3. Create a source endpoint.
  4. Create a target endpoint.
  5. Create a task to perform the initial migration of the data.
  6. Create a second target endpoint for CDC files so that they land in a separate location from the initial load files.
  7. Create a task to perform ongoing replication of data changes (a boto3 sketch of these steps follows this list).
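
The same endpoints and tasks the console walks you through can be expressed through the DMS API. Below is a minimal boto3 sketch of the core pieces: a PostgreSQL source endpoint, an S3 target endpoint, and a full-load migration task. The host name, ARNs, and bucket name are hypothetical placeholders; only the database name, schema, and credentials come from this project.

```python
import json
import boto3

dms = boto3.client("dms")

source = dms.create_endpoint(
    EndpointIdentifier="sportstickets-source",
    EndpointType="source",
    EngineName="postgres",
    ServerName="example-rds-host.amazonaws.com",  # placeholder
    Port=5432,
    DatabaseName="sportstickets",
    Username="adminuser",
    Password="admin123",
)

target = dms.create_endpoint(
    EndpointIdentifier="datalake-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",  # placeholder
        "BucketName": "dms-destination-bucket-example",                          # placeholder
        "DataFormat": "parquet",
    },
)

# Migrate only the dms_sample schema; '%' matches every table in it.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-dms-sample",
        "object-locator": {"schema-name": "dms_sample", "table-name": "%"},
        "rule-action": "include",
    }]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="full-load-task",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE",  # placeholder
    MigrationType="full-load",  # use 'cdc' for the ongoing-replication task
    TableMappings=json.dumps(table_mappings),
)
```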

Activity 4: Transforming Data with Glue - Data Validation and ETL

In this activity, you will:

  1. Create a Glue crawler for the initial full load data.
  2. Create a Glue crawler for the Parquet files.
  3. Transform data incrementally with Glue and Hudi.
  4. Create a Glue job that builds the Hudi table.
  5. Query the Hudi table in Athena.
  6. Upsert incremental changes (a Hudi upsert sketch follows this list).
  7. Run incremental queries using Spark SQL.
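
In the repo, this work is done by src/glu_hdi_ticket_purchase_hist.py and its incremental counterpart. Below is a minimal sketch of the central idea, a Hudi upsert write from Spark, assuming a Glue/Spark environment with the Hudi libraries available. The S3 paths, record key, and precombine field are assumptions, not values taken from the project scripts.

```python
from pyspark.sql import SparkSession

# Hudi's Spark datasource requires the Kryo serializer.
spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Read the full-load Parquet output produced by DMS (placeholder path).
df = spark.read.parquet(
    "s3://dms-destination-bucket-example/dms_sample/ticket_purchase_hist/"
)

hudi_options = {
    "hoodie.table.name": "ticket_purchase_hist",
    # Record key and precombine field are assumptions about the table's columns.
    "hoodie.datasource.write.recordkey.field": "sporting_event_ticket_id",
    "hoodie.datasource.write.precombine.field": "transaction_date_time",
    # 'upsert' inserts new keys and updates existing ones, which is what
    # lets the incremental CDC files be merged into the same table later.
    "hoodie.datasource.write.operation": "upsert",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://datalake-bucket-example/hudi/ticket_purchase_hist/")  # placeholder
)
```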

Activity 5: Query and Visualize

In this activity, you will:

  1. Query the data with Amazon Athena (a boto3 sketch follows this list).
  2. Connect Athena to Amazon QuickSight.
  3. Build a dashboard in Amazon QuickSight.
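
Besides the console, Athena can be queried programmatically, which is handy for validating the pipeline end to end. A minimal sketch is below; the database and table names are hypothetical placeholders, and it assumes the workgroup already has a query result location configured (Activity 2).

```python
import time
import boto3

athena = boto3.client("athena")

qid = athena.start_query_execution(
    QueryString="SELECT * FROM ticket_purchase_hist LIMIT 10",
    QueryExecutionContext={"Database": "dms_sample"},  # placeholder
    WorkGroup="primary",
)["QueryExecutionId"]

# Poll until the query reaches a terminal state, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```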

Project Structure

.
├── [ 61K]  01-sa.ipynb
├── [ 32K]  cfn
│   ├── [ 14K]  dms.yml
│   └── [ 18K]  hydration.yml
├── [708K]  img
│   ├── [297K]  arch-diagram.png
│   ├── [ 84K]  athena-federated.png
│   └── [327K]  dashboard.png
├── [3.1K]  README.md
└── [ 14K]  src
    ├── [ 11K]  glu_hdi_ticket_purchase_hist.py
    └── [3.2K]  glu_hdi_ticket_purchase_hist_incremental.py

818K used in 3 directories, 9 files