Sparkify

Sparkify SQL Data Modeling with Postgres

In this project, we model the data with Postgres and build an ETL pipeline using Python. We define the fact and dimension tables for a star schema focused on song-play analytics, and develop an ETL pipeline that transfers data from files in two local directories into these tables in Postgres using Python and SQL.

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming application. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the application, as well as a directory with JSON meta-data on the songs in their application.

They'd like a data engineer to create a Postgres database with tables designed to optimize queries on song play analysis. The goal of this project is to create a database schema and ETL pipeline for this analysis.

Songs dataset

The songs dataset is a subset of the Million Song Dataset. Each file in the dataset is in JSON format and contains metadata about a song and the artist of that song.

Sample record:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Logs dataset

The logs dataset is generated by the Event Simulator. These log files, in JSON format, simulate activity logs from a music streaming application based on specified configurations.

Sample record:

{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}

Quality

  1. Log dataset: userId contains empty strings ('') and firstName can be None
  2. Log dataset: most of the artist_id and song_id values are null
  3. The artists table does not contain all of the artists found in the logs
  4. The songs table does not contain all of the songs found in the logs
  5. Log dataset: logs need to be processed in timestamp order so that the latest user level is the one stored in the users table

Tidiness

  1. Log dataset: the ts column is stored as int64 (epoch milliseconds) and needs to be converted to a timestamp (see the sketch after this list)
  2. songplays table: add a new column songplay_id as SERIAL (auto-increment)
  3. users table: add a column ts
  4. songplays table: add columns itemInSession, song, and artist
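
As an illustration of the ts conversion in item 1, here is a minimal pandas sketch; the file path is a placeholder and the project's etl.py may implement this differently.

# Minimal sketch: convert the int64 epoch-millisecond column `ts` into timestamps.
import pandas as pd

df = pd.read_json("data/log_data/sample.json", lines=True)  # path is an assumption
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")      # ms epoch -> timestamp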

Database Schema Design - Entity Relation Diagram (ERD)

A star schema is used for data modeling in this ETL pipeline. There is one fact table containing all the metrics (facts) associated with each event (user action), and four dimension tables containing associated information such as user name, artist name, and song metadata. This model lets us query the schema with the minimum number of SQL JOINs and enables fast read queries. The amount of data we need to analyze is not big enough to require big data solutions or NoSQL databases.

data_schema
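
As a concrete example, a minimal psycopg2 sketch of creating the songplays fact table is shown below; the connection string and the exact column list are assumptions based on the datasets described above.

# Minimal sketch: create the songplays fact table (connection and columns assumed).
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")  # placeholder credentials
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,  -- auto-increment id (Tidiness item 2)
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")
conn.commit()
conn.close()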

Sparkify NoSQL Data Modeling with Cassandra

In this project, we model the data with Apache Cassandra and build an ETL pipeline using Python. The ETL pipeline transfers data from a set of CSV files within a directory into a single streamlined CSV file, which is then used to model and insert data into Apache Cassandra tables. We create separate denormalized tables to answer specific queries, using partition keys and clustering columns appropriately.

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming application. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data, which resides in a directory of JSON logs on user activity on the application, as well as a directory with JSON meta-data on the songs in their application.

They'd like a data engineer to create an Apache Cassandra database that can answer queries on song play data and yield meaningful insights. The goal of this project is to create a database schema and ETL pipeline for this analysis.

Event Dataset

The event dataset is a collection of CSV files containing information on user activity over a period of time. Each file in the dataset contains information about the song played, user information, and other attributes.

List of available data columns:

artist, auth, firstName, gender, itemInSession, lastName, length, level, location, method, page, registration, sessionId, song, status, ts, userId

Keyspace Schema Design

The keyspace design is shown in the image below. Each table is modeled to answer a specific, known query. This model makes it possible to query a schema that contains huge amounts of data. Relational databases are not suitable in this scenario due to the magnitude of the data.

keyspace
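
To make the query-first modeling concrete, here is a minimal sketch using the Python cassandra-driver; the keyspace, table, and column names are assumptions chosen to answer a "songs played in a given session" style query, not necessarily the project's exact tables.

# Minimal sketch: one denormalized table per query, with an explicit
# partition key and clustering column (names are assumptions).
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sparkify
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sparkify")

# Partition key session_id locates the partition; clustering column
# item_in_session orders rows within it.
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id      INT,
        item_in_session INT,
        artist          TEXT,
        song            TEXT,
        length          FLOAT,
        PRIMARY KEY (session_id, item_in_session)
    )
""")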

Sparkify Data Lake with AWS and PySpark

In this lab, we build a data lake on AWS S3 and an ETL pipeline for it. The data is loaded from S3, processed into analytics tables using Spark, and the processed data is written back to S3 as parquet files.

The data stored on S3 buckets is extracted and processed using Spark, and is then inserted into the fact and dimensional tables.

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming application. Sparkify has grown its user base and song database considerably and wants to move its data warehouse to a data lake. Their data resides in S3, in a directory of JSON logs on user activity on the application, as well as a directory with JSON metadata on the songs in their application.

They'd like a data engineer to build an ETL pipeline that extracts their data from S3, processes it using Spark, and loads the data back into S3 as a set of fact and dimension tables. This will allow their analytics team to continue finding insights into what songs their users are listening to. The goal of this project is to create a data lake on the cloud (AWS S3) and build an ETL pipeline for this process.

Project Description

In this project, we build a data lake on AWS S3 and an ETL pipeline for it. The data is loaded from S3, processed into analytics tables using Spark, and the processed data is written back to S3 as parquet files.

Built With

  • python
  • AWS

Dataset

Song Dataset

The songs dataset is a subset of the Million Song Dataset. Each file in the dataset is in JSON format and contains metadata about a song and the artist of that song. The dataset is hosted in the S3 bucket s3://sparsh-dend/song_data.

Sample Record :

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset

The logs dataset is generated by the Event Simulator. These log files, in JSON format, simulate activity logs from a music streaming application based on specified configurations. The dataset is hosted in the S3 bucket s3://sparsh-dend/log_data.

Sample Record :

{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}

Data Model ERD

The star schema (fact and dimension schema) is used for data modeling in this ETL pipeline. There is one fact table containing all the metrics (facts) associated with each event (user action), and four dimension tables containing associated information such as user name, artist name, and song metadata. This model lets us query the schema with the minimum number of SQL JOINs and enables fast read queries.

The data stored on S3 buckets is extracted and processed using Spark and is then inserted into the fact and dimension tables. These tables are written back to S3 as parquet files, organized for optimal performance. An entity relationship diagram (ERD) of the data model is given below.

database
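
A minimal PySpark sketch of this flow is given below; the wildcard path, column selection, and partitioning are assumptions based on the dataset description, and the actual etl.py may differ.

# Minimal sketch: read raw song JSON from S3, build the songs table,
# and write it back to S3 as partitioned parquet files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-datalake").getOrCreate()

song_df = spark.read.json("s3a://sparsh-dend/song_data/*/*/*/*.json")  # path pattern assumed

songs_table = (song_df
               .select("song_id", "title", "artist_id", "year", "duration")
               .dropDuplicates())

(songs_table.write
            .mode("overwrite")
            .partitionBy("year", "artist_id")          # partition layout assumed
            .parquet("s3a://your-bucket-id/songs/"))   # target bucket placeholder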

In this project, (Py)Spark is used to process large amounts of data for a fictional music streaming service called Sparkify.

Problem

The scenario to be solved in this project is: Sparkify has gained a lot of new users, and both the song database and the recorded song plays have grown over time.

Sparkify has created a dump of the data in Amazon S3 storage. This dump currently consists of JSON logs of the user activity and metadata on the songs.

To improve analysis, the data should be transformed into a star schema. The transformed data should then be stored again on S3 for further use.

Solution

We read the data from S3 with (Py)Spark and transform it into a star schema. Finally, we store the results as parquet files, from which they can easily be processed further.

Getting started

A working Python (>= Python 3.6) environment is required.

In this environment, run

pip install pyspark

or use the provided Anaconda environment

conda create -p ./.conda-env --file conda.txt
conda activate ./.conda-env

Then execute the etl.py script, which can be run in either local mode or remote mode:

Local mode:

# make sure to first unzip the sample data
bash unzip_data.sh

python etl.py local --help # list all parameters
python etl.py local # runs the script in local mode (with default params)

Remote mode:

python etl.py remote --help  # list all parameters
python etl.py remote --s3-bucket-target s3a://your-bucket-id

When running in remote mode, make sure to enter AWS credentials in the dl.cfg file first.

Sparkify Data Warehouse with Redshift

Sharpen your data warehousing skills and deepen your understanding of data infrastructure. Create cloud-based data warehouses on Amazon Web Services (AWS). Build a data warehouse on AWS and an ETL pipeline for a database hosted on Redshift. The data is loaded from S3 buckets into staging tables on Redshift and modeled into fact and dimension tables to perform analytics and obtain meaningful insights.

In this lab, AWS Redshift is used as a data warehouse for a fictional music streaming service called Sparkify.

Problem

The scenario to be solved in this lab is: Sparkify has gained a lot of new users, and both the song database and the recorded song plays have grown over time.

Sparkify has created a dump of the data in Amazon S3 storage. This dump currently consists of JSON logs of the user activity and metadata on the songs.

Solution

We use AWS Redshift as a data warehouse in order to make the data available to a wider audience. The data will first be loaded from S3 into a staging area in Redshift, where we create two tables that store the data in a schema similar to the dumped data on S3.

In a second step, the data is transformed into a star schema, which is optimized for data analytics queries.

database-data-warehousing
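
As an illustration of the staging load, a minimal sketch of a Redshift COPY statement (kept as a Python string) is shown below; the staging table name, IAM role, JSONPaths file location, and region are placeholders/assumptions.

# Minimal sketch: load the raw JSON log events from S3 into a staging table.
staging_events_copy = """
    COPY staging_events
    FROM 's3://sparsh-dend/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'  -- placeholder role
    FORMAT AS JSON 's3://sparsh-dend/log_json_path.json'          -- JSONPaths location assumed
    REGION 'us-west-2';                                           -- region assumed
"""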

Getting started

Make sure to have an AWS Redshift cluster up and running and enter the login details in the dwh.cfg file. The cluster must be configured so that it is reachable from the public internet if the scripts are not executed within the VPC.

Then start the pipeline:

  1. Create the tables by running python create_tables.py from the terminal.
  2. Run the ETL pipeline to extract the data from S3 to Redshift and transform/load it into the target tables.

Both scripts also log to stdout, so we can see what is happening while tables are created or data is loaded into Redshift. The ETL step takes quite some time, so please be patient.

Data Layout

The data layout in the Data Warehouse follows the star schema. It consists of one fact table fact_songplay and four dimension tables: dim_user, dim_song, dim_artist and dim_time.

In a larger cluster, I expect analytic queries to be user- and song-centric, which means these tables will often be joined. As long as they do not grow too much in size, it makes sense to replicate them to all nodes to mitigate data transfer across nodes. Because of this, the tables dim_user and dim_song use DISTSTYLE ALL in their CREATE TABLE statements.

I also expect queries to be mostly interested in the latest events, which is why fact_songplay.start_time is the leading column of the sort key. This means Redshift can skip entire blocks that fall outside a time range in queries that filter on time.

Since the artist table dim_artist is not replicated to all nodes in the cluster, unlike dim_user and dim_song, the column artist_id (fact_songplay.artist_id, dim_artist.artist_id) has been defined as the distribution key. Because dim_artist.artist_id is also defined as a sort key, the query optimizer should be able to choose a sort merge join instead of a slower hash join.
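
The following is a minimal sketch of CREATE TABLE statements reflecting these choices; the column lists are trimmed and partly assumed, and only dim_user is shown for the replicated tables.

# Minimal sketch of the distribution/sort key choices (columns partly assumed).
create_dim_user = """
    CREATE TABLE IF NOT EXISTS dim_user (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    ) DISTSTYLE ALL;                        -- replicated to every node (dim_song is analogous)
"""

create_dim_artist = """
    CREATE TABLE IF NOT EXISTS dim_artist (
        artist_id VARCHAR SORTKEY DISTKEY,  -- sort + distribution key enables sort merge joins
        name      VARCHAR,
        location  VARCHAR
    );
"""

create_fact_songplay = """
    CREATE TABLE IF NOT EXISTS fact_songplay (
        songplay_id BIGINT IDENTITY(0,1),
        start_time  TIMESTAMP SORTKEY,      -- leading sort key column
        user_id     INT,
        song_id     VARCHAR,
        artist_id   VARCHAR DISTKEY,        -- distribution key
        session_id  INT
    );
"""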

Transformations

Dimension: Artist

For demonstration purposes: When inserting data into the dim_artist table, I check for NULL values or empty strings in the column location. I don't want to have empty values in this column, so I replace the missing value with N/A.
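
A minimal sketch of this clean-up in Redshift SQL (as a Python string) is shown below; the staging table and source column names are assumptions.

# Minimal sketch: replace NULL or empty artist locations with 'N/A'.
insert_dim_artist = """
    INSERT INTO dim_artist (artist_id, name, location)
    SELECT DISTINCT
        artist_id,
        artist_name,
        COALESCE(NULLIF(artist_location, ''), 'N/A')  -- NULL or '' becomes 'N/A'
    FROM staging_songs;
"""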

Dimension: Time

As usual, various values are extracted from the timestamp, such as hour, day, week, etc.
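
A minimal sketch of this extraction in Redshift SQL (as a Python string) is given below; the staging table name and the epoch-millisecond ts column are assumptions carried over from the log dataset.

# Minimal sketch: derive time parts from the epoch-millisecond ts column.
insert_dim_time = """
    INSERT INTO dim_time (start_time, hour, day, week, month, year, weekday)
    SELECT DISTINCT
        start_time,
        EXTRACT(hour  FROM start_time),
        EXTRACT(day   FROM start_time),
        EXTRACT(week  FROM start_time),
        EXTRACT(month FROM start_time),
        EXTRACT(year  FROM start_time),
        EXTRACT(dow   FROM start_time)
    FROM (
        SELECT TIMESTAMP 'epoch' + ts / 1000 * INTERVAL '1 second' AS start_time
        FROM staging_events
    ) AS t;
"""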

Sparkify Data Pipeline with Airflow

Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming application. Sparkify has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and have come to the conclusion that the best tool to achieve this is Apache Airflow.

They'd like a data engineer to create high grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. They have also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and they want to run data quality tests against their datasets after the ETL steps have been executed to catch any discrepancies.

The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that tell about user activity in the application and JSON metadata about the songs the users listen to.

Schedule, automate, and monitor data pipelines using Apache Airflow. Run data quality checks, track data lineage, and work with data pipelines in production.

In this milestone, we will build data pipelines with Apache Airflow, using custom-defined operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step.

The data stored on S3 buckets is staged and then inserted into fact and dimension tables on Redshift using Airflow pipelines.

Task:

Build data pipelines with Apache Airflow, using custom-defined operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step.

Process steps:

  1. Upload the Sparkify data to S3 Bucket
  2. Update the Bucket and Path information in DAG
  3. Add the helpers and operators in airflow plugins directory
  4. Add connection: {'connection_id':'aws_default', 'connection_type':'amazon_web_services', 'aws_access_key_id':{AWS_ACCESS_KEY}, 'aws_secret_access_key':{AWS_SECRET_KEY}}
  5. Add connection: {'connection_id':'redshift', 'connection_type':'amazon_redshift', 'host':{HOST}, 'port':{PORT}, 'schema':{DATABASE}, 'login':{USER}, 'password':{PASSWORD}}
  6. Run the DAG

In this project, Airflow is used to coordinate tasks that process large amounts of data for a fictional music streaming service called Sparkify.

Problem

The scenario to be solved in this project is: Sparkify has gained a lot of new users, and both the song database and the recorded song plays have grown over time.

Sparkify has created a dump of the data in Amazon S3 storage. This dump currently consists of JSON logs of the user activity and metadata on the songs.

To improve analysis, the data should be transformed into a star schema. The transformed data should then be stored in an AWS Redshift data warehouse.

Solution

We use Airflow to automatically read data from S3 at periodic intervals and write it into staging tables in Amazon Redshift.

We load both datasets in parallel into the staging area.

After that, we create fact and dimension tables. First we create the fact table, and in a second step we create all dimension tables in parallel.

In a last step we run data quality checks on the transformed data.

Our DAG looks like this:

project5-dag

Getting started

A working Python (>= Python 3.6) environment is required.

In this environment, run

pip install -r requirements.txt

or use the provided Anaconda environment

conda create -p ./.conda-env python=3.7 --file requirements.txt
conda activate ./.conda-env

Then start Airflow:

bash run-airflow.sh

This will start the Airflow scheduler and webserver in parallel.

Make sure to have an AWS Redshift cluster running. The setup requires you to have two credentials stored in your Airflow instance as connections:

  • aws_credentials: Credentials for your AWS user. Use the fields login and password and set your access_key and secret_access_key there.
  • redshift: Choose "Postgres" as the connection type. Enter your AWS Redshift credentials here. Your database name should be stored in the field named "Schema".

airflow-credentials-1

airflow-credentials-2

Also make sure that the tables exist on the cluster. The SQL query to create them can be found in the file create_tables.sql.

Then go to http://localhost:8080 and enable the DAG udac_project_dag to start the process.

About The Project

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming application. Sparkify has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines and have come to the conclusion that the best tool to achieve this is Apache Airflow.

They'd like a data engineer to create high grade data pipelines that are dynamic, built from reusable tasks, can be monitored, and allow easy backfills. They have also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and they want to run data quality tests against their datasets after the ETL steps have been executed to catch any discrepancies.

The source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. The source datasets consist of JSON logs that tell about user activity in the application and JSON metadata about the songs the users listen to.

Project Description

In this project, we will build data pipelines with Apache Airflow, using custom-defined operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step.

Built With

  • python
  • AWS
  • Apache Airflow

Dataset

Song Dataset

The songs dataset is a subset of the Million Song Dataset. Each file in the dataset is in JSON format and contains metadata about a song and the artist of that song. The dataset is hosted in the S3 bucket s3://sparsh-dend/song_data.

Sample Record :

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

Log Dataset

The logs dataset is generated by the Event Simulator. These log files, in JSON format, simulate activity logs from a music streaming application based on specified configurations. The dataset is hosted in the S3 bucket s3://sparsh-dend/log_data.

Sample Record :

{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}

Database Schema Design

Data Model ERD

The star schema (fact and dimension schema) is used for data modeling in this ETL pipeline. There is one fact table containing all the metrics (facts) associated with each event (user action), and four dimension tables containing associated information such as user name, artist name, and song metadata. This model lets us query the schema with the minimum number of SQL JOINs and enables fast read queries.

The data stored on S3 buckets is staged and then inserted into fact and dimension tables on Redshift. The whole process is orchestrated using Airflow, which is set to execute once every hour.

database-data-pipeline

Apache Airflow Orchestration

DAG Structure

The DAG parameters are set as follows (see the sketch after this list):

  • The DAG does not have dependencies on past runs
  • The schedule interval is set to hourly
  • On failure, tasks are retried 3 times
  • Retries happen every 5 minutes
  • Catchup is turned off
  • Emails are not sent on retry
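
A minimal sketch of these defaults in Airflow code is given below; the owner and start_date are placeholders, while the DAG id udac_project_dag is the one mentioned in the Getting started section above.

# Minimal sketch of the DAG defaults described above (owner/start_date assumed).
from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    "owner": "sparkify",                  # placeholder owner
    "depends_on_past": False,             # no dependencies on past runs
    "retries": 3,                         # retry failed tasks 3 times
    "retry_delay": timedelta(minutes=5),  # retries happen every 5 minutes
    "email_on_retry": False,              # no emails on retry
    "start_date": datetime(2019, 1, 1),   # placeholder start date
}

dag = DAG(
    "udac_project_dag",
    default_args=default_args,
    schedule_interval="@hourly",  # hourly schedule
    catchup=False,                # catchup turned off
)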

The DAG dependency graph is given below.

dag

Operators

The operators create the necessary tables, stage the data, transform the data, and run checks on data quality. Connections and hooks are configured using Airflow's built-in functionality. All of the operators and tasks run SQL statements against the Redshift database.

Stage Operator

The stage operator loads JSON-formatted files from S3 into Amazon Redshift. The operator creates and runs a SQL COPY statement based on the parameters provided. The operator's parameters specify where in S3 the files are located and what the target table is. A minimal sketch of such an operator follows the list below.

  • Task to stage JSON data is included in the DAG and uses the RedshiftStage operator: There is a task that stages data from S3 to Redshift (running a Redshift COPY statement).
  • Task uses params: Instead of running a static SQL statement to stage the data, the task uses parameters to generate the COPY statement dynamically. It also contains a templated field that allows it to load timestamped files from S3 based on the execution time and run backfills.
  • Logging used: The operator contains logging at different steps of the execution.
  • The database connection is created by using a hook and a connection: The SQL statements are executed using an Airflow hook.
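
The sketch below illustrates such a stage operator, assuming Airflow 1.10-style import paths and hypothetical parameter names; the project's actual operator may be organized differently.

# Minimal sketch of a stage operator (imports assume Airflow 1.10-style paths).
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator

class StageToRedshiftOperator(BaseOperator):
    template_fields = ("s3_key",)  # templated field for timestamped S3 paths / backfills

    copy_sql = """
        COPY {table}
        FROM '{s3_path}'
        ACCESS_KEY_ID '{access_key}'
        SECRET_ACCESS_KEY '{secret_key}'
        FORMAT AS JSON '{json_path}'
    """

    def __init__(self, redshift_conn_id="redshift", aws_credentials=None,
                 table="", s3_bucket="", s3_key="", json_path="auto",
                 *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.aws_credentials = aws_credentials or {}  # dict with 'key'/'secret' (assumed format)
        self.table = table
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.json_path = json_path

    def execute(self, context):
        self.log.info("Staging %s from s3://%s/%s", self.table, self.s3_bucket, self.s3_key)
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        rendered_key = self.s3_key.format(**context)  # e.g. keyed by execution date
        redshift.run(StageToRedshiftOperator.copy_sql.format(
            table=self.table,
            s3_path=f"s3://{self.s3_bucket}/{rendered_key}",
            access_key=self.aws_credentials.get("key", ""),
            secret_key=self.aws_credentials.get("secret", ""),
            json_path=self.json_path,
        ))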

Fact and Dimension Operators

The dimension and fact operators make use of a SQL helper class to run data transformations. The operators take as input a SQL statement from the helper class and the target database on which to run the query. A target table that will contain the results of the transformation is also defined.

Dimension loads are done with the truncate-insert pattern, where the target table is emptied before the load. A parameter allows switching between insert modes when loading dimensions. Fact tables are massive, so they only allow append-type functionality. A minimal sketch of the dimension operator follows the list below.

  • Set of tasks using the dimension load operator is in the DAG: Dimensions are loaded with the LoadDimension operator.
  • A task using the fact load operator is in the DAG: Facts are loaded with the LoadFact operator.
  • Both operators use params: Instead of running static SQL statements, the tasks use parameters to generate the SQL dynamically.
  • The dimension task contains a param to allow switching between append and insert-delete functionality: The DAG allows switching between append-only and delete-load functionality.
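
A minimal sketch of a dimension-load operator with an append/truncate switch is shown below; the class and parameter names are assumptions, not the project's exact implementation.

# Minimal sketch of a dimension load with a truncate-insert / append switch.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator

class LoadDimensionOperator(BaseOperator):
    def __init__(self, redshift_conn_id="redshift", table="",
                 select_sql="", append_only=False, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table
        self.select_sql = select_sql    # SELECT statement from the SQL helper class
        self.append_only = append_only  # False = truncate-insert pattern

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        if not self.append_only:
            self.log.info("Truncating dimension table %s", self.table)
            redshift.run(f"TRUNCATE TABLE {self.table}")
        self.log.info("Loading dimension table %s", self.table)
        redshift.run(f"INSERT INTO {self.table} {self.select_sql}")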

Data Quality Operator

The data quality operator is used to run checks on the data itself. The operator's main functionality is to receive one or more SQL-based test cases along with the expected results and to execute the tests. For each test, the actual result is compared with the expected result, and if they do not match, the operator raises an exception; the task is retried and eventually fails.

For example, one test could be a SQL statement that checks whether a certain column contains NULL values by counting all the rows that have NULL in that column. We do not want any NULLs, so the expected result would be 0, and the test compares the SQL statement's outcome to that expected result. A minimal sketch of such an operator follows the list below.

  • A task using the data quality operator is in the DAG and at least one data quality check is done: Data quality check is done with correct operator.
  • The operator raises an error if the check fails: The DAG either fails or retries n times.
  • The operator is parametrized: Operator uses params to get the tests and the results, tests are not hard coded to the operator.
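
The sketch below shows one way to parametrize such checks; the test format (a list of dicts with "sql" and "expected" keys) and the example table/column names are assumptions.

# Minimal sketch of a parametrized data quality operator.
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator

class DataQualityOperator(BaseOperator):
    def __init__(self, redshift_conn_id="redshift", tests=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tests = tests or []  # list of {"sql": ..., "expected": ...} dicts (assumed format)

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for test in self.tests:
            result = redshift.get_first(test["sql"])[0]
            if result != test["expected"]:
                raise ValueError(f"Data quality check failed: {test['sql']} "
                                 f"returned {result}, expected {test['expected']}")
            self.log.info("Data quality check passed: %s", test["sql"])

# Example check: no NULL user ids in the fact table (table/column names assumed).
quality_tests = [
    {"sql": "SELECT COUNT(*) FROM songplays WHERE userid IS NULL", "expected": 0},
]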

Airflow UI views of DAG and plugins

The DAG follows the data flow provided in the instructions, all the tasks have defined dependencies, and the DAG begins with a start_execution task and ends with an end_execution task.

Project structure

Files in this repository:

  • dags : Folder at the root of the project, where DAGs and SubDAGs are stored
  • images : Folder at the root of the project, where images are stored
  • plugins/helpers : Contains a SQL helper class for easy querying
  • plugins/operators : Contains the custom operators to perform the DAG tasks
  • create_tables.sql : Contains SQL commands to create the necessary tables on Redshift
  • README : Readme file

Getting Started

Clone the repository into a local machine using

git clone https://github.com/vineeths96/Data-Engineering-Nanodegree

Prerequisites

These are the prerequisites to run the program.

  • python 3.7
  • AWS account
  • Apache Airflow

How to run

Follow the steps to extract and load the data into the data model.

  1. Set up Apache Airflow to run locally
  2. Navigate to Project 5 Data Pipelines folder
  3. Set up AWS Connection and Redshift Connection to Airflow using necessary values
  4. In Airflow, turn the DAG execution ON
  5. View the Web UI for detailed insights about the operation

Project Structure

├── [ 60K]  01-postgres-datamodel
│ ├── [ 12K] REPORT.md
│ ├── [ 384] dwh.cfg
│ ├── [ 47K] etl.ipynb
│ ├── [ 320] nbs
│ └── [ 256] src
├── [106K] 02-cassandra-datamodel
│ ├── [ 13K] REPORT.md
│ ├── [ 28K] cassandra-datamodel-ecs.ipynb
│ ├── [ 65K] cassandra-datamodel.ipynb
│ ├── [ 185] cassandra-docker-compose.yml
│ ├── [ 96] level1
│ └── [ 256] level2
├── [ 64K] 03-datalake
│ ├── [1.8K] _EMRJupyterNotebook.md
│ ├── [7.6K] _athena.md
│ ├── [6.4K] _emr.md
│ ├── [ 22K] _notes.md
│ ├── [ 460] ecs-params-spark.yml
│ ├── [ 460] ecs-params.yml
│ ├── [ 192] nbs
│ ├── [ 224] outputs
│ ├── [1.2K] spark-docker-compose.yml
│ ├── [ 99] spark.cfg
│ ├── [ 17K] spark_datalake.ipynb
│ ├── [4.7K] spark_etl.py
│ └── [ 640] src
├── [ 51K] 04-redshift-warehousing
│ ├── [ 11K] _notes.md
│ ├── [ 39K] _report.md
│ ├── [ 192] cfg
│ ├── [ 224] nbs
│ └── [ 448] src
├── [ 59K] 05-data-pipeline
│ ├── [ 55K] _notes.md
│ ├── [ 256] airflow
│ ├── [3.2K] airflow_pipeline.ipynb
│ ├── [ 192] myLabs
│ └── [ 96] src
├── [ 11K] IMAGES.md
├── [ 31K] README.md
└── [1.6M]  data
  ├── [ 42]  download.sh
  ├── [834K]  event_datafile.csv
  ├── [834K]  event_datafile_new.csv
  └── [ 456]  log_json_path.json

2.0M used in 19 directories, 27 files