Reddit Submissions, Authors and Subreddits Analysis
Problem Statement
Reddit is "the front page of the internet" and hosts discussions on nearly every topic imaginable, which makes it a great candidate for analytics. This project takes a sample of a publicly available Reddit dump and loads it into an AWS Redshift warehouse so that Data Scientists can make use of the content, for example to develop a recommender system that finds the most suitable subreddit for a given purpose.
The goal of this project is to build a data warehouse for analyzing trending and new subreddits, orchestrated with Apache Airflow. A sketch of how such a DAG could be wired up follows below.
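To make the orchestration concrete, here is a minimal sketch of what the DAG could look like, covering the three stages described in the next paragraph. The DAG id, task ids, schedule, and the placeholder callables are all hypothetical illustrations, not the project's actual implementation:

```python
# Minimal sketch of the orchestration; names and callables are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def fetch_subreddits(**context):
    """Placeholder: pull subreddit listings from the Reddit API and dump them to S3 as JSON."""
    ...


def run_spark_job(**context):
    """Placeholder: submit the PySpark job to the EMR cluster (JSON in, parquet out)."""
    ...


def load_redshift(**context):
    """Placeholder: COPY the parquet files from S3 into Redshift and build fact/dimension tables."""
    ...


with DAG(
    dag_id="reddit_subreddits_pipeline",  # hypothetical name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="fetch_subreddits", python_callable=fetch_subreddits)
    transform = PythonOperator(task_id="process_on_emr", python_callable=run_spark_job)
    load = PythonOperator(task_id="load_redshift", python_callable=load_redshift)

    # Each stage runs only after the previous one succeeds.
    extract >> transform >> load
```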
The project uses the Reddit API to fetch subreddits and stores them on AWS S3 in JSON format. Data processing happens on an AWS EMR cluster using PySpark, and the processed data is written back to S3 in parquet format. Finally, the data is loaded into AWS Redshift and denormalized into fact and dimension tables.
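The EMR step might look roughly like the sketch below: read the raw JSON from S3, trim it down, and write parquet back for the Redshift COPY. The bucket paths and the selected columns are assumptions for illustration; the real jobs live in the project's src/ scripts:

```python
# Sketch of the JSON-to-parquet processing step; paths and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reddit-json-to-parquet").getOrCreate()

# Read the raw JSON dumps that the extraction step wrote to S3.
raw = spark.read.json("s3://my-reddit-bucket/raw/subreddits/")  # hypothetical bucket

# Keep only the columns the warehouse needs and drop duplicate subreddits.
subreddits = (
    raw.select("id", "display_name", "subscribers", "created_utc")
       .dropDuplicates(["id"])
)

# Write the cleaned data back to S3 as parquet for the Redshift load.
subreddits.write.mode("overwrite").parquet("s3://my-reddit-bucket/processed/subreddits/")
```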
Data Sources
Airflow
Pipeline 1
Setting up the Airflow environment
Output
Pipeline 2
TODO: This pipeline has not been tested yet.
Pipeline 3
Architecture
Project Structure
├── [ 18K] 01-sa.ipynb
├── [4.6K] README.md
├── [ 16K] REPORT.md
├── [ 768] airflow
│ ├── [ 288] dags
│ ├── [ 160] plugins
│ └── [ 160] plugins_2
├── [ 148] data
│ └── [ 52] download.sh
├── [134K] nbs
│ ├── [ 55K] create_dates_and_holiday_dataset.ipynb
│ └── [ 79K] data-model-explained.ipynb
└── [ 14K] src
├── [1.8K] data_quality_queries.py
├── [1.4K] download_datasets.sh
├── [1.5K] make_chunks.py
├── [1.4K] preprocess_authors.py
├── [ 922] sample_dataset.sh
├── [ 192] sql
└── [6.4K] sql_queries.py
187K used in 8 directories, 12 files