Skip to main content

Movie Review Sentiment Analysis Pipeline

Build a pipeline that expresses the fact artist review sentiment and film review sentiment, based on the data provided by IMDb and TMDb.

The goal of our Data Pipeline is to publish a PDF report to S3 with the following summarised information:

  • Top 10 Films
  • Worst 10 Films
  • Review Sentiment Distibution
  • Top 10 Actors/Actresses in Best Reviewed Films
  • IMDb Average Voting vs TMDb Sentiment Reviews through Years

Process steps:

  1. Modify the variables in DAG
  2. Export the Variables before starting airflow
  3. Run the DAG

The project follows the follow steps:

  • Step 1: Scope the Project and Gather Data
  • Step 2: Explore and Assess the Data
  • Step 3: Define the Data Model
  • Step 4: Run ETL to Model the Data
  • Step 5: Complete Project Write Up

Problem Statement

The project expresses the fact artist review sentiment and film review sentiment, based on the data provided by IMDb and TMDb

Data Sources

Choice of Technology

The present solution allows Data Analysts to perform roll-ups and drill-downs into film review sentiment facts linked to both films and actors.

Since the raw data provided into S3 and reported on demand to the Data Analysis Team who can then perform further analysis on the database, a single write step is required, whereas many reads and aggregations will be performer.

Given those requirements, the choice of technology was the following:

  1. AWS S3 to store the raw files from IMDb and TMDb films, reviews and casting.
  2. AWS Redshift to produce a Data Warehouse with the required dimension and fact tables.
  3. Tensorflow to allow us training a model and running the classification
  4. Apache Airflow to automate our Data Piepeline.

Pipeline