Twitter Data Topic Analysis and Real-time Sentiment Analysis
Problem Statement
Twitter is one of the most popular social networking websites in the world. Every second, on average, around 8,000 tweets are posted on Twitter, which corresponds to over 450,000 tweets per minute, around 650 million tweets per day, and around 250 billion tweets per year. As claimed by the official site, Twitter data is the most comprehensive source of live, public conversation worldwide. Furthermore, Twitter allows developers to access tweet data through its APIs, which enable programmatic analysis of data in real time or back to the first Tweet in 2006. This project aims to extract tweet data, combine it with World Happiness Index data and Earth surface temperature data, and warehouse the result on AWS. The Twitter data extraction can be limited to specific topics/hashtags as required, which allows us to explore various domains.
In this project, we combine Twitter data, World Happiness Index data, and Earth surface temperature data to explore whether there is any correlation between them. The Twitter data is dynamic, while the other two datasets are static in nature. The general idea of this project is to extract Twitter data, analyze its sentiment, and use the resulting data together with the other datasets to gain insights. For instance, we could answer interesting questions such as: are positive or negative tweets correlated with the happiness index of the country a person is residing in, or is there a relationship between the sentiment of a tweet and the temperature change in the country a user is living in?
Build an ETL pipeline that sources data from Twitter, applies transformations, and stores the result in a Data Warehouse
We will use the Twitter API to get data on a particular topic (dark-net for now) and apply a filter to the request so that we only fetch data from the last 7 days. We will then schedule our pipeline to run at a 7-day interval, thus avoiding fetching and storing the same data twice. The fetched data is first put into a NoSQL database, CouchDB in our case. The reason for landing it in an intermediary database rather than straight in the data warehouse is to perform transformation operations there. This gives us more flexibility to run operations such as filtering, data cleaning, or building summaries before the data is inserted into the data warehouse.
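A minimal sketch of this extraction step, assuming Tweepy for the Twitter API v2 recent-search endpoint and the `couchdb` client library; the bearer token, CouchDB URL, database name, and search query are placeholders:

```python
import couchdb
import tweepy

# Placeholders: supply your own credentials and CouchDB endpoint.
BEARER_TOKEN = "YOUR_TWITTER_BEARER_TOKEN"
COUCHDB_URL = "http://admin:password@localhost:5984/"

client = tweepy.Client(bearer_token=BEARER_TOKEN)
server = couchdb.Server(COUCHDB_URL)
db = server["tweets"] if "tweets" in server else server.create("tweets")

# The recent-search endpoint already restricts results to the last 7 days,
# which matches the pipeline's 7-day schedule.
response = client.search_recent_tweets(
    query="darknet -is:retweet lang:en",
    tweet_fields=["created_at", "author_id", "geo"],
    max_results=100,
)

# Stage raw tweets in CouchDB; transformations happen later, before the warehouse load.
for tweet in response.data or []:
    doc_id = str(tweet.id)
    if doc_id not in db:  # tweet id doubles as document id, so reruns skip duplicates
        db.save({
            "_id": doc_id,
            "text": tweet.text,
            "created_at": tweet.created_at.isoformat(),
            "author_id": str(tweet.author_id),
        })
```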
Query
- Create an Amazon S3 bucket for customer data
- Copy data into Amazon S3 bucket
- Create a database in the AWS Glue Data Catalog
- Create a table in the AWS Glue Data Catalog with proper schema, format, and Amazon S3 location
- Use Amazon Athena to query using SQL
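A rough sketch of these steps with boto3; the bucket, database, file, and table names are placeholder assumptions, and outside us-east-1 the bucket creation call also needs a `CreateBucketConfiguration` with a `LocationConstraint`:

```python
import time
import boto3

BUCKET = "my-twitter-project-data"            # placeholder bucket name
DATABASE = "twitter_analytics"                # placeholder Glue database name
ATHENA_OUTPUT = f"s3://{BUCKET}/athena-results/"

s3 = boto3.client("s3")
glue = boto3.client("glue")
athena = boto3.client("athena")

# 1. Create the bucket and copy a local extract into it.
s3.create_bucket(Bucket=BUCKET)
s3.upload_file("data/happiness.csv", BUCKET, "raw/happiness/happiness.csv")

# 2. Register a database in the AWS Glue Data Catalog.
glue.create_database(DatabaseInput={"Name": DATABASE})

def run_query(sql: str) -> str:
    """Submit a query to Athena and block until it finishes."""
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)

# 3. Define the table via Athena DDL, then query it with SQL.
run_query(f"""
CREATE EXTERNAL TABLE IF NOT EXISTS {DATABASE}.happiness (
    country string, year int, score double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://{BUCKET}/raw/happiness/'
TBLPROPERTIES ('skip.header.line.count' = '1')
""")
print(run_query(f"SELECT country, score FROM {DATABASE}.happiness LIMIT 10"))
```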
Visualization
- Create QuickSight data source for Amazon Athena
- Create QuickSight data set for our data
- Create QuickSight analysis
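The QuickSight pieces are usually set up in the console, but the data source can also be registered programmatically. A hedged sketch with boto3, assuming the default Athena workgroup `primary` and a placeholder account ID; the data set and analysis are then built on top of this source:

```python
import boto3

quicksight = boto3.client("quicksight")
ACCOUNT_ID = "123456789012"  # placeholder AWS account id

# Register Athena as a QuickSight data source for the project's tables.
quicksight.create_data_source(
    AwsAccountId=ACCOUNT_ID,
    DataSourceId="twitter-athena-source",
    Name="Twitter Athena",
    Type="ATHENA",
    DataSourceParameters={"AthenaParameters": {"WorkGroup": "primary"}},
)
```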
Streaming Data
- Create an Amazon Kinesis Data Stream for Twitter data
- Create an Amazon S3 bucket/folder to persist the Twitter data
- Create an Amazon Kinesis Data Firehose delivery stream to
- Read from Amazon Kinesis Data Stream
- Write to Amazon S3
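A sketch of the streaming path with boto3; the stream names, ARNs, and region are placeholders, and the IAM role is assumed to already grant Firehose read access to Kinesis and write access to S3:

```python
import json
import boto3

kinesis = boto3.client("kinesis")
firehose = boto3.client("firehose")

STREAM = "twitter-stream"                                            # placeholder
BUCKET_ARN = "arn:aws:s3:::my-twitter-project-data"                  # placeholder
ROLE_ARN = "arn:aws:iam::123456789012:role/firehose-delivery-role"   # placeholder

# 1. Data stream that the tweet producer writes into.
kinesis.create_stream(StreamName=STREAM, ShardCount=1)

# 2. Firehose delivery stream that reads from Kinesis and persists batches to S3.
firehose.create_delivery_stream(
    DeliveryStreamName="twitter-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": f"arn:aws:kinesis:us-east-1:123456789012:stream/{STREAM}",
        "RoleARN": ROLE_ARN,
    },
    S3DestinationConfiguration={
        "RoleARN": ROLE_ARN,
        "BucketARN": BUCKET_ARN,
        "Prefix": "streaming/tweets/",
    },
)

# 3. Producer side: push each incoming tweet onto the data stream.
def publish_tweet(tweet: dict) -> None:
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(tweet).encode("utf-8"),
        PartitionKey=str(tweet.get("author_id", "unknown")),
    )
```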
Enrichment
- Create an Amazon S3 bucket/folder for the enriched data
- Create an AWS IAM role for the AWS Lambda function with proper permissions
- Create an AWS Lambda function that
- Reads from Amazon S3
- Calls Amazon Comprehend to determine customer sentiment and extract key phrases
- Writes output to enriched Amazon S3 location
- Create an Amazon S3 event to trigger the Lambda function on new data arrival
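A sketch of the Lambda handler, assuming the raw objects delivered by Firehose are newline-delimited JSON tweets and using placeholder bucket/prefix names; the function's IAM role must allow S3 read/write and the Comprehend calls:

```python
import json
import urllib.parse
import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")

ENRICHED_BUCKET = "my-twitter-project-data"   # placeholder
ENRICHED_PREFIX = "enriched/tweets/"          # placeholder output location

def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events on the raw streaming prefix."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the newly arrived batch of tweets (one JSON object per line).
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        enriched = []
        for line in filter(None, body.splitlines()):
            tweet = json.loads(line)
            text = tweet.get("text", "")[:4900]  # keep within Comprehend's sync input limit

            # Ask Comprehend for sentiment and key phrases.
            sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
            phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")

            tweet["sentiment"] = sentiment["Sentiment"]
            tweet["sentiment_scores"] = sentiment["SentimentScore"]
            tweet["key_phrases"] = [p["Text"] for p in phrases["KeyPhrases"]]
            enriched.append(json.dumps(tweet))

        # Write the enriched batch to the enriched S3 location.
        s3.put_object(
            Bucket=ENRICHED_BUCKET,
            Key=ENRICHED_PREFIX + key.split("/")[-1],
            Body="\n".join(enriched).encode("utf-8"),
        )
```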
Insight
- Create tables and views in the AWS Glue Data Catalog with proper schema, format, and Amazon S3 location
- Sentiment and phrases data
- Sentiment and phrases joined with customer status data
- Create QuickSight data sets for our data
- Create visualizations in Amazon QuickSight
- Word cloud for all Tweets
- Word cloud for Tweets with negative sentiment
- Word cloud for Tweets with negative sentiment for users who canceled
- Histogram of key terms in Tweets with negative sentiment for users who canceled
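To illustrate the joined view, a hedged sketch submitted through Athena; the `tweet_sentiment` and `customer_status` table and column names are assumptions standing in for the catalog tables created above, and the QuickSight data sets for the word clouds and histogram can then be pointed at this view:

```python
import boto3

athena = boto3.client("athena")
DATABASE = "twitter_analytics"                             # placeholder, as above
OUTPUT = "s3://my-twitter-project-data/athena-results/"    # placeholder results location

# View joining per-tweet sentiment with customer status for the negative-sentiment visuals.
create_view = f"""
CREATE OR REPLACE VIEW {DATABASE}.negative_tweets_by_status AS
SELECT t.tweet_id, t.sentiment, t.key_phrases, c.status
FROM {DATABASE}.tweet_sentiment t
JOIN {DATABASE}.customer_status c ON t.author_id = c.customer_id
WHERE t.sentiment = 'NEGATIVE'
"""
athena.start_query_execution(
    QueryString=create_view,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT},
)
```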
Data Model
Project Structure
├── [ 61K] 01-sa-main.ipynb
├── [ 34K] 02-sa-preprocessing.ipynb
├── [ 37K] 03-sa-twitter-couchdb.ipynb
├── [4.1K] README.md
├── [ 17K] airflow
│ ├── [ 11K] dags
│ │ └── [7.2K] dag.py
│ ├── [ 326] run.sh
│ └── [6.0K] sql
│ ├── [2.6K] create_tables.sql
│ ├── [ 347] happiness_insert.sql
│ ├── [ 369] sources_insert.sql
│ ├── [ 748] temperature_insert.sql
│ ├── [ 882] time_insert.sql
│ ├── [ 808] tweets_insert.sql
│ └── [ 70] users_insert.sql
├── [6.5K] assets
│ └── [6.4K] _DATADICT.md
├── [ 471] config.cfg
├── [ 141] data
│ └── [ 45] download.sh
├── [ 250] download.sh
└── [1.7K] requirements.txt
332K used in 7 directories, 22 files