ACLED
Objective
Build a data pipeline using the Armed Conflict Location & Event Data Project (ACLED) API
Problem Statement
The Armed Conflict Location & Event Data Project (ACLED) collects real-time data on the locations, dates, actors, fatalities, and types of all reported political violence and protest events around the world. Imagine that you are working for a media organization and want to bring a processed version of the ACLED data to your analysts so that they can generate war- and conflict-related insights for their stories.
Your goal is to design and develop a data pipeline for it.
Architecture Diagram
What you'll build
- Pull data from the ACLED API
- Ingest the CSV data into a Postgres database
- Transform the data with PySpark
- Store the intermediary and final data in S3
- Orchestrate the pipeline with Airflow
- Create and trigger a Glue crawler using an Airflow operator
- Run the analysis in Athena
- Set up and use connections and variables in Airflow
- Send an email to stakeholders about the pipeline execution status
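The extraction step above can be sketched as a small standalone script. The endpoint URL and query parameters (`key`, `email`, `country`, `limit`) are assumptions based on ACLED's public API documentation, and the country value is illustrative; adjust them to match your registered account and query needs:

```python
# Minimal sketch of the ACLED pull + CSV landing step (endpoint and
# parameter names are assumptions; verify against the ACLED API docs).
import csv
import json
import urllib.parse
import urllib.request

ACLED_URL = "https://api.acleddata.com/acled/read"  # assumed endpoint


def build_params(api_key, email, country, limit=500):
    """Assemble the query parameters ACLED expects (API key + registered email)."""
    return {
        "key": api_key,
        "email": email,
        "country": country,
        "limit": limit,
    }


def fetch_events(api_key, email, country, limit=500):
    """Call the API and return the list of event records."""
    query = urllib.parse.urlencode(build_params(api_key, email, country, limit))
    with urllib.request.urlopen(f"{ACLED_URL}?{query}", timeout=30) as resp:
        payload = json.load(resp)
    return payload.get("data", [])


def write_csv(events, path):
    """Land the raw events as CSV for the Postgres ingestion step."""
    if not events:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=events[0].keys())
        writer.writeheader()
        writer.writerows(events)


if __name__ == "__main__":
    import os

    # Credentials come from the environment variables set up below.
    events = fetch_events(os.environ["ACLED_KEY"], os.environ["ACLED_USER"], "Ukraine")
    write_csv(events, "acled_raw.csv")
```

In the pipeline this logic would run inside an Airflow task rather than as a script, with the CSV then copied into Postgres and S3.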
DAG Run instructions
- Create a free account on ACLED
- Get the API key and add it to the DAG
- Install the Python libraries mentioned in the DAG
- Set the environment variables: the ACLED key and user, the S3 bucket, and the AWS credentials
- Create a Glue crawler named acled
- Modify the DAG: update the name and owner, and change catchup, scheduling, and other settings as required
- Run the DAG
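Putting the setup steps together, the DAG wiring might look like the sketch below. This is an illustrative configuration, not the project's actual DAG: the task IDs, variable names (`acled_key`, `acled_user`), recipient address, and schedule are assumptions, and only the pull, crawler, and notification tasks are shown. The Glue crawler name `acled` matches the crawler created in the steps above.

```python
# Hedged sketch of the ACLED DAG: pull data, trigger the Glue crawler,
# then email stakeholders. Names and schedule are illustrative assumptions.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.email import EmailOperator
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator


def pull_acled_data(**_):
    # Read credentials from Airflow Variables instead of hard-coding them.
    api_key = Variable.get("acled_key")
    user = Variable.get("acled_user")
    ...  # call the ACLED API and land the CSV in Postgres / S3


with DAG(
    dag_id="acled_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # adjust scheduling and catchup as required
    catchup=False,
) as dag:
    pull = PythonOperator(
        task_id="pull_acled_data",
        python_callable=pull_acled_data,
    )

    crawl = GlueCrawlerOperator(
        task_id="run_glue_crawler",
        config={"Name": "acled"},  # the crawler created in the setup steps
    )

    notify = EmailOperator(
        task_id="notify_stakeholders",
        to="analysts@example.com",  # placeholder recipient
        subject="ACLED pipeline run complete",
        html_content="The ACLED DAG finished; data is queryable in Athena.",
    )

    pull >> crawl >> notify
```

The `GlueCrawlerOperator` requires the `apache-airflow-providers-amazon` package and an AWS connection configured in Airflow; the `EmailOperator` requires SMTP settings in the Airflow config.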
NOTE
We first tried pipeline 1, but due to the ACLED API request limits we went with pipeline 2 instead.