Lab: Lambda CSV to Parquet
Create an S3 bucket and an IAM user with a user-defined policy. Create a Lambda layer and a Lambda function, and add the layer to the function. Add an S3 trigger for automatic CSV-to-Parquet conversion, register the results in the Glue database, and query them with Athena.
Objective
Serverless Data Conversion and Loading into Data Lake
Problem Statement
Prolambda sources its data from Salesforce, and the source extraction process stores the raw data in Prolambda's data lake.
Currently, Prolambda's data engineering team uses a Spark cluster that scans S3 every few minutes, loads the raw CSV files, converts them to Parquet with PySpark, and saves the output to the data lake's refined layer. A Glue crawler scheduled to run every 5 minutes then picks up the latest changes in the source data and updates the Glue database.
The process works, but it is costing the company a lot of money: the Spark cluster has to stay active around the clock, and the Glue crawler's DPU consumption adds another large bill.
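For context, the existing always-on batch job presumably does something along these lines (a minimal sketch; the S3 paths are illustrative placeholders, not the lab's actual bucket names):

```python
from pyspark.sql import SparkSession

# Rough illustration of the current Spark-based conversion job.
spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Scan the raw layer for CSV files and load them into a DataFrame.
raw_df = spark.read.option("header", "true").csv("s3://prolambda-datalake/raw/salesforce/")

# Rewrite the data as Parquet into the refined layer.
raw_df.write.mode("append").parquet("s3://prolambda-datalake/refined/salesforce/")
```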
Your goal is to design and develop a data pipeline that serves the same purpose at a reduced cost.
Use Cases
- Design and develop a data pipeline for 1) CSV-to-Parquet conversion and 2) Glue catalog and database updates
Architecture Diagram
What you'll build
- Create an S3 bucket
- Create an IAM user with a user-defined policy
- Create and upload the AWS Wrangler Lambda layer
- Develop the Python function for CSV-to-Parquet conversion (see the sketch after this list)
- Write the Python code to update the Glue database
- Add the layer to the function
- Add an S3 trigger to the Lambda function for automation
- Query the refined data with Athena (see the example query below)
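As a rough sketch of the conversion step, the Lambda handler below uses AWS Wrangler (the AWS SDK for pandas, supplied via the Lambda layer) to read the uploaded CSV, write it back as Parquet, and register the table in the Glue Data Catalog in the same call, which removes the need for a scheduled crawler. The bucket prefix, database, and table names are placeholders you would replace with your own:

```python
import urllib.parse

import awswrangler as wr

# Placeholder names for illustration only; use your own bucket, database, and table.
REFINED_PREFIX = "s3://prolambda-datalake/refined/salesforce/"
GLUE_DATABASE = "prolambda_db"
GLUE_TABLE = "salesforce_refined"


def lambda_handler(event, context):
    # The S3 trigger passes the uploaded object's location in the event payload.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

    # Read the raw CSV straight from S3 into a pandas DataFrame.
    df = wr.s3.read_csv(path=f"s3://{bucket}/{key}")

    # Write Parquet into the refined layer and create/update the Glue table
    # in the same call, so no separate crawler run is required.
    result = wr.s3.to_parquet(
        df=df,
        path=REFINED_PREFIX,
        dataset=True,
        database=GLUE_DATABASE,
        table=GLUE_TABLE,
        mode="append",
    )

    return {"status": "ok", "written": result["paths"]}
```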
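Once the table exists in the Glue database, you can verify the pipeline from Athena. One way to do this from Python is with AWS Wrangler's Athena helper; the database and table names below are the same hypothetical ones used in the handler sketch:

```python
import awswrangler as wr

# Run an Athena query against the Glue-registered table and return a DataFrame.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM salesforce_refined LIMIT 10",
    database="prolambda_db",
)
print(df.head())
```

You can run the equivalent SQL directly in the Athena console if you prefer not to query programmatically.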