Lab: Real-Time Apache Log Analytics with Kinesis
Objective
Build ETL pipelines with Kinesis to process Apache log data
Direct PUT Pipeline
Direct PUT is a method to send data directly from clients to Kinesis Data Firehose. In this part, you'll create a Firehose Delivery Stream and use a script to send data to Firehose with Direct PUT via the AWS SDK for Python (boto3). Firehose receives the records, delivers them to a configured S3 bucket/folder, and partitions the incoming records based on their arrival date and time.
- Create the Kinesis Firehose Delivery Stream
- Send the data using the Boto3 firehose client (see the sketch below)
- Check the ingested data in S3
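A minimal sketch of the Direct PUT step, assuming a delivery stream named apache-logs-stream (a placeholder; use the stream you created above):

```python
# Minimal Direct PUT sketch. The stream name "apache-logs-stream"
# is a placeholder for the delivery stream created in this lab.
import json

import boto3

firehose = boto3.client("firehose")

# One Apache-style log record, serialized as JSON.
record = {
    "host": "203.0.113.10",
    "request": "GET /index.html HTTP/1.1",
    "response": 200,
    "bytes": 4028,
}

# Firehose expects raw bytes; the trailing newline keeps records
# separable after Firehose concatenates them into S3 objects.
firehose.put_record(
    DeliveryStreamName="apache-logs-stream",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```

For higher throughput, put_record_batch accepts up to 500 records per call.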
Send via Kinesis Data Streams
- Create Kinesis Data Stream
- Create Firehose Delivery Stream
- Set up Amazon Kinesis Data Generator (or send records directly, as sketched below)
- Check the ingested data in S3
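KDG generates the records for you, but the equivalent boto3 call is short. A hedged sketch, assuming a data stream named apache-logs (placeholder):

```python
# Sketch of what KDG does under the hood: put a record into the
# Kinesis Data Stream ("apache-logs" is a placeholder name).
# Firehose, configured with this stream as its source, picks the
# records up from there.
import json

import boto3

kinesis = boto3.client("kinesis")

record = {"host": "203.0.113.10", "request": "GET / HTTP/1.1", "response": 200}

kinesis.put_record(
    StreamName="apache-logs",
    Data=(json.dumps(record) + "\n").encode("utf-8"),
    PartitionKey=record["host"],  # determines the target shard
)
```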
Anomaly Detection with Kinesis and Glue Jobs
In this module, you will learn how to ingest, process, and consume streaming data using AWS serverless services such as Kinesis Data Streams, Glue, S3, and Athena. To simulate the streaming input, we will use the Kinesis Data Generator (KDG).
- Create the infra using the ./KinesisGlueETL/template.yml template
- Set up the Kinesis Data Stream
- Create a table for the Kinesis stream source in the Glue Data Catalog
- Create and trigger the Glue streaming job
- Trigger the streaming data from KDG
- Verify the Glue streaming job
- Create a Glue crawler for the transformed data
- Trigger the abnormal transaction data from KDG
- Detect abnormal transactions using an ad-hoc Athena query (see the sketch after this list)
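As an illustration of the final step, the sketch below runs an ad-hoc Athena query through boto3. The database, table, column names, and results bucket are all placeholders for whatever the Glue crawler actually created:

```python
# Ad-hoc Athena query sketch. "kinesislab", "transformed_data",
# "transactionamount", and the results bucket are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT *
        FROM transformed_data
        WHERE transactionamount > 1000
    """,
    QueryExecutionContext={"Database": "kinesislab"},
    ResultConfiguration={"OutputLocation": "s3://YOUR-RESULTS-BUCKET/athena/"},
)
print(response["QueryExecutionId"])
```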
Log analytics with Kinesis Firehose
Log analytics is a use case that allows you to analyze log data from websites, mobile devices, servers, sensors, and more for a wide variety of applications such as security event monitoring, digital marketing, application monitoring, fraud detection, ad tech, gaming, and IoT. In this lab, you will learn how to ingest and deliver Apache logs to Amazon S3 using Amazon Kinesis Data Firehose without managing any infrastructure. You can then use Amazon Athena to query the log files to understand access patterns and website performance issues.
- Create the infra using the ./LogAnalyticsFirehose/template.yml template
- Send Apache access logs to Kinesis Firehose
- Check the ingested data in S3 (see the sketch below)
- Analyze the data in Athena
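A quick way to check the ingested objects is to list them under the date/time prefix Firehose writes by default. The bucket name and year prefix below are placeholders:

```python
# List what Firehose delivered. "YOUR-LOG-BUCKET" and the "2024/"
# prefix are placeholders for the default arrival-time partitioning.
import boto3

s3 = boto3.client("s3")

resp = s3.list_objects_v2(Bucket="YOUR-LOG-BUCKET", Prefix="2024/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```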
Streaming ETL using Kinesis Firehose
- Create the infra using the ./FirehoseKinesisStreamETL/template.yml template
- Send Apache access logs to Kinesis Firehose
- Check the transformed data in S3
- Analyze the data in Athena
Optimize data streaming for storage and performance
While Apache access logs can provide insights into web application usage, analyzing log files can be challenging given the volume of data that a busy web application generates. Queries over JSON data become slower as data volumes grow. We can address this by converting the JSON input data into Apache Parquet or Apache ORC. Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON. Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3.
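For reference, a hedged sketch of what that record format conversion looks like when configured through boto3; in this lab the CloudFormation template sets it up, and every name and ARN below is a placeholder:

```python
# Sketch of enabling Firehose record format conversion (JSON ->
# Parquet). All stream, role, bucket, database, and table names
# are placeholders; the lab's template configures this instead.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="apache-logs-parquet",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        "BucketARN": "arn:aws:s3:::YOUR-LOG-BUCKET",
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Parse the incoming JSON records...
            "InputFormatConfiguration": {
                "Deserializer": {"OpenXJsonSerDe": {}}
            },
            # ...and write them out as Parquet.
            "OutputFormatConfiguration": {
                "Serializer": {"ParquetSerDe": {}}
            },
            # The output schema comes from a Glue Data Catalog table.
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
                "DatabaseName": "kinesislab",
                "TableName": "apache_logs",
            },
        },
    },
)
```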
In this module, we will show you how to convert the incoming tab-delimited files into JSON using an AWS Lambda function (sketched below), and then use the record format conversion feature of Firehose to convert the JSON data into Parquet format before sending it to S3. We will use AWS Glue to store the metadata. Finally, we will query the Parquet-formatted data using Amazon Athena.
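A minimal sketch of such a transform Lambda; the tab-delimited field names are illustrative and should match the Glue table schema used for the Parquet conversion:

```python
# Firehose transform sketch: decode each tab-delimited record, map
# it onto named fields, and hand JSON back to Firehose. FIELDS is
# illustrative; align it with your Glue table schema.
import base64
import json

FIELDS = ["host", "ident", "authuser", "datetime", "request", "response", "bytes"]

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        row = dict(zip(FIELDS, line.split("\t")))
        payload = (json.dumps(row) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # tells Firehose the transform succeeded
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```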
- Create the infra using the ./FirehoseKinesisStreamETLParquet/template.yml template
- Send Apache access logs to Kinesis Firehose
- Check the ingested data in S3
- Run the crawler (see the sketch after this list)
- Analyze the Glue table
- Query the real-time data
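The crawler step can also be driven from boto3. A sketch, assuming the template created a crawler named apache-logs-parquet-crawler (placeholder):

```python
# Start the Glue crawler and check its state. The crawler name is
# a placeholder for whatever the template created.
import boto3

glue = boto3.client("glue")
glue.start_crawler(Name="apache-logs-parquet-crawler")

# Wait for the crawler to finish before querying the table.
state = glue.get_crawler(Name="apache-logs-parquet-crawler")["Crawler"]["State"]
print(state)  # RUNNING -> STOPPING -> READY
```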
Files
├── [8.1K] 01-sa-kinesis-data-firehose.ipynb
├── [3.5K] 02-sa-kinesis-streaming-etl.ipynb
├── [4.4K] 03-sa-kinesis-log-analytics.ipynb
├── [4.2K] 04-sa-kinesis-etl-optimized.ipynb
├── [ 16K] 05-sa-kinesis-etl-glue.ipynb
├── [ 11K] FirehoseKinesisStreamETL
│ ├── [6.1K] datagen.yml
│ └── [4.6K] template.yml
├── [ 23K] FirehoseKinesisStreamETLParquet
│ ├── [6.1K] datagen.yml
│ ├── [8.1K] template-ref.yml
│ └── [8.2K] template-sparsh.yml
├── [ 18K] KinesisGlueETL
│ └── [ 18K] template.yml
├── [8.1K] LogAnalyticsFirehose
│ ├── [6.1K] datagen.yml
│ └── [1.9K] template.yml
├── [1.1K] Makefile
└── [4.7K] README.md
102K used in 4 directories, 15 files