Lab: Real-Time Streaming Analytics with Apache Kafka and Spark Streaming
Activity 1
Basketista is a retail company that wants to build a real-time analytics solution to gain insight into its orders and support business decisions in near real time.
In this activity, you will:
- Set up the environment: Postgres, Kafka, and the Spark Streaming application
- Create a Kafka producer and send data to a Kafka topic (see the producer sketch after this list)
- Build the ETL logic in PySpark and integrate it into the Spark stream
- Create a Kafka consumer and store the processed data in Postgres (see the streaming sketch after this list)
- Connect to Postgres and validate the data load
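The following is a minimal sketch of the two sides of the pipeline, assuming the kafka-python package, a local broker on localhost:9092, an "orders" topic, and a Postgres database named basketista; all names, schemas, and credentials here are illustrative, not the lab's actual ones. First, the producer side:

```python
# Hypothetical Kafka producer: serializes order events as JSON and sends
# them to an "orders" topic on a local broker (names are assumptions).
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Illustrative order event; the lab would send records from the source data.
order = {"order_id": 1, "customer_id": 42, "status": "PENDING", "amount": 99.5}
producer.send("orders", value=order)
producer.flush()
```

On the consuming side, a Spark Structured Streaming job can read the topic, apply the ETL logic, and write each micro-batch to Postgres over JDBC (this requires the Kafka connector and the Postgres JDBC driver on the Spark classpath):

```python
# Structured Streaming sketch: Kafka source -> parse JSON -> JDBC sink.
# Topic, schema, table, and credentials are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.appName("basketista-stream").getOrCreate()

order_schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("customer_id", IntegerType()),
    StructField("status", StringType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "orders")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
orders = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), order_schema).alias("o"))
          .select("o.*"))

def write_to_postgres(batch_df, batch_id):
    # Append each micro-batch to a Postgres table over JDBC.
    (batch_df.write
     .format("jdbc")
     .option("url", "jdbc:postgresql://localhost:5432/basketista")
     .option("dbtable", "orders_processed")
     .option("user", "postgres")
     .option("password", "postgres")
     .option("driver", "org.postgresql.Driver")
     .mode("append")
     .save())

query = (orders.writeStream
         .foreachBatch(write_to_postgres)
         .option("checkpointLocation", "chk/orders")  # enables recovery
         .start())
query.awaitTermination()
```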
Activity 2
The objective is to analyze the "retail_db" dataset with PySpark: report on the total completed orders and perform customer and product analytics. This activity is designed to help you understand the retail database and generate the required reports from it.
Problem Statement: Customers purchase products or services from Amazon for consumption and use. Amazon usually sells products and services in-store, but some are sold online or over the phone and shipped to the customer. Clothing, medicine, supermarkets, and convenience stores are examples of its retail operations.
Perform the following tasks on the provided dataset using PySpark; an illustrative PySpark sketch follows each scenario:
Scenario 1
- Explore the customer records saved in the "customers-tab-delimited" directory on HDFS
- Show the customer information for those who live in California
- Include the customer's full name in the output
- Save the results in the result/scenario1/solution folder
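One way this could look in PySpark, assuming the usual retail_db customer columns for the tab-delimited file (the 9-column schema below is an assumption; verify it against your data):

```python
# Scenario 1 sketch: California customers with their full name.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws

spark = SparkSession.builder.appName("scenario1").getOrCreate()

# Assumed retail_db customer schema.
customers = (spark.read
             .option("sep", "\t")
             .csv("customers-tab-delimited")
             .toDF("customer_id", "fname", "lname", "email", "password",
                   "street", "city", "state", "zipcode"))

result = (customers.filter(col("state") == "CA")
          .select(concat_ws(" ", "fname", "lname").alias("customer_full_name")))

result.write.mode("overwrite").csv("result/scenario1/solution")
```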
Scenario 2
- Explore the order records saved in the "orders parquet" directory on HDFS
- Show all orders with the order status value "COMPLETE"
- Include the order number, order date, and order status in the output
- Save the data in the result/scenario2/solution directory on HDFS
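A possible solution, assuming the retail_db order columns (order_id, order_date, order_status):

```python
# Scenario 2 sketch: COMPLETE orders only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("scenario2").getOrCreate()

orders = spark.read.parquet("orders parquet")  # directory name as given

result = (orders.filter(col("order_status") == "COMPLETE")
          .select("order_id", "order_date", "order_status"))

result.write.mode("overwrite").parquet("result/scenario2/solution")
```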
Scenario 3
- Explore the customer records saved in the "customers-tab-delimited" directory on HDFS
- Produce a list of all customers who live in the city of "Caguas"
- The result should only contain records with the value "Caguas" for the customer city
- Save the results in the result/scenario3/solution folder
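Reusing the assumed customer schema from the Scenario 1 sketch:

```python
# Scenario 3 sketch: customers living in Caguas.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("scenario3").getOrCreate()

customers = (spark.read
             .option("sep", "\t")
             .csv("customers-tab-delimited")
             .toDF("customer_id", "fname", "lname", "email", "password",
                   "street", "city", "state", "zipcode"))

result = customers.filter(col("city") == "Caguas")

result.write.mode("overwrite").csv("result/scenario3/solution")
```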
Scenario 4
- Explore the category records saved in the "categories" directory on HDFS
- Save the result files in CSV format
- Use lz4 compression to compress the output
- Save the data in the result/scenario4/solution directory on HDFS
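A sketch, assuming the "categories" directory holds delimited text that Spark can read as CSV:

```python
# Scenario 4 sketch: rewrite the categories data as lz4-compressed CSV.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scenario4").getOrCreate()

categories = spark.read.csv("categories")  # source format is an assumption

(categories.write
 .mode("overwrite")
 .option("compression", "lz4")  # lz4 is a supported CSV codec in Spark
 .csv("result/scenario4/solution"))
```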
Scenario 5
- Explore the product records saved in the "products_avro" directory on HDFS
- Include only the products with a price of more than 1000.0 in the output
- Remove records from the result if the product price is 1000.0 or less
- Save the results in the result/scenario5/solution folder
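Reading Avro requires the spark-avro package on the classpath (e.g. launched with --packages org.apache.spark:spark-avro_2.12:&lt;spark-version&gt;); the column names below are assumptions from the retail_db products schema:

```python
# Scenario 5 sketch: keep only products priced above 1000.0.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("scenario5").getOrCreate()

products = spark.read.format("avro").load("products_avro")

result = products.filter(col("product_price") > 1000.0)

# Output format is a choice; the prompt does not pin one down.
result.write.mode("overwrite").format("avro").save("result/scenario5/solution")
```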
Scenario 6
- Explore the product records saved in the "products_avro" directory on HDFS
- Only products with a price of more than 1000.0 should be in the output
- The pattern "Treadmill" should appear in the product name
- Save the data in the result/scenario6/solution directory on HDFS
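Building on the Scenario 5 read, with the extra name filter:

```python
# Scenario 6 sketch: price above 1000.0 AND "Treadmill" in the name.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("scenario6").getOrCreate()

products = spark.read.format("avro").load("products_avro")

result = products.filter(
    (col("product_price") > 1000.0)
    & col("product_name").contains("Treadmill")
)

result.write.mode("overwrite").format("avro").save("result/scenario6/solution")
```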
Scenario 7
- Explore the order records saved in the "orders parquet" directory on HDFS
- Output all PENDING orders placed in July 2013
- Only entries with the order status value "PENDING" should be included in the result
- The order date should be in the YYYY-MM-DD format
- Save the results in the result/scenario7/solution folder
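A sketch, assuming order_date is a timestamp (or a string Spark can format):

```python
# Scenario 7 sketch: PENDING orders from July 2013, date as YYYY-MM-DD.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, date_format

spark = SparkSession.builder.appName("scenario7").getOrCreate()

orders = spark.read.parquet("orders parquet")

result = (orders
          .filter(col("order_status") == "PENDING")
          .withColumn("order_date",
                      date_format(col("order_date"), "yyyy-MM-dd"))
          .filter(col("order_date").between("2013-07-01", "2013-07-31")))

result.write.mode("overwrite").parquet("result/scenario7/solution")
```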
Files
├── [3.9K] README.md
├── [ 140] data
│ └── [ 44] download.sh
├── [1.5M] etc
│ ├── [772K] lab2-prompt.pdf
│ └── [742K] lab2-solution.pdf
└── [267K] nbs
├── [235K] 01-sa-lab-1-solution.ipynb
└── [ 31K] 02-sa-lab-2-solution.ipynb
1.7M used in 3 directories, 6 files