Project: BedBricks
Databricks PySpark Ecommerce Data Processing Case Study
Welcome to the Apache Spark Programming with Databricks course. This course is part of the Apache Spark Developer learning pathway and was designed to help you prepare for the Apache Spark Developer Certification exam.
In this course, you will start by visualizing and applying Spark architecture concepts in example scenarios. Then, you will explore and preprocess datasets by applying a variety of DataFrame transformations and actions. After ingesting data from various file formats, you will apply these preprocessing steps and write the results to Delta tables. The case study then expands to stream from Delta tables in an analytics use case that demonstrates core Structured Streaming concepts. Lastly, you will explore the Spark UI and see how query optimization, partitioning, and caching affect performance.
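As a rough illustration of the batch portion of this flow, the sketch below reads raw JSON, applies a few DataFrame transformations, and writes the result to a Delta table. The path, column names, and table layout are hypothetical placeholders, not the course's actual datasets.

```python
# A minimal sketch of the batch flow, assuming a hypothetical events dataset.
# Paths and column names below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# Ingest raw JSON event data (placeholder path)
events_df = spark.read.json("/path/to/raw/events")

# Apply example DataFrame transformations: filter, derive a column, aggregate
revenue_df = (events_df
              .filter(F.col("event_name") == "purchase")
              .withColumn("event_date",
                          F.to_date((F.col("event_timestamp") / 1e6).cast("timestamp")))
              .groupBy("event_date")
              .agg(F.sum("revenue").alias("daily_revenue")))

# Write the processed result to a Delta table (placeholder path)
(revenue_df.write
 .format("delta")
 .mode("overwrite")
 .save("/path/to/delta/daily_revenue"))
```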
Learning objectives
- Identify core features of Spark and Databricks.
- Describe how DataFrames are created and evaluated in Spark.
- Apply the DataFrame transformation API to process and analyze data.
- Demonstrate how Spark is optimized and executed on a cluster.
- Apply Delta and Structured Streaming to process streaming data (see the sketch after this list).
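To illustrate the Delta and Structured Streaming objective, here is a minimal sketch that reads a Delta table as a stream, runs a streaming aggregation, and writes the running result to another Delta table. The paths, column name, and checkpoint location are placeholders, not the course's actual assets.

```python
# A minimal Structured Streaming sketch against a Delta source and sink.
# Paths, column names, and the checkpoint location are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a Delta table as a streaming source (placeholder path)
stream_df = (spark.readStream
             .format("delta")
             .load("/path/to/delta/events"))

# Example streaming aggregation: running count of events per event type
counts_df = (stream_df
             .groupBy("event_name")
             .count())

# Write the running counts to another Delta table, tracking progress
# with a checkpoint; availableNow triggers require Spark 3.3+
query = (counts_df.writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/path/to/checkpoints/event_counts")
         .trigger(availableNow=True)
         .start("/path/to/delta/event_counts"))

query.awaitTermination()
```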
Prerequisites
- Familiarity with basic SQL concepts (select, filter, group by, join, etc.)
- Beginner-level programming experience with Python or Scala (syntax, conditionals, loops, functions)