
Lab: PySpark Basics

In this lab, we will use PySpark to perform various activities in a Databricks environment.

  1. M&M Counts
  2. San Francisco Fire Calls
  3. SQL on US Flights Dataset
  4. Spark Data Sources
  5. Spark SQL & UDFs
  6. File Formats
  7. Delta Lake
  8. Taxi Trip Analysis
  9. MovieLens Data Analysis
  10. MapReduce Practice
  11. Data Wrangling with Spark
  12. Data Lake Schema on Read
  13. Python vs PySpark
  14. Weather Forecast Analysis
  15. Candy Sales Analysis with PySpark

Electricity Data Processing with PySpark

Read and follow these articles:

  1. https://ownyourdata.ai/wp/data-processing-with-spark-intro/
  2. https://ownyourdata.ai/wp/data-processing-with-spark-data-catalog/
  3. https://ownyourdata.ai/wp/data-processing-with-spark-schema-evolution/
  4. https://ownyourdata.ai/wp/data-analysis-electricity-consumption-data/

Code

https://github.com/acirtep/ginlong-data-processing-spark#ginlong-data-processing-spark

PySpark Databricks ETL

Objective: Build an ETL pipeline with Databricks PySpark and AWS S3

Architecture diagram (arch.drawio)

Notebook
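
The sketch below outlines how such an extract-transform-load flow might look in PySpark on Databricks. The bucket name, paths, and column names are assumptions for illustration only; in practice, S3 access on Databricks is typically granted through an instance profile or mounted storage rather than hard-coded credentials.

```python
# Minimal ETL sketch: extract sales data from S3, transform it, and load it
# back to S3 as Parquet. All paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-s3-demo").getOrCreate()

raw_path = "s3a://my-example-bucket/raw/orders/"          # hypothetical input path
curated_path = "s3a://my-example-bucket/curated/orders/"  # hypothetical output path

# Extract: read raw CSV files from S3
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(raw_path))

# Transform: basic cleansing and a derived column
clean_df = (raw_df
            .dropDuplicates()
            .withColumn("order_date", F.to_date("order_date"))
            .withColumn("total_amount", F.col("quantity") * F.col("unit_price")))

# Load: write the curated data back to S3 in Parquet, partitioned by date
(clean_df.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet(curated_path))
```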

Sales Orders PySpark

In this notebook, you'll use Spark in Databricks to explore data in files. One of the core ways to work with data in Spark is to load it into a DataFrame object, and then query, filter, and manipulate the DataFrame to explore the data it contains.

We first download the sales order data and move it to the DBFS file system. We then load the data into a PySpark DataFrame and apply an explicit schema, before using PySpark functions such as filter and groupBy, as shown in the sketch below.
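
A minimal sketch of these steps follows. The DBFS path and the schema are assumptions based on a typical sales-orders layout, not the exact files used in the notebook; the `spark` session is provided automatically in Databricks notebooks.

```python
# Load sales order CSVs from DBFS with an explicit schema, then filter and aggregate.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql import functions as F

# Hypothetical schema for the sales order files
order_schema = StructType([
    StructField("SalesOrderNumber", StringType(), True),
    StructField("CustomerName", StringType(), True),
    StructField("Item", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("UnitPrice", DoubleType(), True),
])

# Load the data with the schema applied instead of relying on inference
df = (spark.read
      .schema(order_schema)
      .option("header", "true")
      .csv("dbfs:/data/sales_orders/*.csv"))  # hypothetical DBFS location

# Filter and aggregate with DataFrame functions
high_volume = df.filter(F.col("Quantity") > 5)
revenue_by_item = (df.groupBy("Item")
                   .agg(F.sum(F.col("Quantity") * F.col("UnitPrice")).alias("Revenue"))
                   .orderBy(F.col("Revenue").desc()))
revenue_by_item.show(10)
```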

After this, we explore Spark SQL and plot charts using the matplotlib and seaborn libraries.
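
The Spark SQL and plotting steps might look roughly like this, assuming `df` is the DataFrame loaded above; the view name and columns are illustrative.

```python
# Query the DataFrame with Spark SQL and plot the aggregated result.
import matplotlib.pyplot as plt
import seaborn as sns

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("sales_orders")

item_revenue = spark.sql("""
    SELECT Item, SUM(Quantity * UnitPrice) AS Revenue
    FROM sales_orders
    GROUP BY Item
    ORDER BY Revenue DESC
    LIMIT 10
""")

# Convert the small aggregated result to pandas for plotting
pdf = item_revenue.toPandas()

plt.figure(figsize=(10, 5))
sns.barplot(data=pdf, x="Item", y="Revenue")
plt.xticks(rotation=45, ha="right")
plt.title("Top 10 items by revenue")
plt.tight_layout()
plt.show()
```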

Notebook

Databricks Delta Lake

Objective: Explore how to use Delta Lake in a Databricks Spark cluster

Delta Lake is an open source project to build a transactional data storage layer for Spark on top of a data lake. Delta Lake adds support for relational semantics for both batch and streaming data operations, and enables the creation of a Lakehouse architecture in which Apache Spark can be used to process and query data in tables that are based on underlying files in the data lake.

This lab will take approximately 40 minutes to complete.

In this lab, we explore various features of Delta Lake (a lakehouse) that we generally do not get in a plain data lake, such as ACID transactions, in-place updates, and time travel. A short sketch of a few of these features follows.
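
The sketch below uses a hypothetical table path and sample rows for illustration; the `DeltaTable` API comes from the delta-spark package, which is available on Databricks runtimes, and `spark` is the session provided by the cluster.

```python
# Delta Lake features that raw Parquet files in a data lake do not offer:
# transactional writes, in-place updates, and time travel.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

delta_path = "dbfs:/delta/products"  # hypothetical table location

# Write a DataFrame as a Delta table (transactional, versioned storage)
products = spark.createDataFrame(
    [(1, "Mountain Bike", 595.0), (2, "Road Bike", 795.0)],
    ["ProductID", "ProductName", "ListPrice"])
products.write.format("delta").mode("overwrite").save(delta_path)

# Update rows in place
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.update(
    condition=F.col("ProductID") == 1,
    set={"ListPrice": F.col("ListPrice") * 0.9})

# Time travel: read an earlier version of the table
v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
v0.show()

# Inspect the transaction history
delta_table.history().select("version", "operation").show()
```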

Notebook