
Lab: PySpark Basics

In this lab, we will use PySpark to perform various activities in a Databricks environment.

  1. M&M Counts
  2. San Francisco Fire Calls
  3. SQL on US Flights Dataset
  4. Spark Data Sources
  5. Spark SQL & UDFs
  6. File Formats
  7. Delta Lake
  8. Taxi Trip Analysis
  9. MovieLens Data Analysis
  10. MapReduce Practice
  11. Data Wrangling with Spark
  12. Data Lake Schema on Read
  13. Python vs PySpark
  14. Weather Forecast Analysis
  15. Candy Sales Analysis with PySpark

Electricity Data Processing with PySpark

Read and follow these articles:

  1. https://ownyourdata.ai/wp/data-processing-with-spark-intro/
  2. https://ownyourdata.ai/wp/data-processing-with-spark-data-catalog/
  3. https://ownyourdata.ai/wp/data-processing-with-spark-schema-evolution/
  4. https://ownyourdata.ai/wp/data-analysis-electricity-consumption-data/

Code

https://github.com/acirtep/ginlong-data-processing-spark#ginlong-data-processing-spark

PySpark Databricks ETL

Objective: Build an ETL pipeline with Databricks PySpark and AWS S3

Architecture diagram (arch.drawio)

Notebook
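
The sketch below outlines how such an extract-transform-load flow might look in PySpark on Databricks. The bucket name, paths, and column names are assumptions for illustration only; in practice, S3 access on Databricks is typically granted through an instance profile or mounted storage rather than hard-coded credentials.

```python
# Minimal ETL sketch: extract sales data from S3, transform it, and load it
# back to S3 as Parquet. All paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-s3-demo").getOrCreate()

raw_path = "s3a://my-example-bucket/raw/orders/"          # hypothetical input path
curated_path = "s3a://my-example-bucket/curated/orders/"  # hypothetical output path

# Extract: read raw CSV files from S3
raw_df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(raw_path))

# Transform: basic cleansing and a derived column
clean_df = (raw_df
            .dropDuplicates()
            .withColumn("order_date", F.to_date("order_date"))
            .withColumn("total_amount", F.col("quantity") * F.col("unit_price")))

# Load: write the curated data back to S3 in Parquet, partitioned by date
(clean_df.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet(curated_path))
```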

Sales Orders PySpark

In this notebook, you'll use Spark in Databricks to explore data in files. One of the core ways to work with data in Spark is to load it into a DataFrame object, and then query, filter, and manipulate the DataFrame to explore the data it contains.

We first download the sales order data and move it to the DBFS file system. We then load the data into a PySpark DataFrame and apply an explicit schema, before using PySpark functions such as filter and groupBy, as shown in the sketch below.
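
A minimal sketch of these steps follows. The DBFS path and the schema are assumptions based on a typical sales-orders layout, not the exact files used in the notebook; the `spark` session is provided automatically in Databricks notebooks.

```python
# Load sales order CSVs from DBFS with an explicit schema, then filter and aggregate.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql import functions as F

# Hypothetical schema for the sales order files
order_schema = StructType([
    StructField("SalesOrderNumber", StringType(), True),
    StructField("CustomerName", StringType(), True),
    StructField("Item", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("UnitPrice", DoubleType(), True),
])

# Load the data with the schema applied instead of relying on inference
df = (spark.read
      .schema(order_schema)
      .option("header", "true")
      .csv("dbfs:/data/sales_orders/*.csv"))  # hypothetical DBFS location

# Filter and aggregate with DataFrame functions
high_volume = df.filter(F.col("Quantity") > 5)
revenue_by_item = (df.groupBy("Item")
                   .agg(F.sum(F.col("Quantity") * F.col("UnitPrice")).alias("Revenue"))
                   .orderBy(F.col("Revenue").desc()))
revenue_by_item.show(10)
```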

After this, we explore Spark SQL and plot charts using the matplotlib and seaborn libraries.
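
The Spark SQL and plotting steps might look roughly like this, assuming `df` is the DataFrame loaded above; the view name and columns are illustrative.

```python
# Query the DataFrame with Spark SQL and plot the aggregated result.
import matplotlib.pyplot as plt
import seaborn as sns

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("sales_orders")

item_revenue = spark.sql("""
    SELECT Item, SUM(Quantity * UnitPrice) AS Revenue
    FROM sales_orders
    GROUP BY Item
    ORDER BY Revenue DESC
    LIMIT 10
""")

# Convert the small aggregated result to pandas for plotting
pdf = item_revenue.toPandas()

plt.figure(figsize=(10, 5))
sns.barplot(data=pdf, x="Item", y="Revenue")
plt.xticks(rotation=45, ha="right")
plt.title("Top 10 items by revenue")
plt.tight_layout()
plt.show()
```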

Notebook

Databricks Delta Lake

Objective: Explore how to use Delta Lake in a Databricks Spark cluster

Delta Lake is an open source project to build a transactional data storage layer for Spark on top of a data lake. Delta Lake adds support for relational semantics for both batch and streaming data operations, and enables the creation of a Lakehouse architecture in which Apache Spark can be used to process and query data in tables that are based on underlying files in the data lake.

This lab will take approximately 40 minutes to complete.

In this lab, we explore various features of Delta Lake (a lakehouse) that we generally do not get in a plain data lake, such as ACID transactions, in-place updates, and time travel. A short sketch of a few of these features follows.
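
The sketch below uses a hypothetical table path and sample rows for illustration; the `DeltaTable` API comes from the delta-spark package, which is available on Databricks runtimes, and `spark` is the session provided by the cluster.

```python
# Delta Lake features that raw Parquet files in a data lake do not offer:
# transactional writes, in-place updates, and time travel.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

delta_path = "dbfs:/delta/products"  # hypothetical table location

# Write a DataFrame as a Delta table (transactional, versioned storage)
products = spark.createDataFrame(
    [(1, "Mountain Bike", 595.0), (2, "Road Bike", 795.0)],
    ["ProductID", "ProductName", "ListPrice"])
products.write.format("delta").mode("overwrite").save(delta_path)

# Update rows in place
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.update(
    condition=F.col("ProductID") == 1,
    set={"ListPrice": F.col("ListPrice") * 0.9})

# Time travel: read an earlier version of the table
v0 = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)
v0.show()

# Inspect the transaction history
delta_table.history().select("version", "operation").show()
```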

Notebook