Lab: PySpark Basics
In this lab, we will use PySpark to perform the following activities in the Databricks environment:
- M&M Counts
- San Francisco Fire Calls
- SQL on US Flights Dataset
- Spark Data Sources
- Spark SQL & UDFs
- File Formats
- Delta Lake
- Taxi Trip Analysis
- MovieLens Data Analysis
- MapReduce Practice
- Data Wrangling with Spark
- Data Lake Schema on Read
- Python vs PySpark
- Weather Forecast Analysis
- Candy Sales Analysis with PySpark
- Electricity Data Processing with PySpark
Read and follow these articles:
- https://ownyourdata.ai/wp/data-processing-with-spark-intro/
- https://ownyourdata.ai/wp/data-processing-with-spark-data-catalog/
- https://ownyourdata.ai/wp/data-processing-with-spark-schema-evolution/
- https://ownyourdata.ai/wp/data-analysis-electricity-consumption-data/
Code
https://github.com/acirtep/ginlong-data-processing-spark#ginlong-data-processing-spark
PySpark Databricks ETL
Objective: Build an ETL pipeline with Databricks PySpark and AWS S3
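Below is a minimal sketch of what such a pipeline can look like, assuming a hypothetical S3 bucket (my-bucket) and illustrative column names; none of these are part of the lab materials.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-etl-sketch").getOrCreate()

# Extract: read raw CSV files from S3 (credentials are assumed to be configured
# on the cluster, e.g. via an instance profile; bucket and prefix are placeholders).
raw_df = spark.read.option("header", True).csv("s3a://my-bucket/raw/orders/")

# Transform: drop rows without an order id and derive a proper date column
# (order_id and order_date are illustrative column names).
clean_df = (
    raw_df
    .dropna(subset=["order_id"])
    .withColumn("order_date", F.to_date("order_date"))
)

# Load: write the result back to S3 as Parquet, partitioned by date.
clean_df.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3a://my-bucket/curated/orders/"
)
```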
Sales Orders PySpark
In this notebook, you'll use Spark in Databricks to explore data in files. One of the core ways to work with data in Spark is to load it into a DataFrame object, and then query, filter, and manipulate the DataFrame to explore the data it contains.
We first download the sales order data and move it to the DBFS file system. We then load the data into a PySpark DataFrame, apply a schema, and use PySpark functions such as filter and groupBy.
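A minimal sketch of these steps is shown below, assuming a hypothetical DBFS path and illustrative column names; the actual sales order files may differ. In a Databricks notebook, `spark` is the SparkSession provided automatically.

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Explicit schema (column names are illustrative, not the lab's exact fields).
order_schema = StructType([
    StructField("SalesOrderNumber", StringType(), True),
    StructField("Item", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("UnitPrice", DoubleType(), True),
])

# Load the CSV files with the explicit schema instead of inferring one.
df = (
    spark.read
    .schema(order_schema)
    .option("header", True)
    .csv("dbfs:/data/sales_orders/")
)

# PySpark transformations mentioned above: filter and groupBy.
large_orders = df.filter(df.Quantity > 10)
sales_by_item = df.groupBy("Item").sum("Quantity")
sales_by_item.show(5)
```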
After this, we explore Spark SQL and plot charts using the matplotlib and seaborn libraries.
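The Spark SQL and plotting step could look like the sketch below; the temporary view and column names follow the illustrative schema above rather than the lab's exact data.

```python
import matplotlib.pyplot as plt

# Expose the DataFrame to Spark SQL as a temporary view.
df.createOrReplaceTempView("sales_orders")

# Aggregate with SQL and bring the small result set to the driver as pandas.
top_items = spark.sql("""
    SELECT Item, SUM(Quantity) AS TotalQuantity
    FROM sales_orders
    GROUP BY Item
    ORDER BY TotalQuantity DESC
    LIMIT 10
""").toPandas()

# Plot with matplotlib (seaborn can be layered on top in the same way).
plt.bar(top_items["Item"], top_items["TotalQuantity"])
plt.xticks(rotation=45, ha="right")
plt.ylabel("Total quantity")
plt.title("Top 10 items by quantity sold")
plt.show()
```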
Databricks Delta Lake
Objective: Explore how to use Delta Lake in a Databricks Spark cluster
Delta Lake is an open source project to build a transactional data storage layer for Spark on top of a data lake. Delta Lake adds support for relational semantics for both batch and streaming data operations, and enables the creation of a Lakehouse architecture in which Apache Spark can be used to process and query data in tables that are based on underlying files in the data lake.
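As a quick illustration, the sketch below writes a DataFrame as a Delta table, registers it in the catalog, and reads it back for both batch and streaming queries; the table location and names are assumptions for this example, not the lab's exact values.

```python
# `df` is any Spark DataFrame (e.g. the sales orders loaded earlier); the DBFS
# location and table name below are placeholders.
delta_path = "dbfs:/delta/sales_orders"

# Write the DataFrame as a Delta table: Parquet files plus a transaction log.
df.write.format("delta").mode("overwrite").save(delta_path)

# Register it as a catalog table so it can be queried with Spark SQL.
spark.sql(f"CREATE TABLE IF NOT EXISTS sales_orders_delta USING DELTA LOCATION '{delta_path}'")

# Batch query against the Delta table.
spark.sql("SELECT COUNT(*) AS order_rows FROM sales_orders_delta").show()

# The same table can also be used as a streaming source.
stream_df = spark.readStream.format("delta").load(delta_path)
```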
This lab will take approximately 40 minutes to complete.
In this lab, we explore various features of Delta Lake (a lakehouse) that are generally not available in a plain data lake.
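For example, the sketch below shows two such features, in-place updates and time travel; it reuses the hypothetical table location and columns from the earlier sketches, and the filter value is illustrative.

```python
from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "dbfs:/delta/sales_orders")

# ACID update executed directly against the stored table.
delta_table.update(
    condition="Item = 'Road Bike'",
    set={"UnitPrice": "UnitPrice * 0.9"},
)

# Time travel: read an earlier version of the table.
old_df = spark.read.format("delta").option("versionAsOf", 0).load("dbfs:/delta/sales_orders")

# The transaction history that makes these features possible.
delta_table.history().select("version", "operation", "timestamp").show()
```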