Lab: Databricks Delta Lake
Objective: Creation of an elementary Data Lakehouse using Databricks and Delta Lake technology
This lab builds an elementary Data Lakehouse, using Databricks and Delta Lake technology, for a database containing sales data of a fictitious company called "Northwind Traders", which imports and exports specialty foods around the world.
With the Data Lakehouse created, a business analysis is conducted to answer the following questions:
- What are the 5 least sold products?
- What are the top 5 customers with the highest number of purchases?
- What are the top 5 customers with the highest purchase value?
- Which employee made the most sales last year?
Northwind Database:
Architecture
Bronze layer:
- Python.
- COPY INTO (SQL), as sketched below.
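A minimal sketch of the bronze ingestion step, assuming a `bronze_orders` table name and a `/FileStore/tables/northwind` DBFS path; both are illustrative, not the lab's exact values.

```python
# Create an empty Delta table; on recent Databricks runtimes its schema
# can be inferred by the first COPY INTO (table name is an assumption).
spark.sql("CREATE TABLE IF NOT EXISTS bronze_orders")

# COPY INTO is idempotent: files already loaded are skipped on re-runs.
spark.sql("""
    COPY INTO bronze_orders
    FROM '/FileStore/tables/northwind/orders.csv'  -- assumed DBFS path
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```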
Silver layer:
- MERGE INTO (SQL).
- Data types wrangling (SQL).
- Null values wrangling (SQL); see the sketch after this list.
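A hedged sketch of the silver step, combining MERGE INTO with type casting and null handling. Table and column names follow the classic Northwind schema but are assumptions about this project's exact naming.

```python
# Curated silver table with explicit types (names are assumptions).
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver_orders (
        order_id    INT,
        customer_id STRING,
        employee_id INT,
        order_date  DATE,
        freight     DOUBLE
    ) USING DELTA
""")

# Upsert from bronze: cast types and default a null freight to 0.0.
spark.sql("""
    MERGE INTO silver_orders AS t
    USING (
        SELECT CAST(order_id AS INT)                  AS order_id,
               customer_id,
               CAST(employee_id AS INT)               AS employee_id,
               TO_DATE(order_date)                    AS order_date,
               COALESCE(CAST(freight AS DOUBLE), 0.0) AS freight
        FROM bronze_orders
    ) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```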
Gold layer and Analysis:
- PySpark (see the example query after this list).
- Koalas.
- Spark Pandas.
- Spark/Hive SQL.
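For the analysis itself, here is one illustrative PySpark query answering "What are the top 5 customers with the highest purchase value?"; the silver table and column names are assumptions based on the Northwind schema.

```python
from pyspark.sql import functions as F

# line_total = unit_price * quantity * (1 - discount), per order line
top_customers = (
    spark.table("silver_orders")
    .join(spark.table("silver_order_details"), "order_id")
    .withColumn(
        "line_total",
        F.col("unit_price") * F.col("quantity") * (1 - F.col("discount")),
    )
    .groupBy("customer_id")
    .agg(F.round(F.sum("line_total"), 2).alias("total_purchases"))
    .orderBy(F.col("total_purchases").desc())
    .limit(5)
)
top_customers.show()
```

The same aggregation could equally be written with the pandas API on Spark (the successor to Koalas) or as a plain `spark.sql()` query against the metastore.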
Reproducibility
To reproduce the project, follow the steps below:
- Log into your community version of Databricks.
- Import the notebook into your workspace.
- Import the `.csv` files from the `/data` folder into your Databricks DBFS.
- Start a Cluster following the `requirements.txt` file guidance.
- Customize the DBFS root path of `01-sa-main.ipynb` in the utilities section according to your specifics (a sketch of such a cell follows this list).
- Run all cells in their sequence.
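The customization step above might look like the following notebook cell; the variable name and path are assumptions for illustration, not the notebook's actual contents.

```python
# Hypothetical utilities cell: set the DBFS root in one place.
# Point the path at wherever you uploaded the .csv files.
DBFS_ROOT = "/FileStore/tables/northwind"  # assumption: your upload location

# Sanity check inside a Databricks notebook: list the uploaded files
# before running the pipeline.
display(dbutils.fs.ls(DBFS_ROOT))
```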
Cluster
Learnings
- The project uses the community version of Databricks, which imposes restrictions such as the unavailability of Delta Live Tables, cloud partner integrations, GitHub integration, and job scheduling; using these tools would greatly enrich the project.
- Only structured data was used in the project, but the workspace and project structure (a Data Lakehouse) remain scalable to semi-structured and unstructured data; it just requires use-case-specific wrangling.
Roadmap
- Transform the Database schema to a Star Schema in the Gold layer.
- Add semi-structured and unstructured data into the Data Lakehouse.