Lab: Databricks Delta Lake
Objective: Creation of an elementary Data Lakehouse using Databricks and Delta Lake technology
This lab builds an elementary Data Lakehouse, using Databricks and Delta Lake technology, for a database containing sales data of a fictitious company called "Northwind Traders", which imports and exports specialty foods around the world.
With the Data Lakehouse created, a business analysis is conducted to answer the following questions:
- What are the 5 least sold products?
- What are the top 5 customers with the highest number of purchases?
- What are the top 5 customers with the highest purchase value?
- Which employee made the most sales last year?
Northwind Database:
Architecture
Bronze layer:
- Python.
- COPY INTO (SQL), as sketched below.
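A minimal sketch of the bronze ingestion step, assuming a `bronze_orders` table name and a `/FileStore/tables/northwind` DBFS path; both are illustrative, not the lab's exact values.

```python
# Create an empty Delta table; on recent Databricks runtimes its schema
# can be inferred by the first COPY INTO (table name is an assumption).
spark.sql("CREATE TABLE IF NOT EXISTS bronze_orders")

# COPY INTO is idempotent: files already loaded are skipped on re-runs.
spark.sql("""
    COPY INTO bronze_orders
    FROM '/FileStore/tables/northwind/orders.csv'  -- assumed DBFS path
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
    COPY_OPTIONS ('mergeSchema' = 'true')
""")
```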
Silver layer:
- MERGE INTO (SQL).
- Data types wrangling (SQL).
- Null values wrangling (SQL); see the sketch after this list.
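A hedged sketch of the silver step, combining MERGE INTO with type casting and null handling. Table and column names follow the classic Northwind schema but are assumptions about this project's exact naming.

```python
# Curated silver table with explicit types (names are assumptions).
spark.sql("""
    CREATE TABLE IF NOT EXISTS silver_orders (
        order_id    INT,
        customer_id STRING,
        employee_id INT,
        order_date  DATE,
        freight     DOUBLE
    ) USING DELTA
""")

# Upsert from bronze: cast types and default a null freight to 0.0.
spark.sql("""
    MERGE INTO silver_orders AS t
    USING (
        SELECT CAST(order_id AS INT)                  AS order_id,
               customer_id,
               CAST(employee_id AS INT)               AS employee_id,
               TO_DATE(order_date)                    AS order_date,
               COALESCE(CAST(freight AS DOUBLE), 0.0) AS freight
        FROM bronze_orders
    ) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```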
Gold layer and Analysis:
- PySpark (see the example query after this list).
- Koalas.
- Spark Pandas.
- Spark/Hive SQL.
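For the analysis itself, here is one illustrative PySpark query answering "What are the top 5 customers with the highest purchase value?"; the silver table and column names are assumptions based on the Northwind schema.

```python
from pyspark.sql import functions as F

# line_total = unit_price * quantity * (1 - discount), per order line
top_customers = (
    spark.table("silver_orders")
    .join(spark.table("silver_order_details"), "order_id")
    .withColumn(
        "line_total",
        F.col("unit_price") * F.col("quantity") * (1 - F.col("discount")),
    )
    .groupBy("customer_id")
    .agg(F.round(F.sum("line_total"), 2).alias("total_purchases"))
    .orderBy(F.col("total_purchases").desc())
    .limit(5)
)
top_customers.show()
```

The same aggregation could equally be written with the pandas API on Spark (the successor to Koalas) or as a plain `spark.sql()` query against the metastore.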
Reproducibility
To reproduce the project, follow the steps below:
- Log into your community version of Databricks.
- Import the notebook into your workspace.
- Import the `.csv` files from the `/data` folder into your Databricks DBFS.
- Start a Cluster following the `requirements.txt` file guidance.
- Customize the DBFS root path of `01-sa-main.ipynb` in the utilities section according to your specifics (a sketch of such a cell follows this list).
- Run all cells in their sequence.
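The customization step above might look like the following notebook cell; the variable name and path are assumptions for illustration, not the notebook's actual contents.

```python
# Hypothetical utilities cell: set the DBFS root in one place.
# Point the path at wherever you uploaded the .csv files.
DBFS_ROOT = "/FileStore/tables/northwind"  # assumption: your upload location

# Sanity check inside a Databricks notebook: list the uploaded files
# before running the pipeline.
display(dbutils.fs.ls(DBFS_ROOT))
```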
Cluster
Learnings
- The project uses the community version of Databricks, which imposes restrictions such as the unavailability of Delta Live Tables, cloud partner integrations, GitHub integration, and job scheduling; using these tools would greatly enrich the project.
- Only structured data was used in the project, but the workspace and project structure (a Data Lakehouse) remain scalable to semi-structured and unstructured data; it just requires use-case-specific wrangling.
Roadmap
- Transform the Database schema to a Star Schema in the Gold layer.
- Add semi-structured and unstructured data into the Data Lakehouse.