SQL Data Modeling
What is data modeling? Data modeling is the process of representing database objects from a real-world or business perspective. Objects in warehouses can be datasets, tables, or views. Representing the objects as closely as possible to the real world is important because the end users of the data are human. Some of the most common end users are business analysts, data analysts, data scientists, BI users, and anyone else in a role that requires access to the data for business purposes.
The data model helps us design our database. When building a plane, you don't start with building the engine. You start by creating a blueprint or schematic. Creating a database is just the same: you start by modeling the data. A model is a representation of real data that gives us the characteristics, relations, and rules that apply to our data. It doesn't actually contain any data.
In the classical sense, a data model is simply metadata that describes the structure, content, and relationships that exist within a group of related data assets. Maintaining a data model has long been a standard practice for OLTP workloads built on SQL. Data models are typically maintained by data engineers and data architects; they help manage the evolution of assets, remove unnecessary duplication, and enforce conventions to maintain an intuitive and consistent layout. A key additional benefit is to inform consumers (e.g., data analysts) about assets and how best to use them. For this reason, maintaining data models is also a common practice for managing the evolution of large SQL data warehouses. Without data models, end users would find it challenging to navigate a library of hundreds (if not thousands) of data assets and correctly leverage them.
Data modeling is the process of designing how data will be represented in data stores. Many data modeling techniques were originally designed for databases and warehouses. Since Serving layers are usually built with relational data stores such as data warehouses, some of these techniques can be applied to Serving layer design too. The Serving layer could also be built using other storage technologies, such as document databases or key-value stores, depending on customer requirements.
Unlike data lakes, in databases or data warehouses we don't have the luxury of storing huge volumes of data in whatever format we like. Databases and data warehouses can perform queries exceptionally fast, provided the data is stored in predetermined formats and is limited in size. Hence, while designing the Serving layer, we need to identify which data needs to be stored, in which format, and how much of it. Specifically, we need to decide which SQL tables are required, what the relationships between these tables will be, and which restrictions need to be imposed on them.
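As a rough sketch (the table and column names here are made up for illustration), those decisions might be expressed in SQL like this:

```sql
-- Hypothetical serving-layer tables: the design decides which tables exist,
-- how they relate (foreign key), and which restrictions apply (types, NOT NULL, CHECK).
CREATE TABLE customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE sales_order (
    order_id     INT PRIMARY KEY,
    customer_id  INT NOT NULL REFERENCES customer (customer_id),  -- relationship
    order_date   DATE NOT NULL,
    order_amount DECIMAL(12, 2) CHECK (order_amount >= 0)         -- restriction
);
```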
What's the main difference between designing a data model in a data warehouse and designing an OLTP (transactional) database for applications? In the application database, the end users are applications, not humans. In the data warehouse, you serve humans. So, as data engineers, we need to think from their perspective.
Since modern data engineers see the old principles as less relevant and, at the same time, the demand to load all kinds of data into data warehouses is growing tremendously, they tend to skip the data modeling steps, and that's a mistake. Skipping the data warehouse principles means ignoring the fact that we need to maintain data consistency, which can lead to bad results. Take the following common examples:
- Data is duplicated in many locations.
- Some values are not consistent across different users and reports.
- The cost of processing is highly inefficient.
- The end user doesn't understand how to use the data warehouse objects.
- The business doesn't trust the data.
In my opinion, a modern data warehouse is still a data warehouse. The objective of a data warehouse is to build a centralized and trustworthy data storage that can be used for business. To meet this objective, data engineers need to spend more time on proper data modeling in data warehouses than they would with the data lake concept.
Let's look at this example. We want to represent people in a table object. Which of the following two tables, A or B, do you think better represents people? Here is People table A:
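The rows below are purely illustrative, but a People table A along these lines contains only attributes that naturally describe a person:

```sql
CREATE TABLE people_a (
    name       VARCHAR(50),
    age        INT,
    hair_color VARCHAR(20),
    gender     VARCHAR(10)
);

-- Illustrative rows.
INSERT INTO people_a (name, age, hair_color, gender) VALUES
    ('Alice', 34, 'Black', 'Female'),
    ('Dave',  41, 'Brown', 'Male');
```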
Try and compare this with People table B:
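Again with purely illustrative rows, a People table B along the lines discussed below mixes in attributes and values that are harder to interpret:

```sql
CREATE TABLE people_b (
    name        VARCHAR(50),
    gender      VARCHAR(10),   -- mixes 'Female'/'Woman' and 'Male'/'Man'
    postal_code VARCHAR(10),   -- NULL for Barb: no house? unknown? a bug?
    wealthy     VARCHAR(3)     -- 'yes' / 'no', with no indication of how it is decided
);

-- Illustrative rows.
INSERT INTO people_b (name, gender, postal_code, wealthy) VALUES
    ('Alice', 'Female', '10115', 'yes'),
    ('Carol', 'Woman',  '20095', 'no'),
    ('Dave',  'Man',    '80331', 'no'),
    ('Barb',  'Male',   NULL,    'yes');
```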
If we look back at the objective, we want to represent people, and I think we can all agree that People table A does this better because it represents people clearly. It's very natural to imagine people having names, ages, hair colors, and genders. A good data model is self-explanatory, like People table A: even without anyone explaining how to read the table, you already know what it is about.
Now, why is table B bad? There are a few reasons:
- The list of attributes doesn't really represent people; for example, postal code. Even though we know people may have houses, and houses have postal codes, it's difficult to imagine people as entities having a postal code as part of them.
- What is NULL in relation to postal code? Does that mean Barb doesn't have a house? Or maybe he forgot his postal code? Or perhaps this is just a bug. The table can't really tell you that.
- Still on the subject of the postal code, what if one of the people here has more than one house? Should we add new records to this table? It would become complicated.
- Gender values are inconsistent: Female and Woman, and Male and Man, may mean the same thing, but they may not.
- The wealthy column has yes and no values. What does wealthy mean here, and how can the value be justified?
It is not that the information is wrong - we often need to store such information. Now the question is, can we store the same information, but with a better data model?
Let's take another example. Perhaps this better represents the real world for the required information:
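Here is a minimal sketch of such an alternative; the column names and the salary threshold in the wealthy logic are illustrative assumptions:

```sql
-- People: only people-related attributes, plus salary as the stored measure.
CREATE TABLE people (
    person_id INT PRIMARY KEY,
    name      VARCHAR(50),
    gender    VARCHAR(10),
    salary    DECIMAL(12, 2)
);

-- Address: postal codes belong to addresses, not to people.
-- Someone without an address simply has no row here (no NULL needed),
-- and someone with two houses simply has two rows.
CREATE TABLE address (
    address_id  INT PRIMARY KEY,
    person_id   INT REFERENCES people (person_id),
    postal_code VARCHAR(10)
);

-- "Wealthy" becomes query-time logic on top of salary (threshold is made up).
SELECT name,
       CASE WHEN salary >= 100000 THEN 'yes' ELSE 'no' END AS wealthy
FROM people;
```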
Maybe this Alternative C is better. We still have the people table, but only with people-related attributes, for example, gender. Then, postal code is part of the Address table. It may have other address information, but in this example, we will keep it simple with just the postal code. And if someone such as Barb doesn't have a postal code, then we don't need to put the NULL record there. And lastly, we may assume that wealth is driven by salary (just for example purposes), so we had better just store the salary information, and later use queries to put the wealthy logic on top of the salary information. This is more natural and closer to the real world.
What could happen with a bad data model? It is often the case that the end user will have too much dependency on the data engineering team. Unable to understand the table shapes, end users need to keep posing the following questions to the data engineering team, for example:
- What does NULL mean?
- How should I join the tables?
- Why are there duplications?
- Why do some records have the attribute X while others don't?
In the worst-case scenario, the end user doesn't trust the data in the data warehouse, so the goal of using the data for a business impact has failed.
In the best-case scenario, a perfect data model is one where the end user doesn't need to ask the data engineering team any questions. They can answer any business question just by looking at the table structures, and they trust the data 100%. And that's our goal as data engineers.
But, at the end of the day, it's very difficult to design a perfect data model because there are other aspects that a data engineer needs to think about when designing a data model.
Other purposes of the data model
Besides representing data in a real-world scenario, there are three other reasons why we require a data model in a data warehouse:
- Data consistency
- Query performance
- Storage efficiency
Let's start with the last point: storage efficiency. How can a data model improve storage efficiency?
Take a look at this example again. Which one is more storage-efficient? Perhaps a table with name and gender, where gender is written in a string data type as Man or Woman:
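As a sketch, option A might look like this (rows made up):

```sql
-- Option A (denormalized): the full gender string is repeated on every row.
CREATE TABLE people_option_a (
    name   VARCHAR(50),
    gender VARCHAR(10)
);

INSERT INTO people_option_a (name, gender) VALUES
    ('Alice', 'Woman'),
    ('Dave',  'Man'),
    ('Erin',  'Woman');
```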
Or perhaps option B? We create a gender reference table, and the main table will only store one character, gender_id, as a reference. The user can later join both tables for the same result as option A.
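A sketch of option B, again with made-up names and codes:

```sql
-- Option B (normalized): gender strings are stored once in a reference table;
-- the main table keeps only a one-character gender_id.
CREATE TABLE gender (
    gender_id CHAR(1) PRIMARY KEY,   -- e.g. 'M', 'W'
    gender    VARCHAR(10)
);

CREATE TABLE people_option_b (
    name      VARCHAR(50),
    gender_id CHAR(1) REFERENCES gender (gender_id)
);

-- Joining both tables gives the same result as option A.
SELECT p.name, g.gender
FROM people_option_b AS p
JOIN gender AS g ON g.gender_id = p.gender_id;
```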
Option B is definitely better, as we don't need to store the full gender strings over and over. It looks like a small difference, but the same technique applies to all categorical string attributes, and that can have a significant impact.
Using the same technique, we can also improve data consistency. For example, we can use the gender reference table for other tables, as in the following example user table:
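A sketch of such a user table, reusing the gender reference table from option B above (columns are illustrative):

```sql
-- The user table references the same gender table as people_option_b,
-- so both tables share one consistent set of gender values.
CREATE TABLE users (
    user_id   INT PRIMARY KEY,
    username  VARCHAR(50),
    gender_id CHAR(1) REFERENCES gender (gender_id)
);
```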
With that, we avoid data inconsistency such as the People table using Female-Male while the User table uses Man-Woman. This is a very common practice, and the common terms for it in the data warehouse world are normalized and denormalized.
In the storage efficiency example above, option A is a denormalized table, while option B is a normalized design.
Last but not least, one reason why we need a data model is query performance. In a big data system where data is stored in distributed storage, there is a general rule of thumb about which operation is the most resource-intensive: JOIN. JOIN is in general a very expensive operation, especially when we need to join multiple large-volume tables. And if you look back at the normalized and denormalized approaches, you will realize that while normalized data is good for storage efficiency and data consistency, it is bad for performance because it requires a lot of JOIN operations.
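To make the trade-off concrete, here is a sketch using the illustrative option A and option B tables from above: the denormalized table answers a question in a single scan, while the normalized design needs a join.

```sql
-- Denormalized: one table scan, no join.
SELECT gender, COUNT(*) AS people_count
FROM people_option_a
GROUP BY gender;

-- Normalized: the same question now requires a JOIN, which is cheap here
-- but becomes expensive when large tables are joined in a distributed engine.
SELECT g.gender, COUNT(*) AS people_count
FROM people_option_b AS p
JOIN gender AS g ON g.gender_id = p.gender_id
GROUP BY g.gender;
```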
At the end of the day, we need to find a balance between all these factors. There is no right or wrong answer for a data model in a complex data warehouse, which may involve thousands to millions of tables, so everyone will have a different approach to designing the data model. However, there are some theories that we can use as references.
Data Modeling Best Practices
Database design is a critical aspect of any software system. It determines the overall performance, scalability, and security of a system. Here are 9 best practices for designing a database that ensure the longevity and success of your system:
- Normalization: Normalization is the process of organizing data in a relational database so that data redundancies are minimized and data anomalies are prevented. Normalization helps ensure data consistency and eliminates data duplications.
- Proper indexing: Indexing is the process of adding an index to a database column to speed up the search process. Proper indexing helps improve the performance of queries and makes data retrieval faster.
- Use appropriate data types: It is important to use appropriate data types for each column in your database. This helps ensure data accuracy and consistency.
- Avoid over-normalization: While normalization is important, over-normalization can lead to complicated and slow-performing queries. It is important to strike a balance between normalization and efficiency.
- Use appropriate keys: Keys are used to identify a unique record in a database. It is important to choose the appropriate key for each table, such as primary keys, foreign keys, or composite keys (several of these practices are illustrated in the sketch after this list).
- Document your design: Documenting your database design helps ensure that it is clear, consistent, and easy to understand. This documentation also makes it easier to maintain and modify the database in the future.
- Test your design: It is important to test your database design thoroughly before deploying it to production. This helps catch any potential problems and ensures that the design meets all requirements.
- Consider security: Security is a critical aspect of database design. It is important to consider security from the beginning of the design process to ensure that the database is protected from unauthorized access.
- Plan for scalability: As your system grows, so will your database. It is important to plan for scalability from the beginning so that the database can easily handle increasing amounts of data.
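As a small, hypothetical sketch tying a few of these practices together (appropriate data types; primary, foreign, and composite keys; and an index on a frequently filtered column):

```sql
-- Appropriate data types and keys.
CREATE TABLE product (
    product_id INT PRIMARY KEY,                  -- primary key
    sku        VARCHAR(40) NOT NULL UNIQUE,
    unit_price DECIMAL(10, 2) NOT NULL           -- money as DECIMAL, not FLOAT
);

CREATE TABLE order_item (
    order_id   INT NOT NULL,
    product_id INT NOT NULL REFERENCES product (product_id),  -- foreign key
    quantity   INT NOT NULL CHECK (quantity > 0),
    PRIMARY KEY (order_id, product_id)           -- composite key
);

-- Proper indexing: speed up a common lookup pattern on product_id.
CREATE INDEX idx_order_item_product ON order_item (product_id);
```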
Following these 9 best practices for database design will help ensure that your system has a strong foundation for success. Proper planning, documentation, testing, and security considerations will help ensure that your database is efficient, secure, and scalable.
Case Studies
Mobile Phone Billing System Relational Model
Whatnot Dimensional Data Modeling
Data model types (a short sketch of how they fit together follows this list):
- Facts are (generally) transaction-like records that should not change after the transaction is completed. They tend to be high velocity (there are a lot of them) and have measurements associated with them, such as price or duration.
- Dimensions describe the objects in a fact table. For example, many orders might come from the same livestream, so the livestream would be considered a “dimension” of the order.
- Bridge Tables — map two entities together when they have a one-to-many or many-to-many relationship. For example, there might be a category dimension and a livestream watches fact, and a livestream can have many categories.
- Map Tables — Different from bridge tables, map tables can be thought of as upstream of facts and dimensions. These are intermediary tables that map an ID to a categorization.
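Here is a hedged sketch of how these types might fit together for the livestream example; the table and column names are illustrative, not Whatnot's actual schema:

```sql
-- Dimension: describes a livestream.
CREATE TABLE dim_livestream (
    livestream_key VARCHAR(32) PRIMARY KEY,
    livestream_id  INT,
    title          VARCHAR(200)
);

-- Dimension: describes a category.
CREATE TABLE dim_category (
    category_key VARCHAR(32) PRIMARY KEY,
    category     VARCHAR(100)
);

-- Fact: high-velocity, transaction-like order records with measurements.
CREATE TABLE fct_order (
    order_key      VARCHAR(32) PRIMARY KEY,
    order_id       INT,
    livestream_key VARCHAR(32) REFERENCES dim_livestream (livestream_key),
    price          DECIMAL(12, 2),
    duration_sec   INT
);

-- Bridge: resolves the many-to-many between livestreams and categories.
CREATE TABLE brg_livestream_category (
    livestream_key VARCHAR(32) REFERENCES dim_livestream (livestream_key),
    category_key   VARCHAR(32) REFERENCES dim_category (category_key)
);

-- (A map table would sit upstream of these, mapping raw IDs to a categorization
--  before the facts and dimensions are built.)
```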
Slowly Changing Dimensions:
SCDs are dimensions with tracked changes using valid_from and valid_to columns. For example, a user's address can change over time, and each time a change occurs, a new record is created (Type 2 SCD).
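A minimal sketch of a Type 2 SCD for user addresses, assuming the valid_from / valid_to tracking described above:

```sql
-- Type 2 SCD: each address change creates a new row; the previous row is
-- closed out by setting its valid_to.
CREATE TABLE dim_user_address (
    user_id    INT,
    address    VARCHAR(200),
    valid_from DATE,
    valid_to   DATE            -- NULL (or a far-future date) for the current row
);

-- The currently valid address for each user.
SELECT user_id, address
FROM dim_user_address
WHERE valid_to IS NULL;
```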
Natural Keys vs Surrogate Keys:
Earlier, data users relied on a model's natural keys to join, which is intuitive and simple, but sometimes confusing — string or int? — and, with the introduction of SCDs, could lead to exploding joins. Whatnot decided all data models in their core schema would have a varchar surrogate key, generated for each table using dbt_utils.generate_surrogate_key([entity_id, valid_from]). By always using a hashed surrogate key, they avoid potential integer-to-varchar join issues. The downside to surrogate keys is that they can be confusing for developers, who are used to joining on "id" columns rather than "keys".
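Roughly, a hashed varchar surrogate key amounts to the following plain SQL; the dbt_utils macro also handles casting and null-safety, so this is an approximation rather than its exact implementation, and dim_entity is a made-up table name:

```sql
-- Build a varchar surrogate key by hashing the natural key plus valid_from,
-- approximating dbt_utils.generate_surrogate_key([entity_id, valid_from]).
SELECT
    MD5(CAST(entity_id AS VARCHAR) || '-' || CAST(valid_from AS VARCHAR)) AS entity_key,
    entity_id,
    valid_from,
    valid_to
FROM dim_entity;
```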
Resources
- Data Modeling by Packt
- https://knowledgetree.notion.site/Data-Modeling-92b0646bc2674a23a6203d9309bf414f
- Database Schema Design Examples
- Data Modeling Layer & Concepts
- Kimball’s Dimensional Data Modeling
- Modeling Example: A Real-world Use Case
- Kimball Dimensional Modeling Techniques
- Designing a dimensional model for a cargo shipper
- 5 Database Design Schema Example: Critical Practices & Designs
- Kimball’s 4-Step Dimensional Design Process
- Introduction to Data Warehouse
- Data Warehouse Schemas
- Know Data Warehouse Essential Concepts
- Fundamental Data modeling and Data warehousing