OLTP vs OLAP
Transactional databases (OLTP)
Transactional databases are used by systems for basic operations: creating, reading, updating, and deleting. Transactional systems are considered the core of the informatization of business processes. With these basic operations, we can create entities such as customers, products, stores, and sales transactions, among others, to store important data. A transactional database is commonly known as online transaction processing (OLTP) considering that this type of database serves online transactional operations between the application and the database. For an organization, transactional databases usually have their data segmented into entities, which can be tables (or not), with or without a relationship between these entities to facilitate the correlation between this data. For example, an e-commerce database can be structured with a table called Shopping_Cart, which represents the products that are being selected in the store during user navigation, and another called Purchases with the completed transaction records. The process of segmenting entities in a database is called normalization. The format of a normalized transactional database is optimized for transactional operations, but it is not the best format for data exploration and analysis.
The following is an example of a relational transactional database:
The preceding figure demonstrates a relational database of transactional workloads in a sales and delivery system. We can see the main entity, Orders, joined to Employees, Shippers, Customers, and Order Details, which then detail all products of this order in the relationship with the Products entity, which looks for information in the Categories and Suppliers entities.
Analytical databases (OLAP)
When the data solution requires a good interface for queries, explorations, and data analysis, the data storage organization is different from transactional databases. To meet this requirement, we prioritize the data aggregations and relationships for data consumption and exploration; this specialized data storage is called an analytical database.
Analytical databases use a process called online analytical processing (OLAP) and have undergone a great evolution in recent years with the emergence of data warehouses and big data platforms.
Analytical databases are constituted through a process of data ingestion, and they are responsible for processing and transforming the data into insights and information and then making this processed information available for consumption. The following steps describe this process:
- Data ingestion -- The process responsible for connecting to transactional databases or other data sources to collect raw transaction information and include it in the analytical database
- Data processing -- The process performed by the OLAP platform to create a data model, organize entities, perform indicator calculations, and define metrics for data consumption
- Data query -- After the data model is loaded with the proper organization for querying, data manipulation and reporting tools can connect to the OLAP platform to perform your queries
The following diagram is an example of a structured data model in an OLAP database:
The following diagram is a simple comparison of OLTP and OLAP databases:
The preceding figure demonstrates the traditional flow of data, which is sourced and stored in transactional OLTP databases and then moved to OLAP analytical databases for data intelligence generation.
IMPORTANT NOTE
There are modern data storage platforms that aim to unite OLTP and OLAP on the same platform, but these databases, often called NewSQL, still need to mature their structures to deliver the best of transactional and analytical worlds in the same database. The industry standard is to keep transactional and analytical data structures separate.
A transactional workload
Relational and non-relational databases can be used as solutions for transactional workloads, which are the databases used to perform basic data storage operations: create, read, update, and delete (CRUD). Transactional operations must be done in sequence, with a transaction control that only confirms the conclusion of this transaction (a process called a commit) when the entire operation is successfully executed. If this does not occur, the transaction is canceled, and all processes are not performed, thus generating a process called rollback.
An important idea to help understand the difference between relational and non-relational databases is ACID, present in most database technologies. These properties are as follows:
- Atomicity: This is the property that controls the transaction and defines whether it was successfully performed completely to commit or must be canceled by performing a rollback. Database technology should ensure atomicity.
- Consistency: For a running transaction, it is important to evaluate consistency between the database state before receiving the data and the database state after receiving the data. For example, in a bank transfer, when funds are added to an account, those funds must have a source. Therefore, it is important to know this source and whether the fund's source exit process has already been performed before confirming the inclusion in this new account.
- Isolation: This property evaluates whether there are multiple executions of transactions similar to the current one and if so, it keeps the database in the same state. It then evaluates whether the execution of transactions was sequential. In the bank transfer example, if multiple transactions are sent simultaneously, it checks whether the amounts have already left the source for all transactions, or you need to review one by one, transaction per transaction.
- Durability: This is responsible for evaluating whether a transaction remains in the committed database even if there is a failure during the process, such as a power outage or latency at the time of recording the record.
ACID properties are not unique to transactional databases; they are also found in analytic databases. At this point, the most important thing is to understand that these settings exist, and you can adjust them as per the requirements of your data solution use case.
Database management systems
Database management systems (DBMSs), which are database software, have ACID properties within their architecture, and in addition to performing these controls, they need to manage several complex situations. For example, if multiple users or systems try to access or modify database records, the database systems need to isolate transactions, perform all necessary validations quickly, and maintain the consistency of the data stored after the transaction is committed. For this, some DBMS technologies work with temporary transaction locks, so that actions are done sequentially. This lock is done during the process of an action executing in that record; for example, in an edit of a field in a table, the lock ends as soon as the commit is executed, confirming that transaction.
Some DBMSs are called distributed databases. These databases have their architecture distributed in different storage and processing locations, which can be on-premises in the company's data center or a different data center in the cloud. Distributed database solutions are widely used to maintain consistency in databases that will serve applications in different geographic locations, but this consistency doesn't need to be synchronous. For example, a mobile game can be played in the United States and Brazil, and the database of this game has some entities (categories, game modes, and so on) that must be shared among all players. But the transactions from the United States player do not necessarily need to appear to the player in Brazil in a real-time way; this transactional data will be synchronized from the United States to Brazil, but in an asynchronous process. Let's understand this process next.
Eventual consistency
All transactions in distributed databases take longer to process than in undistributed databases because it is necessary to replicate the data across all nodes in this distributed system. So, to maintain an adequate replication speed, the distributed databases only synchronize the data that is needed. This is the concept of eventual consistency, which configures ACID to perform replication between the distributed nodes asynchronously, after the confirmation of the transaction on the main node of the database is created. This technique can lead to temporary inconsistencies between database nodes. Ideally, the application connected to a distributed database does not require a guarantee of data ordering. It means that the data relating to this eventual consistency may appear to users with an eventual delay as well. Distributed databases are widely used by social media platforms, for news feeds, likes, and shares, among other features.
In addition to transactional, relational, or non-relational databases, we also have another data workload, the analytical workload, which we will address in the next section.
An analytical workload
The second category of data solutions is the analytical workloads. These analytical solutions are based on high-volume data processing platforms, optimized for querying and exploring, and not for CRUD transactions or with ACID properties. In analytical databases, we aggregate various data sources, such as more than one transactional database, as well as logs, files, images, videos, and everything that can generate information for a business analyst.
This raw data is processed and aggregated, thus generating summaries, trends, and predictions that can support decision-making.
An analytical workload can be based on a specific time or a sequence of dated events. In these workloads, it's common to evaluate only the data that is relevant to the analysis. For example, if you have a sales system with a transactional database (source) with several tables recording all sales, products, categories, and customers, among others, it is important to evaluate which of these tables can be used for the analytical database (destination) and then perform the data connections.
To create an analytical database, it is necessary to perform data ingestion, a process of copying data from sources to the analytical base. For this, a technique called extract, transform, and load (ETL), or the more recent extract, load, and transform (ELT), is used. The following figure demonstrates this process with an example of a transactional database as the data source and the analytical database as the destination:
In the preceding diagram, we can see that transactional databases are storages of information systems that automate business processes. Analytical databases act on simple and advanced data analysis, using, for example, statistical models with the application of machine learning, a branch of artificial intelligence. The data ingestion process is an important process for assembling an analytical database that meets the data solution.