Data Engineering

📄️ Data Quality

Do your product dashboards look funky? Are your quarterly reports stale? Is the dataset you're using broken or just plain wrong? Have you ever been about to sign off after a long day running queries or building data pipelines, only to get pinged by your head of marketing that "the data is missing" from a critical report? What about a frantic email from your CTO about "duplicate data" in a business intelligence dashboard? Or a memo from your CEO, the same one who is so bullish on data, about a confusing or inaccurate number in his latest board deck? If any of these situations hit home for you, you're not alone. These problems affect almost every team, yet they're usually handled ad hoc and reactively.

📄️ Batch Data Processing

Data processing involves taking source data that has been ingested into your data platform and cleansing it, combining it, and modeling it for downstream use. Historically, the most popular way to transform data has been with SQL, and data engineers have built data transformation pipelines in SQL, often with the help of ETL/ELT tools. More recently, many teams have also begun adopting the DataFrame API, most commonly in Python with Spark, for this task. For the most part, a data engineer can accomplish the same transformations with either approach, and deciding between the two is largely a matter of preference and the particular use case. That said, there are cases where a transform can't be expressed in SQL and a different approach is needed; the most popular choice for those cases is Python/Spark with the DataFrame API.
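
To make the comparison concrete, here is a minimal sketch in PySpark (the `orders` table, its columns, and the sample rows are hypothetical) that expresses the same daily-revenue aggregation once in SQL and once with the DataFrame API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-transform-sketch").getOrCreate()

# Hypothetical raw orders data, standing in for a table already ingested
# into the data platform.
orders = spark.createDataFrame(
    [
        ("o1", "c1", 120.0, "2024-01-03"),
        ("o2", "c1", 80.0, "2024-01-04"),
        ("o3", "c2", 200.0, "2024-01-04"),
    ],
    ["order_id", "customer_id", "amount", "order_date"],
)
orders.createOrReplaceTempView("orders")

# SQL approach: the transform expressed declaratively.
daily_revenue_sql = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

# DataFrame API approach: the same transform expressed programmatically.
daily_revenue_df = (
    orders
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue_sql.show()
daily_revenue_df.show()
```

Both versions produce the same result; the SQL form is often easier to hand off to analysts, while the DataFrame form composes naturally with Python code, tests, and transforms that don't map cleanly onto SQL.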