📄️ Basics
What is Data Engineering?
📄️ Data Pipelines
📄️ OLTP vs OLAP
Transactional databases (OLTP)
📄️ Data Storages
| Architecture | Total cost of solution | Flexibility of scenarios | Complexity of development | Maturity of ecosystem | Organizational maturity required |
📄️ SQL vs NoSQL
As you design large systems (or even smaller ones), you need to decide how data flows into the system, how it is processed, and how it flows back out.
📄️ Big Data
Six Vs of big data
📄️ Batch vs Incremental
The idea behind incremental processing is quite simple. Incremental processing extends the semantics of stream processing to batch pipelines: each run processes only the new data and incrementally updates the results. This unlocks significant cost savings, because batch runs become much shorter, and improves data freshness, because the pipelines can be run far more frequently.
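The core mechanic can be sketched as a watermark: each run reads only rows newer than the last processed timestamp and merges the partial result into running state. This is a minimal, hypothetical illustration in plain Python (real systems persist the watermark and state in a store, not in memory):

```python
# Hypothetical sketch of incremental (watermark-based) processing:
# each run reads only rows newer than the last processed timestamp,
# then merges the partial aggregate into the running result.
events = [
    {"ts": 1, "amount": 10},
    {"ts": 2, "amount": 20},
    {"ts": 3, "amount": 5},
]

state = {"watermark": 0, "total": 0}  # persisted between runs in practice

def run_incremental(events, state):
    # Select only rows that arrived after the last run's watermark.
    new_rows = [e for e in events if e["ts"] > state["watermark"]]
    # Incrementally update the running aggregate instead of recomputing it.
    state["total"] += sum(e["amount"] for e in new_rows)
    if new_rows:
        state["watermark"] = max(e["ts"] for e in new_rows)
    return state

run_incremental(events, state)          # first run processes all 3 rows
events.append({"ts": 4, "amount": 7})   # new data arrives later
run_incremental(events, state)          # second run processes only ts=4
print(state["total"])                   # 42
```

The second run touches one row instead of four; at scale, that difference is where the cost and freshness gains come from.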
📄️ Data Contract
Example
📄️ Data Governance
What is Data Governance?
📄️ Data Management
Importance of data management
📄️ Data Quality
Do your product dashboards look funky? Are your quarterly reports stale? Is the data set you’re using broken or just plain wrong? Have you ever been about to sign off after a long day running queries or building data pipelines only to get pinged by your head of marketing that “the data is missing” from a critical report? What about a frantic email from your CTO about “duplicate data” in a business intelligence dashboard? Or a memo from your CEO, the same one who is so bullish on data, about a confusing or inaccurate number in his latest board deck? If any of these situations hit home for you, you’re not alone. These problems affect almost every team, yet they’re usually addressed on an ad hoc basis and in a reactive manner.
📄️ Batch Data Processing
Data processing involves taking source data that has been ingested into your data platform and cleansing it, combining it, and modeling it for downstream use. Historically, the most popular way to transform data has been SQL, and data engineers have built transformation pipelines in SQL, often with the help of ETL/ELT tools. More recently, many have also adopted the DataFrame API in languages like Python with Spark for this task. For the most part, a data engineer can accomplish the same transformations with either approach, and deciding between the two is mostly a matter of preference and the particular use case. That said, some transformations can't be expressed in SQL and need a different approach; the most popular choice for those cases is Python/Spark with a DataFrame API.
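To make the equivalence concrete, here is the same daily aggregation done both ways. This is a hedged sketch: the SQL side uses Python's built-in sqlite3, and the second part uses plain Python as a stand-in for the row-by-row style a DataFrame API (pandas, Spark) would express more declaratively; the table and column names are invented for illustration.

```python
import sqlite3
from collections import defaultdict

rows = [("2024-01-01", "a", 10), ("2024-01-01", "b", 20), ("2024-01-02", "a", 5)]

# SQL approach: load the rows into a table and aggregate with GROUP BY.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, product TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT day, SUM(amount) FROM sales GROUP BY day ORDER BY day"
).fetchall())

# Programmatic approach: the same group-by-and-sum, expressed in code
# (a DataFrame API would express this as a groupby/agg chain).
py_result = defaultdict(int)
for day, product, amount in rows:
    py_result[day] += amount

assert sql_result == dict(py_result)
print(sql_result)  # {'2024-01-01': 30, '2024-01-02': 5}
```

Both produce identical results; the choice usually comes down to team skills, tooling, and whether the transform fits naturally into declarative SQL.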
📄️ Stream and Unified Data Processing
What is an event stream?
📄️ Orchestration
Workflow orchestration tools are software platforms that help organizations manage and automate complex business processes across different systems and teams. These tools allow businesses to define, schedule, monitor, and manage workflows, which can help streamline operations, reduce errors, and increase productivity.
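At their core, these tools model a workflow as a DAG of tasks and execute it in dependency order. A minimal sketch of that idea, using Python's standard-library `graphlib` (real orchestrators such as Airflow, Dagster, or Prefect layer scheduling, retries, and monitoring on top; the task names here are illustrative):

```python
# Minimal sketch of the core idea behind a workflow orchestrator:
# tasks form a DAG, and the scheduler runs them in dependency order.
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'load', 'report']
```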
📄️ 25 Most Common Interview Questions
📄️ 50 Most Common Interview Questions
📄️ 50 Most Asked Data Engineer Interview Questions and Answers in 2023