Data Storages
Architecture | Total cost of solution | Flexibility of scenarios | Complexity of development | Maturity of ecosystem | Organizational maturity required |
---|---|---|---|---|---|
Cloud data warehouse | High - cloud data warehouses rely on proprietary data formats and offer a bundled end-to-end solution, so the cost is high | Low - cloud data warehouses are optimized for BI/SQL-based scenarios; support for data science and exploratory scenarios exists but is restricted by format constraints | Low - there are fewer moving parts, and you can get started almost immediately with an end-to-end solution | High - for SQL/BI scenarios, Low - for other scenarios | Low - the tools and ecosystem are largely well understood and ready to be consumed by organizations of any shape or size |
Modern data warehouse | Medium - data preparation and historical data can move to the data lake at lower cost, but you still need a cloud data warehouse, which is expensive | Medium - a diverse ecosystem of tools and more exploratory scenarios are supported in the data lake, but correlating data across the warehouse and the data lake requires data copies | Medium - the data engineering team needs to ensure that the data lake design is efficient and scalable; plenty of guidance and considerations are available, including this book | Medium - the data preparation and data engineering ecosystem, such as Spark/Hadoop, has a higher maturity but needs tuning for performance and scale; High - for consumption via the data warehouse | Medium - the data platform team needs to be skilled up to understand the needs of the organization and to make the right design choices, at least enough to support those needs |
Data lakehouse | Low - the data lake storage acts as the unified repository with no data movement required; compute engines are largely stateless and can be spun up and down on demand | High - a diverse ecosystem lets you run more scenarios, including exploratory analysis such as data science, and makes it easy to share data between BI and data science teams | Medium to High - requires a careful choice of the right datasets and of the open data format needed to support the lakehouse architecture | Medium to High - while technologies such as Delta Lake, Apache Iceberg, and Apache Hudi are gaining maturity and adoption, this architecture still requires thoughtful design today | Medium to High - the data platform team needs to be skilled up to understand the needs of the organization and the technology choices, which are still new |
Data mesh | Medium - while the distributed design keeps cost lower, a lot of investment is required in automation, blueprint, and data governance solutions | High - flexibility in supporting different architectures and solutions in the same organization, and no bottlenecks on a lean central organization | High - this relies on an end-to-end automated solution and an architecture that scales to 10x growth and sharing across architectures and cloud solutions | Low - relatively nascent in guidance and available toolsets | High - the data platform team and product/domain teams need to be skilled up in data lakes |
Cost versus complexity of cloud data lake architectures
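
One way to put the table to work is to encode a few of its columns as data and filter on your own constraints. The sketch below is illustrative only: the numeric scale (Low = 1, Medium = 2, High = 3), the choice to score "Medium to High" as 2.5, and the `shortlist` helper are assumptions made for this example, and only the cost, flexibility, and complexity columns are included.

```python
# Ordinal scale for the table's ratings (an assumption for illustration).
LOW, MEDIUM, HIGH = 1.0, 2.0, 3.0
MEDIUM_TO_HIGH = 2.5  # assumption: midpoint for "Medium to High"

# Ratings copied from the cost, flexibility, and complexity columns.
RATINGS = {
    "Cloud data warehouse": {"cost": HIGH, "flexibility": LOW, "complexity": LOW},
    "Modern data warehouse": {"cost": MEDIUM, "flexibility": MEDIUM, "complexity": MEDIUM},
    "Data lakehouse": {"cost": LOW, "flexibility": HIGH, "complexity": MEDIUM_TO_HIGH},
    "Data mesh": {"cost": MEDIUM, "flexibility": HIGH, "complexity": HIGH},
}


def shortlist(max_cost=HIGH, min_flexibility=LOW, max_complexity=HIGH):
    """Return the architectures that satisfy every constraint (hypothetical helper)."""
    return [
        name
        for name, rating in RATINGS.items()
        if rating["cost"] <= max_cost
        and rating["flexibility"] >= min_flexibility
        and rating["complexity"] <= max_complexity
    ]


if __name__ == "__main__":
    # Teams that need high flexibility at no more than medium cost.
    print(shortlist(max_cost=MEDIUM, min_flexibility=HIGH))
    # Teams that can only take on low development complexity.
    print(shortlist(max_complexity=LOW))
```

A filter like this only narrows the field. The qualitative notes in each cell, along with the ecosystem and organizational maturity columns, still need to be weighed by hand.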