Data Storages

| Architecture | Total cost of solution | Flexibility of scenarios | Complexity of development | Maturity of ecosystem | Organizational maturity required |
| --- | --- | --- | --- | --- | --- |
| Cloud data warehouse | High: cloud data warehouses rely on proprietary data formats and are sold as an end-to-end solution, so the cost is high | Low: optimized for BI/SQL-based scenarios; support for data science and exploratory scenarios exists but is restricted by format constraints | Low: there are fewer moving parts, and you can get started almost immediately with an end-to-end solution | High for SQL/BI scenarios; low for other scenarios | Low: the tools and ecosystem are largely well understood and ready to be consumed by organizations of any shape or size |
| Modern data warehouse | Medium: data preparation and historical data can be moved to the data lake at lower cost, but a cloud data warehouse, which is expensive, is still needed | Medium: the data lake supports a diverse ecosystem of tools and more exploratory scenarios, but correlating data in the warehouse with data in the data lake requires data copies | Medium: the data engineering team needs to ensure that the data lake design is efficient and scalable; plenty of guidance and considerations are available, including this book | Medium for data preparation and data engineering: the ecosystem, such as Spark/Hadoop, has a higher maturity but needs tuning for performance and scale; high for consumption via the data warehouse | Medium: the data platform team needs the skills to understand the needs of the organization and to make the right design choices to support them |
| Data lakehouse | Low: the data lake storage acts as the unified repository with no data movement required; compute engines are largely stateless and can be spun up and down on demand | High: a diverse ecosystem supports more scenarios, including exploratory analysis such as data science, and makes it easy to share data between BI and data science teams | Medium to high: requires a careful choice of the right datasets and of the open data format needed to support the lakehouse architecture | Medium to high: technologies such as Delta Lake, Apache Iceberg, and Apache Hudi are gaining maturity and adoption, but today this architecture requires thoughtful design | Medium to high: the data platform team needs the skills to understand the needs of the organization and technology choices that are still new |
| Data mesh | Medium: the distributed design keeps cost lower, but a lot of investment is required in automation, blueprints, and data governance solutions | High: supports different architectures and solutions within the same organization, with no bottleneck on a lean central organization | High: relies on an end-to-end automated solution and an architecture that scales to 10x growth and to sharing across architectures and cloud solutions | Low: guidance and available toolsets are relatively nascent | High: the data platform team and the product/domain teams need to be skilled in data lakes |

Cost versus complexity of cloud data lake architectures
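
One way to put the comparison to work is to treat it as a small decision matrix. The sketch below is illustrative only: the numeric encoding of the ratings (`LEVELS`) and the helper `shortlist` are assumptions introduced here, not part of the original comparison, and only the first three rating columns are encoded.

```python
# Ordinal encoding of the qualitative ratings. The numbers are an
# assumption made for illustration; the table only gives the labels.
LEVELS = {"Low": 1.0, "Medium": 2.0, "Medium to High": 2.5, "High": 3.0}

# (total cost, flexibility of scenarios, complexity of development),
# copied from the ratings in the table above.
RATINGS = {
    "Cloud data warehouse": ("High", "Low", "Low"),
    "Modern data warehouse": ("Medium", "Medium", "Medium"),
    "Data lakehouse": ("Low", "High", "Medium to High"),
    "Data mesh": ("Medium", "High", "High"),
}


def shortlist(max_cost, min_flexibility, max_complexity):
    """Return the architectures whose ratings fall within the given bounds.

    Hypothetical helper: each argument is one of the keys of LEVELS.
    """
    matches = []
    for name, (cost, flexibility, complexity) in RATINGS.items():
        if (
            LEVELS[cost] <= LEVELS[max_cost]
            and LEVELS[flexibility] >= LEVELS[min_flexibility]
            and LEVELS[complexity] <= LEVELS[max_complexity]
        ):
            matches.append(name)
    return matches


# A team that wants high flexibility at no more than medium cost and
# can tolerate high development complexity:
print(shortlist("Medium", "High", "High"))
```

The qualitative notes in each cell still matter more than the encoded level, so a filter like this is best used to narrow the options before reading the details, not to make the final choice.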