Introduction
In data engineering and data warehousing, two terms come up again and again: the "curated layer" and the "landing layer". These architectural layers serve distinct purposes in organizing, processing, and safeguarding data. This blog post explains what the terms mean, when each layer is appropriate, and when it may not be the best choice.
What Are the Curated Layer and the Landing Layer?
Landing Layer:
The landing layer, sometimes called the "raw" or "staging" layer, is the initial storage location where data lands as it is ingested from various sources. Data is stored as-is, without significant processing or transformation, and the area is often treated as transient, although raw data may also be retained for audit purposes. The landing layer holds the most granular, unprocessed form of the data, such as raw log files, database backups, or data files from external systems.
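To make "stored as-is" concrete, here is a minimal sketch of landing a file, assuming a file-system-based landing area. The function name `land_file`, the folder layout, and the manifest fields are illustrative assumptions, not a standard; the point is only that the data itself is copied untouched while ingestion details are recorded alongside it.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def land_file(source_file: Path, landing_root: Path, source_name: str) -> Path:
    """Copy a source file into the landing layer byte-for-byte."""
    now = datetime.now(timezone.utc)
    # Hypothetical layout: <landing_root>/<source_name>/<YYYY-MM-DD>/<file>
    target_dir = landing_root / source_name / now.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / source_file.name
    shutil.copy2(source_file, target)  # no parsing, no transformation

    # Sidecar manifest: facts about the ingestion, kept apart from the data.
    manifest = {
        "source": source_name,
        "ingested_at": now.isoformat(),
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
    }
    manifest_path = target.with_name(target.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return target
```

Keeping the checksum and timestamp in a separate manifest, rather than adding columns to the file, is one way to support traceability without altering the raw data.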
Curated Layer:
The curated layer, also referred to as the "refined" or "processed" layer, is a more refined and structured storage area where the data undergoes cleansing, transformation, and enrichment. In this layer, data engineers apply data quality checks, perform data normalization, apply business rules, and aggregate or summarize data to make it suitable for analysis and reporting. The curated layer aims to provide reliable, high-quality data that is organized and optimized for specific use cases, such as data analytics, business intelligence, or machine learning.
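The steps described above (cleansing, quality checks, normalization, aggregation) can be sketched in a few lines. This is only an illustration under assumed rules: the order fields, the "amount must be non-negative" check, and the per-country summary are hypothetical examples, not rules taken from any particular system.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def curate_orders(raw_rows):
    """Turn raw order rows into cleansed rows plus a per-country summary."""
    clean, rejected = [], []
    for row in raw_rows:
        try:
            order_id = row["order_id"]
            amount = Decimal(row["amount"].strip())
        except (KeyError, AttributeError, InvalidOperation):
            rejected.append(row)  # missing or unparseable values
            continue

        # Normalization: trim whitespace and upper-case the country code.
        country = (row.get("country") or "").strip().upper()

        # Data quality checks (assumed rules): finite, non-negative amount
        # and a non-empty country.
        if not amount.is_finite() or amount < 0 or not country:
            rejected.append(row)
            continue

        clean.append({"order_id": order_id, "country": country, "amount": amount})

    # Aggregation: total order amount per country.
    totals = defaultdict(Decimal)
    for row in clean:
        totals[row["country"]] += row["amount"]
    return clean, rejected, dict(totals)
```

Returning the rejected rows instead of silently dropping them keeps the curated output explainable: anyone can see which raw records did not make it through and investigate why.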
Tracing the Origins: Understanding the Concept and Evolution of Landing, Raw, Staging, Curated, Refined, and Processed Layers in Data Engineering
The terms "landing layer" and "staging area" have been used in computing and data management for many years. Both borrow from the idea of a landing or staging area in the physical world. In logistics, for instance, a staging area is a place where goods are prepared and organized for further transportation or processing. Similarly, in theatre, a staging area is where props and sets are prepared before being moved onto the stage for a performance.
The term raw data is also derived from the physical world. In many industries, "raw" materials are those that are in their natural state and haven't yet been processed or refined. In a similar vein, "raw" data refers to data that is in its original state, as it was captured at the source, without any transformations, corrections, or enhancements applied. This data is often unstructured and can contain errors or inconsistencies, which is why it typically undergoes various transformations and cleaning processes in the data warehouse before it's ready for analysis.
The term curated layer traces its roots back to the concept of "curation", which traditionally describes the work of a curator in a museum or gallery, who selects and interprets items to create a meaningful exhibition. In the context of data, curation involves selecting, organizing, and maintaining data, ensuring its quality, relevance, and readiness for consumption.
When Should We Use Them?
Curated Layer:
The curated layer comes into play when there's a need for standardized, clean, and reliable data for analysis and reporting. This layer is beneficial in scenarios where:
- Complex transformations and business rules need to be applied to raw data.
- Data from various sources needs to be aggregated and conformed to provide a unified view.
- There's a need for a centralized data repository that can feed multiple downstream applications or processes.
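As a rough sketch of what "aggregated and conformed to provide a unified view" can look like, the example below renames fields from two sources into one shared schema and merges them. The source names (`crm`, `webshop`), the column names, and the "first source wins" rule for duplicates are all assumptions made for illustration.

```python
# Hypothetical field mappings: source column name -> unified column name.
FIELD_MAPS = {
    "crm": {"CustomerID": "customer_id", "EMail": "email"},
    "webshop": {"user_id": "customer_id", "mail_address": "email"},
}


def conform(source, record):
    """Rename a source record's fields to the unified schema and tag its origin."""
    mapping = FIELD_MAPS[source]
    unified = {target: record[src] for src, target in mapping.items()}
    unified["customer_id"] = str(unified["customer_id"])  # one key type everywhere
    unified["email"] = unified["email"].strip().lower()
    unified["source_system"] = source
    return unified


def unified_view(batches):
    """Merge conformed records from every source, one row per customer_id."""
    by_id = {}
    for source, records in batches.items():
        for record in records:
            row = conform(source, record)
            # Assumed conflict rule: the first source listed wins.
            by_id.setdefault(row["customer_id"], row)
    return list(by_id.values())
```

Keeping the `source_system` tag on each row means downstream consumers of the centralized view can still tell where a record came from.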
Landing Layer:
The landing layer is a common part of data warehousing architectures, serving as the gateway for incoming data. It is particularly useful when:
- Data from multiple sources needs to be captured in its original form.
- There's a need to retain raw data for traceability, audit, or compliance purposes.
- Initial data quality checks need to be performed before any transformation.
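The initial checks mentioned in the last point are usually structural rather than business-level. Here is a minimal sketch for a landed CSV file, assuming the expected column names are known in advance; the function name and the specific checks are illustrative choices, not a fixed standard.

```python
import csv
import io


def check_landed_csv(text, expected_columns):
    """Return a list of problems found in a landed CSV; empty means it passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header, data = rows[0], rows[1:]
    if header != list(expected_columns):
        problems.append(f"unexpected header: {header}")
    if not data:
        problems.append("no data rows")

    # Row numbers count parsed CSV rows, with the header as row 1.
    for row_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} fields, got {len(row)}"
            )
    return problems
```

Note that the check only reports problems and never modifies the file, so the landed data stays in its original form whether or not it passes.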
When Shouldn't We Use Them?
Curated Layer:
While the curated layer provides numerous benefits, it is not always the best fit. It may be a poor choice when:
- Real-time or near-real-time data analysis is required. The transformations and aggregations applied in the curated layer add processing time and can introduce latency.
- The data use case is exploratory in nature and requires raw, unaltered data.
- Storage and processing resources are limited. Maintaining a curated layer can be resource-intensive due to the need for continuous data processing.
Landing Layer:
The landing layer, though fundamental to data warehousing, might not be necessary in situations where:
- The data sources are trusted, and data quality is assured, reducing the need for initial checks and audits.
- Data volume is low, and the overhead of maintaining a separate layer for raw data is not justified.
- Data is directly ingested into the system in a pre-transformed, ready-to-use format, bypassing the need for a landing layer.
In summary, the curated and landing layers are essential components of a well-structured data warehouse. They ensure a streamlined flow of data from the point of entry to its final use in reporting and analytics. However, their use is not without considerations, and an understanding of their purpose