Whether you work for a global enterprise or a local business, data quality is vital for analysis and reporting. I’ve talked about this before in How to Improve Data Quality and Designing for Data Protection but today I’m exploring a structured approach to data quality known as the medallion architecture.
Data can come from a variety of systems, including legacy systems that have been used for many years. Some well-designed relational database systems will have tight constraints to help improve data quality, but others may have a lot of errors. You therefore need a way of getting your data into a format that can be trusted by decision-makers.
This architecture comes from the world of big data and data lakes, but the ideas behind it are useful for all data projects. It’s called the medallion architecture, and it processes data in three distinct stages: Bronze, Silver, and Gold—just like Olympic medals. Clever, right?
Bronze layer: raw data
Data from source systems are imported “raw” without making any changes to the data. The purpose of this stage is to:
• Validate import integrity. Ensure no data are missing, the original schema has been preserved, data has not been corrupted, etc.
• Add meta data. Columns are added to identify the import date and time, originating system etc.
• Provide an audit trail. This bronze layer data is not modified, and so can be used to validate queries that emerge in later stages.
• Avoid re-importing. This initial stage might not be the most glamorous, but it provides the foundation for everything that follows and needs to be done with care. You want to avoid reimporting the data if problems emerge later.
Data are appended to the bronze layer periodically, and so files will increase in size over time. Data in the bronze layer is never accessed directly by business users, data scientists, or analysts. Instead, it forms the foundation for the silver and gold layers.
Silver layer: clean data
The silver layer uses data from the bronze layer and is never created directly from source data. The purpose of the silver layer is to:
• Clean the data. Fix issues such as missing or null values, deduplicating, dealing with out-of-range values, data types, normalization, and other data quality issues.
• Validate the data. Check no errors have been introduced by comparing and testing against the bronze layer.
• Normalize the data. Data may be split separate tables reading for processing at the gold layer.
Data is typically not aggregated at the silver layer but if aggregation is done, at least one non-aggregated record is preserved. Data at the silver layer might be used by data scientists or analysts. Business users would normally have access to the gold layer.
Gold layer: business-ready data
The gold layer is where data becomes business-reporting-ready. At this stage:
• Data is denormalized and aggregated according to business needs.
• Data models and measures are created, in line with how users want to query and analyse the data.
The gold layer is focussed on optimizing the data for business intelligence reporting. Data presented in a format suitable for business users to work with tools such as Power BI to create dashboards and reports.
Data-led decisions need gold-standard data
I may be overdoing the Olympic theme, but the concepts behind the medallion architecture are now considered best practice. It is built on good data practices that actually work. In summary:
1. Separate data ingestion and validation from later stages. Preserve this “raw” version of the data to create a base from which to validate future processing.
2. Manage data quality issues as an intermediate step before aggregating, modelling or creating measures.
3. Optimize for business use. Create aggregations and measures suited to business reporting requirements at the final stage.
If you are tackling a reporting project, moving data to the cloud, or need help improving your data’s reliability, get in touch for a chat. We have decades of experience in improving data quality and data manipulation. Together we could make a winning team to turn raw data into golden insights (alright, alright, enough).
No comments:
Post a Comment