What is a Data Mesh?
A data mesh is an architectural framework that solves advanced data security challenges through distributed, decentralized ownership. Organizations have multiple data sources from different lines of business that must be integrated for analytics. A data mesh architecture unites these disparate data sources and links them together through centrally managed data-sharing and governance guidelines. Business functions maintain control over how shared data is accessed, who accesses it, and in what formats it is accessed. A data mesh adds architectural complexity but brings efficiency by improving data access, security, and scalability.
What challenges does a data mesh solve?
Even though organizations have access to ever-increasing data volumes, they must sort, filter, process, and analyze the data to derive practical benefits. Organizations often rely on a central team of data engineers and data scientists to manage data. The team uses a centralized data platform for the following purposes:
- Ingest the data from all the different business units (or business domains).
- Transform the data into a consistent, trustworthy, and useful format. For example, the team could make sure all dates in the system are in a common format or summarize daily reports.
- Prepare the data for data consumers, for example by generating reports for humans or preparing XML files for applications (a minimal sketch of these pipeline steps follows this list).
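The following sketch illustrates what one such centralized transform-and-prepare step could look like. It assumes pandas is available; the file name and column names (order_date, amount) are hypothetical placeholders, not part of any specific platform.

```python
# Minimal sketch of a centralized pipeline step: normalize dates to one format
# and summarize daily orders before publishing them for data consumers.
# The source file and column names are hypothetical.
import pandas as pd

def transform_daily_orders(path: str) -> pd.DataFrame:
    orders = pd.read_csv(path)
    # Normalize all order dates to a single ISO 8601 format.
    orders["order_date"] = pd.to_datetime(orders["order_date"]).dt.strftime("%Y-%m-%d")
    # Summarize into a daily report for downstream consumers.
    return orders.groupby("order_date", as_index=False)["amount"].sum()

if __name__ == "__main__":
    report = transform_daily_orders("orders.csv")
    # Serve the prepared output to consuming applications, here as JSON.
    report.to_json("daily_report.json", orient="records")
```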
As data volume increases, organizations face increasing costs to maintain the same agility as before. Such a monolithic system is difficult to scale for the following reasons.
Siloed data team
The central data team has specialist data scientists and engineers with limited business and domain knowledge. However, they still have to provide data for a diverse set of operational and analytical needs without a clear understanding of the motivation behind those needs.
Slow responsiveness to change
Data engineers typically implement pipelines that ingest the data and transform it over several steps before storing it in a central data lake. Any requested changes require modifications to the entire pipeline. The central team has to make these changes while managing conflicting priorities and with limited business domain knowledge.
Reduced accuracy
Business units are disconnected from the data consumers and the central data teams. As a result, they lack the incentive to provide meaningful, correct, and useful data.
What are the benefits of a data mesh?
Over time, a centralized data platform architecture can result in frustrated data consumers, disconnected data producers, and an overloaded data management team. Data mesh architecture attempts to solve these challenges by giving business units high autonomy over, and ownership of, their data domains. The benefits of data mesh architecture are given below.
Democratic data processing
A data mesh transfers data control to domain experts who create meaningful data products within a decentralized governance framework. Data consumers request access to the data products and seek approvals or changes directly from data owners. As a result, everyone gets faster access to relevant data, which improves business agility.
Increased flexibility
Centralized data infrastructure is more complex and requires cross-team collaboration to maintain and modify. A data mesh instead moves the technical implementation from the central system into the business domains. This removes central data pipelines and reduces operational bottlenecks and technical strain on the system.
Cost efficiency
A distributed data architecture moves away from batch processing and instead promotes the adoption of real-time data streaming. You gain visibility into resource allocation and storage costs, resulting in better budgeting and reduced costs.
Improved data discovery
A data mesh model prevents data silos from forming around central engineering teams. It also reduces the risk of data assets getting locked within different business domain systems. Instead, the central data management framework governs and records the data available in the organization. For example, domain teams automatically register their data in a central registry.
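As a sketch of that registration step, a domain team could record its data product in a central AWS Glue Data Catalog using boto3. This is a minimal, hedged example: the database name, table name, columns, and S3 location are hypothetical, and a real registration would typically include more storage and serialization details.

```python
# Minimal sketch: a domain team registers its data product in a central
# AWS Glue Data Catalog so it is discoverable across the organization.
# Database, table, and S3 location names are hypothetical.
import boto3

glue = boto3.client("glue")

glue.create_table(
    DatabaseName="clothing_domain",  # catalog database owned by the clothing domain
    TableInput={
        "Name": "clothing_products",
        "Description": "Curated clothing product data product",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "product_id", "Type": "string"},
                {"Name": "category", "Type": "string"},
                {"Name": "list_price", "Type": "double"},
            ],
            "Location": "s3://clothing-domain/data-products/clothing-products/",
        },
        "Parameters": {"owner": "clothing-domain-team", "classification": "parquet"},
    },
)
```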
Strengthened security and compliance
Data mesh architectures enforce data security policies both within and between domains. They provide centralized monitoring and auditing of the data sharing process. For example, you can enforce log and trace data requirements on all domains. Your auditors can observe the usage and frequency of data access.
What are the use cases of a data mesh?
A data mesh can support all types of big data use cases. We give some examples below.
Data analytics
Multiple business functions provision trusted, high-quality data for your data analytics workloads. Your teams can use the data to create customized business intelligence dashboards showcasing project performance, marketing results, and operational data. Data scientists can accelerate machine learning projects to derive the full benefits of automation.
Customer care
A data mesh provides a comprehensive view of customers for support and marketing teams. For example, support teams can pull relevant data and reduce average handle time, and marketing teams can ensure they target the right customer demographics in their campaigns.
Regulatory reporting
Regulatory objectives demand data volume, timeliness, and accuracy, which creates challenges for both regulators and regulated firms. All parties can benefit from applying data mesh technologies. For example, organizations can push reporting data into a data mesh that regulators centrally govern.
Third-party data
You can apply data mesh technology for use cases that require third-party and public datasets. You can treat external data as a separate domain and implement it in the mesh to ensure consistency with internal datasets.
What are the principles of data mesh architecture?
Your organization must implement the following four principles to adopt the data mesh paradigm.
Distributed domain-driven architecture
The data mesh approach proposes that data management responsibility is organized around business functions or domains. Domain teams are responsible for collecting, transforming, and providing data related to or created by their business functions. Instead of domain data flowing from data sources into a central data platform, each domain team hosts and serves its datasets in an easily consumable way. For example, a retailer could have a clothing domain with data about its clothing products and a website behavior domain that contains site visitor behavior analytics.
Data as a product
For a data mesh implementation to be successful, every domain team needs to apply product thinking to the datasets they provide. They must consider their data assets as their products and the rest of the organization's business and data teams as their customers.
For the best user experience, the domain data products should have the following basic qualities.
Discoverable
Each data product registers itself with a centralized data catalog for easy discoverability.
Addressable
Every data product should have a unique address that helps data consumers access it programmatically. The address typically follows centrally decided naming standards within the organization.
Trustworthy
Data products define acceptable service-level objectives around how closely the data reflects the reality of the events it documents. For example, the orders domain could publish data after verifying a customer’s address and phone number.
Self-describing
All data products have well-described syntax and semantics that follow standard naming conventions determined by the organization.
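One way to picture these four qualities together is as a data product descriptor that each domain team fills in and registers. The sketch below is illustrative only; the field names, address scheme, and SLO structure are hypothetical conventions, not a prescribed standard.

```python
# Minimal sketch of a data product descriptor capturing the qualities above.
# Field names, the address convention, and the SLO keys are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    # Addressable: a unique, convention-based address for programmatic access.
    address: str                              # e.g. "mesh://orders/verified-orders/v1"
    # Self-describing: schema and semantics documented with the product.
    schema: dict = field(default_factory=dict)
    description: str = ""
    # Trustworthy: service-level objectives the owning team commits to.
    slo: dict = field(default_factory=dict)   # e.g. {"freshness_hours": 24}
    owner: str = ""

    def register(self, catalog: dict) -> None:
        # Discoverable: the product registers itself in a central catalog.
        catalog[self.address] = self

catalog: dict = {}
orders = DataProduct(
    address="mesh://orders/verified-orders/v1",
    schema={"order_id": "string", "order_date": "date", "amount": "double"},
    description="Orders verified against customer address and phone number",
    slo={"freshness_hours": 24},
    owner="orders-domain-team",
)
orders.register(catalog)
```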
Self-serve data infrastructure
A distributed data architecture requires every domain to set up its own data pipeline to clean, filter, and load its own data products. A data mesh introduces the concept of a self-serve data platform to avoid duplication of efforts. Data engineers set up technologies so that all business units can process and store their data products. Self-serve infrastructure thus allows a division of responsibility. Data engineering teams manage the technology while business teams manage the data.
Federated data governance
Data mesh architectures implement security as a shared responsibility within the organization. Leadership determines global standards and policies that you can apply across domains. At the same time, the decentralized data architecture allows a large degree of autonomy on standards and policy implementation within the domain.
How can you build a data mesh in your organization?
Data mesh is an emerging concept that has only gained traction in recent years. Organizations are experimenting with different technologies as they attempt to build a data mesh for specific use cases. However, organization-wide adoption of an enterprise data mesh is still rare. There is no single clear path to data mesh implementation, but here are some suggestions.
Analyze your existing data
Before building a data mesh, you must catalog your existing data and identify relevant business domains. Following certain harmonization rules is key to effectively correlating data between domains. For example, you will need to define global standards for field type formatting, metadata fields, and data product address conventions.
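A simple way to make such global standards actionable is a validation check that domains run before publishing. The sketch below is a hypothetical example: the address pattern, required metadata fields, and ISO date rule stand in for whatever standards your organization actually agrees on.

```python
# Minimal sketch of harmonization checks run before a domain publishes a data
# product. The address convention, metadata fields, and date format below are
# hypothetical examples of centrally agreed global standards.
import re
from datetime import datetime

ADDRESS_PATTERN = re.compile(r"^mesh://[a-z0-9-]+/[a-z0-9-]+/v\d+$")
REQUIRED_METADATA = {"owner", "description", "classification"}

def validate_product(address: str, metadata: dict, sample_dates: list[str]) -> list[str]:
    errors = []
    if not ADDRESS_PATTERN.match(address):
        errors.append(f"address {address!r} does not follow the naming convention")
    missing = REQUIRED_METADATA - set(metadata)
    if missing:
        errors.append(f"missing metadata fields: {sorted(missing)}")
    for value in sample_dates:
        try:
            datetime.strptime(value, "%Y-%m-%d")  # global ISO date standard
        except ValueError:
            errors.append(f"date {value!r} is not in YYYY-MM-DD format")
    return errors

print(validate_product("mesh://orders/verified-orders/v1",
                       {"owner": "orders-domain-team"},
                       ["2024-01-15", "15/01/2024"]))
```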
Implement global data governance policies
Federated data governance requires your central IT team to identify reporting, authentication, and compliance standards for the data mesh. You can also define granular access controls that data product owners apply when hosting their datasets. While data producers define and measure data quality, central governance policies help guide their decisions.
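One way to express this split of responsibility in code is a central policy that bounds what domain owners may grant. The sketch below is a generic, hypothetical model (role names and permission sets are placeholders), not a specific governance product's API.

```python
# Minimal sketch of federated governance: central policy defines which
# permissions each role may receive; data product owners apply grants within
# those bounds. Role and permission names are hypothetical.
CENTRAL_POLICY = {
    "analyst": {"SELECT", "DESCRIBE"},
    "data_engineer": {"SELECT", "DESCRIBE", "INSERT"},
}

def grant_access(grants: dict, product_address: str, role: str, permissions: set[str]) -> None:
    allowed = CENTRAL_POLICY.get(role, set())
    if not permissions <= allowed:
        # Domain owners cannot exceed the centrally defined policy.
        raise PermissionError(f"{role} may not receive {permissions - allowed}")
    grants.setdefault(product_address, {}).setdefault(role, set()).update(permissions)

grants: dict = {}
grant_access(grants, "mesh://orders/verified-orders/v1", "analyst", {"SELECT"})
print(grants)
```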
Build your self-serve data platform
Your self-serve data platform should be generic, so anyone can build new domain data products on it. It should also hide underlying technical complexity and provide infrastructure components in a self-serve manner. Here are some capabilities to include:
- Data encryption
- Data product schema
- Governance and access control
- Data product discovery, such as catalog registration or publishing
- Data product logging and monitoring
- Caching for improved performance
You can also build automation, such as configurations and scripts, to lower the lead time to create data products.
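The sketch below shows what such automation could look like: a single scaffold function a domain team runs to provision a new data product with the capabilities listed above. The platform calls are represented by a stub; every name here is a hypothetical placeholder for your platform's real provisioning services.

```python
# Minimal sketch of self-serve automation for creating a data product.
# _platform_step stands in for real platform services (storage, catalog,
# access control, monitoring); all names are hypothetical.
def _platform_step(action: str, address: str) -> None:
    print(f"[platform] {action}: {address}")  # stand-in for a real platform call

def provision_data_product(domain: str, name: str, owner: str) -> str:
    address = f"mesh://{domain}/{name}/v1"                      # global address convention
    _platform_step("create encrypted storage", address)         # data encryption
    _platform_step("register data product schema", address)     # data product schema
    _platform_step(f"apply default access controls for {owner}", address)  # governance
    _platform_step("publish to central catalog", address)       # discovery
    _platform_step("enable logging and monitoring", address)    # observability
    return address

product_address = provision_data_product("clothing", "clothing-products", "clothing-domain-team")
```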
Choose the right technologies
Your existing traditional storage systems, like data warehouses and data lakes, can also power your data mesh. You just have to shift their use from monolithic systems to multiple decentralized data repositories. A data mesh also enables the adoption of cloud platforms and cloud-centered technologies. Cloud infrastructure reduces operational costs and the effort required to build a data mesh. You must choose a cloud provider with rich data management services to support your data mesh architecture. You will also need to consider data integration requirements with legacy systems.
Start an organization-wide cultural shift
Today we have the technology and tools required to easily build a data mesh with multiple data products. The shift towards the unification of batch and streaming is now easier than ever with tools like Amazon EMR. However, scaling your data mesh beyond small projects necessitates a paradigm shift away from the centralized data architectures of the past. It requires a new language that emphasizes the following:
- Data discovery and usage over extraction and loading
- Real-time data processing over high-volume batch processing at a later date
- Distributed data product ownership over central data platform architecture
Currently, data technology often drives architectural decisions. A data mesh reverses this flow, putting domain data products in the center so that they drive technology decisions instead.
What is the difference between a data mesh and a data lake?
A data lake is a repository where you can store all your structured and unstructured data without any pre-processing and at any scale. In centralized data platforms, the data lake is the core technology for storing data from all possible sources.
A data mesh is a data management paradigm that uses data lakes differently. A data lake is no longer the centerpiece of the whole architecture. Instead, you can use it to implement data products or as a part of the self-serve infrastructure.
What is the difference between data mesh and data fabric?
A data fabric is another modern architecture that uses machine learning and automation for end-to-end integration of various cloud environments and data pipelines. You can think of it as a technology layer over your underlying infrastructure that cohesively integrates and presents data to non-technical users. For example, decision-makers use the data fabric to view all their data in one place and make connections between disparate datasets.
Both data fabric and data mesh have similar goals—unified and effective data management. For instance, let's say you have a central data lake and use AWS services for data ingestion. At the same time, you have legacy infrastructure for data transformations. Your data fabric integrates both systems and presents a unified view without changing the existing pipeline.
A data fabric thus uses technology to work with your existing infrastructure. On the other hand, a data mesh implementation requires you to change the underlying infrastructure itself. You have to change your data management's push-and-ingest model to a serve-and-pull model across your business domains.
How can AWS support your data mesh architectures?
Modern Data Architecture on AWS lists several services you can use to implement data mesh and other modern data architectures in your organization. You can rapidly build data products and data mesh infrastructure at a low cost without compromising performance.
Here are examples of AWS services you can use:
- Use AWS Lake Formation to build a data mesh pattern at scale with tag-based access control (see the sketch after this list)
- Use AWS Data Exchange to integrate third-party data into your data mesh
- Use AWS Glue for sharing, hosting, and cataloging data products
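As a hedged illustration of the Lake Formation tag-based approach, the sketch below defines an LF-tag, tags a data product table, and grants access by tag expression using boto3. The tag key, database, table, and IAM role ARN are hypothetical examples; adapt them to your own resources and governance model.

```python
# Minimal sketch of Lake Formation tag-based access control for a data mesh.
# The LF-tag, database, table, and IAM role ARN are hypothetical examples.
import boto3

lf = boto3.client("lakeformation")

# The central governance team defines an LF-tag once.
lf.create_lf_tag(TagKey="domain", TagValues=["clothing", "orders"])

# A domain team tags its data product table.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "clothing_domain", "Name": "clothing_products"}},
    LFTags=[{"TagKey": "domain", "TagValues": ["clothing"]}],
)

# Consumers receive access by tag expression instead of table-by-table grants.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "domain", "TagValues": ["clothing"]}],
        }
    },
    Permissions=["SELECT", "DESCRIBE"],
)
```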
Get started with your data mesh on AWS by creating a free account today.