In a prior article, I described the motivation for Intuit’s data mesh strategy. Intuit needs data-driven systems to enable more intelligent product experiences and the development tools and processes to enable more Intuit teams to create them more easily. The data mesh we built at Intuit has helped us to do that.
Since publishing the first article, we’ve deployed a data mesh in a small corner of Intuit’s large data estate. (To better understand some of the data mesh tools and experiences, see them in action in a talk by my colleague, Suresh Raman).
The solution we’ve created has delivered the following benefits:
- A 26% productivity improvement as measured by the time it takes for Intuit teams to discover, access, and explore data for a new project.
- A significant security improvement, as measured by the principle-of-least-access-privilege (PLAP) score (apologies for the lack of specifics, but Intuit’s security metrics are generally not publically disclosed).
- A 44% decrease in LLM hallucinations in our internal developer-facing chatbots.
NOTE — These improvements were determined by measuring outcome metrics (productivity, security, LLM hallucinations) in an environment where data, systems, and people leveraged the data mesh platform and comparing them to the same measures in an environment where data mesh was not used.
These numbers give us the confidence to scale from our “small corner” to the entirety of Intuit and see the same benefits, delivering a much more significant total impact for the company.
This article will detail the concepts that provide the foundation for Intuit’s data mesh implementation. The content of this article is derived from similar internal papers, presentations, and diagrams that document the target state architecture of Intuit’s data mesh implementation and are used as the primary artifacts for generating requirements documents, user flows, experience designs, and service APIs that collectively implement Intuit’s data mesh.
I hope you might use this in the same way my teams have — as a guide for implementing the experiences, services, and processes necessary to realize the benefits of a data mesh.
In future work, and depending partly on what kind of feedback I receive from this article, I may open-source Intuit’s internal API definitions and service implementations to develop a standard approach for implementing a data mesh across the industry.
Special thanks to my partners at Intuit who have contributed to the insights and ideas that have shaped this approach: Shachar Bar David, Stephen Molloy, Yaniv Levi, Daniel Sharvit, Krithika Swaminathan, Achal Kumar, Arun Ragothaman, Narayanan Singaram, Juhi Dhingra, Rui Dai, JD Rosensweig, Adi Ohana, Allison Bellah, Samuel Knapp, Robert Mei, Timur Fayruzov, Larry Raab, Sreekanth Martha, Barry Nisly, Ron Sher, Karen Maciolek, Dunja Panic, Ashish Page, Omar Abdelmagid, Robin Oliva-Kraft, Elmer KimNii, Jainik Vora, Saikiran Thunuguntla
This section enumerates all the critical concepts and their relationships to each other. Where helpful, entity relationship diagrams are provided to more clearly illustrate what a term means and how it is used.
1 — Business Unit
A distinct part of the organization is responsible for specific functions or objectives and comprises one or more teams. Business units are also responsible for paying fees for utilizing resources or services. This financial responsibility ensures that each business unit is accountable for its consumption of resources and encourages efficient usage of the data platform and its services.
2 — Team
A group of individuals backed by a business unit, working together to achieve shared goals. This group is realized as the owning Team of an Intuit Dev Portal.
3 — Intuit Dev Portal Project
Intuit Dev Portal is an internal tool for development teams to help them create products and services.
An Intuit Dev Portal Project is the primary organizing unit in the Intuit Dev Portal. It is a collaborative workspace where team members can manage, develop, and share resources related to their data work. Typically, a project represents a specific initiative owned by a team within a business unit. The Intuit Dev Portal Project is the central point of reference to connect deployed resources to the responsible owners.
3.1 — Human Role
A role represents a group of individuals within an Intuit Dev Portal Project Team who fulfill specific job functions within the project and the owning team. Each role is associated with credentials that grant authorized access to consume data, produce data, or both.
4 — Data Stewards
Data Stewards are accountable for producing, operating, and managing their data per Intuit’s Data Stewardship guidelines. There are two types of data stewards:
4.1 — Business Data Steward
Responsible for the definition and quality expectations of data that a system produces. The “Data Steward” human role is used by the Business Data Stewards on the team.
4.2 — Operational Data Steward
Responsible for creating and deploying systems that produce data according to stated data definitions and quality expectations. Operational Data Stewards use the “Developer” human role.
5 — Domain
A logical grouping of related logical boundaries. Domains provide a standardized approach to organizing data in the Data Map.
6 — Bounded Context
A logical boundary within a specific domain or subdomain that isolates its data from other contexts. Bounded contexts also serve as the primary way to browse and discover data in the Data Map, making it easier for users to navigate and understand the relationships between different data elements.
7 — Data Model
A data model is a formal description of a related set of concepts and definitions and their relationships to each other.
7.1 — Semantic Model
A specific kind of data model that conveys the semantic meaning of data and relationships within a bounded context. It is decoupled from but related to a Data Schema that describes this data model’s representation in a specific hosting location.
These models follow domain-driven design methodologies and typically define aggregate roots, entities, value objects, etc.
7.2 — Data Schema
A formal representation of the structure of a data model when stored in a particular hosting location. The data schemas are informed by and should be consistent with the semantic model but may deviate to meet the format requirements and access patterns of the hosting location where the data is stored. The data schema may have semantic and compliance annotations that provide additional context and meaning. The data schema helps ensure that the data is appropriately managed and processed, minimizing the risk of regulatory or policy violations. Examples of data schemas include JSON-Schema, Hive DDL, etc.
8 — Data Resource
A Data Resource refers to data managed by a data persistence system. This includes but is not limited to a table in a database, a file in a file system, an s3 object in an s3 bucket, a topic in a messaging system, etc. A data resource becomes a data asset when registered and provided an IRN.
9 — Hosting Location
A system that facilitates the persistence and serving of data assets by adhering to the hosting location requirements, including enforcing a centralized access change control process. This ensures a controlled and secure data exchange with other teams, promoting efficient collaboration and data management across the company.
Hosting locations include systems like Intuit’s Customer Data Lake, Operational Data Lake, Event Bus, and Customer Data Profile Store.
9.1 — Data Protocols
The supported ways a hosting location can be interacted with to mutate, store, and retrieve data. This includes but is not limited to, Kafka protocol, GraphQL RPC protocol, HTTP+REST protocol, HTTP+S3 protocol, etc.
9.2 — Catalog
A hosting location’s catalog maintains a registry of schemas used to describe the data assets stored within it. The catalog enables users and systems to quickly understand and work with the data stored in the Hosting Location. For example, an instance of a Hive Metastore is Intuit’s data lake catalog.
9.3 — Tenant
A hosting location tenant refers to a team granted access to manage its data resources within a shared hosting location environment. Each tenant has dedicated data resources, ports, and access control policies, ensuring that their data exchange processes are isolated from those of other tenants.
The following diagram illustrates a hosting location, the team, the hosted data resource, and the data port serving the data:
10 — Data Access
Data access refers to the process and resources for granting or restricting the right to interact with specific data resources within hosting locations. It considers human and machine access, ensuring a clear and consistent approach to managing access privileges to hosting locations.
10.1 — Data Access List
A data access list is a collection of data access entries defining access control privileges for a particular data asset. It specifies the allowed human and machine credentials and their respective access privileges to the data asset.
10.2 — Data Access Entry
A data access entry is an individual record within a data access list. It enumerates a set of specific credentials and corresponding access privileges granted to actors bearing those credentials. Each entry contains information such as a credential identifier, type (e.g., intuit authentication ticket, AWS role), and level of access granted (e.g., read, write ).
10.3 — Credential
A credential is a unique identifier assigned to a data processing system or a human role within an Intuit dev portal project team. These credentials facilitate the secure and controlled data exchange between data producers and consumers.
10.4 — Project-Scoped Credential
A project-scoped credential adheres to the requirement that the credential is only used to write to data assets owned by the same project as the credential. A credential that fails to meet this requirement is misused and contributes to problematic behavior, including creating data in a complex way to govern or attribute ownership.
11 — Data Asset
A data asset formally represents a data resource that aggregates essential information such as ownership, access control list, reference to the end-to-end data lineage, and data port information such as the connection to a hive table, the connection to an event bus topic, etc.
11.1 — IRN
An Intuit Resource Name (IRN) is a unique identifier used within Intuit’s data ecosystem to identify and reference data assets. The IRN is designed to provide a standardized naming convention, making it easier to locate, manage, and track data assets across various systems and services.
11.2 — Data Port
The description of the interface provided by hosting locations enables controlled data exchange of a particular data asset owned by a specific data steward team. The data port facilitates this exchange while adhering to access policies administered and governed by the central data platform, ensuring secure and compliant data exchange across the company. This interface can serve data from data ports (such as a GraphQL endpoint or Kafka topic) or receive data from data ports (such as commands or mutation via an HTTP API).
Fig: An illustration of the data asset entity
12 — End-to-End Data Lineage
End-to-end data lineage is a directed graph representing the relationships and dependencies between data assets and development assets. It provides visibility into the flow, transformation, and origin of data throughout its lifecycle, potentially going down to the attribute level. This holistic view enables better traceability, understanding, and data management across systems and teams. The end-to-end data lineage is a graph of nodes and declared or observed edges.
12.1 — Backward Data Lineage
Backward data lineage refers to the ability to trace data’s origin, movement, and transformation across its lifecycle, from its creation to its final use in data products. This includes tracking how the data was collected, transformed, and integrated with other sources to create the final data product.
12.2 — Forward Data Lineage
Forward data lineage refers to tracing the downstream flow of data from its origin through various processes, transformations, and integrations that lead to its consumption in data-consuming systems. It allows data stewards to understand the impact of changes to a specific data port on the dependent data-consuming systems, enabling better management of data quality, governance, and regulatory compliance.
12.3 — Declared Data Lineage
Declared data lineage refers to tracing the declarations of data usage as described in data access control lists.
12.4 — Observed Data Lineage
Observed data lineage refers to tracing actual data read and write activity, as inferred from processing systems and hosting location access logs, which are the source of truth for data lineage.
The following diagram depicts a prototypical data lineage graph:
13 — Data Recipe
A data recipe formally represents deployment operations needed to set up a data processing system to produce a data product. Data recipes reference data modules. Data modules describe how to provision and connect hosting locations and processing infrastructure.
13.1 — Data Transformation Module
Modules that define how data is processed, transformed, and produced to hosting locations to meet specific business requirements. This includes but is not limited to training processing tasks or inference tasks.
13.2 — Data Movement Module
Modules that specify how data is transferred, but not transformed by processing systems, and ingested between hosting locations.
13.3 — Data Observability Module
Modules that establish monitoring, logging, and alerting mechanisms for tracking the performance, quality, and health of data processes and data assets. This includes but is not limited to data quality rules, data parity, security policy violations, etc.
These data modules work together within a data recipe to streamline and automate data processing tasks, making it easier for teams to develop a processing system declaratively without getting bogged down in infrastructure complexities.
The following diagram depicts the Data Recipe entity and its three types of data modules:
14 — Data Product
Data Products are the foundational data unit of the Data Mesh and are organized by bounded context and data domains. A data product consists of one or more data assets. It consolidates essential information such as ownership, semantic model, data schema, service level agreements (SLAs), star rating, tags, and data recipe. It is important to note that data products are only discoverable once they meet the minimum 3-star threshold for clean data as determined in the clean data maturity model. This rating ensures that the data product is sufficiently described and managed such that it will facilitate cross-team collaboration and reliable data exchange. A data product definition is required for every consumable data asset. An example of a data product could be the “Customer Account Data Product”, describing a semantic model of a “Customer”, realized as a real-time stream of customer-account-created events in an event bus topic and as rows of customer-account-created events in a data lake table.
14.1 — Name
The name attribute of a data product is a human-readable display name, enabling easy understanding and communication among data stewards and consumers about its content and purpose.
14.2 — Tag
Tags are descriptive labels or keywords associated with a data product to enhance its discoverability and facilitate easier search and identification within Data Map discovery experiences.
14.3 — Service Level Objective
A service level objective (SLO) is a measurable target that defines the expected level of performance or reliability for a specific aspect of a data product. Common SLOs include data quality, data freshness, and data availability. The SLO enables stakeholders to set clear and quantifiable targets, monitor performance against those targets, and make data-driven decisions to improve performance and reliability.
14.4 — Service Level Agreement
A service level agreement (SLA) is a contract between a data steward and its consumers that specifies the set of SLOs for the data product. It is precise enough that measuring adherence to the SLA is automatable. The SLA enables the consumer to monitor the data product’s performance against agreed-upon targets and make data-driven decisions to manage their business operations.
14.5 — Data Recipe Reference
A reference to a data recipe that represents the producing task involved in moving, transforming, and observing the data of a data product.
The following diagram illustrates the various components of the data product entity:
15 — Data Asset Kinds
Data asset kinds categorize data based on its intended use, access level, and potential consumers. The data asset kinds include:
15.1 — Internal Data Assets
Internal data assets refer to data generated, managed, and consumed within a specific team. This type of data is often used for operational activities. Access to internal data is restricted to authorized team members and is not intended for external consumption. Internal data must meet two stars on the clean data maturity model, which requires the appropriate metadata to address compliance and security concerns to which all data is subject.
15.1.3 — Temporary Data
Temporary data refers to generated data that is not meant to be long-living. It is typically generated during data exploration for internal decision-making and analysis. It is often data that does not have a regularly scheduled pipeline associated, is not monitored, and is not consumed (or should not be). Examples include data created by one-off “CREATE TABLE” queries during data exploration or one-off model experiments that generate data but are not promoted to a production model. Temporary data must meet two stars on the clean data maturity model, which requires the appropriate metadata to address compliance and security concerns to which all data is subject.
15.2 — Consumable Data
Consumable data asset is data made available for cross-team data exchange. It should be understandable, durable, and usable by various consumers, both internal and external to the team. It is measured according to the clean data maturity model and subject to the discoverability and consumption constraints described therein. Consumable data facilitates cross-team collaboration, data exchange, and integration with other systems or applications.
The following diagram depicts the data kinds and their sub-classifications:
16 — Data Registration
The workflows and processes for data producers to register data assets and promote them into data products. This includes workflows for internal and consumable data.
16.1 — Data Product Lifecycle
The lifecycle of data products includes:
16.2 — Data Product Intent
Data Products can be intended for broad consumption or internal use.
16.3 — Clean Data Governance
Data Registration uses the clean data maturity model to govern and rate data. Data governance depends on the data kind and will be the gatekeeper for what is discoverable and accessible to users.
16.3.1 — Consumable Data Threshold
Data products with the consumable intent are considered released and have a clean data maturity star rating of 3 or higher.
16.4 — Clean Data Maturity Star Rating
A quantitative metric, calculated based on the clean data maturity model, evaluates a data product’s overall quality, reliability, and usability. This star rating considers various aspects, such as data cleanliness, schema, semantic context, and adherence to service level agreements (SLAs). The clean data maturity star rating creates a clear baseline of expectations for data stewards to meet while also providing consumers with transparency about a data product’s strengths and weaknesses so that they can make informed decisions when selecting data products for consumption.
16.5 — Clean Data
Data that meets the 3-star threshold defined in Intuit’s clean data maturity model for consumable data.
17 — Data Map
User experiences for discovering consumable data products are organized by domain, subdomain, and bounded context. Note that internal data is generally not discoverable or accessible except by the team that owns it. Thus, any Data Map experiences that assist in discovering internal data must consider the profile of the person doing the searching and the set of data assets they own.
17.1 — Data Map Registry
A registry of internal and consumable data assets includes a graph of declared & inferred data lineage, internal and consumable data assets, data products, domains, subdomains, bounded contexts, and data models.
17.2 — Data Map Taxonomy
A data domain, subdomain, and bounded context define a taxonomy that organizes all the data products in the data map.
18 — Data Processing Runtime
A data processing runtime refers to any runtime that enables the execution of data processing tasks, such as data ingestion, transformation, analysis, and storage. These runtimes (e.g., spark batch processing, Kubernetes microservice, and flink runtimes) provide the tools, libraries, and resources to efficiently process and manipulate large data assets in real-time or batch mode.
19 — Customer Environment
A customer environment comprises production-grade components for teams to use to develop and test their data products. This includes the data processing runtimes and hosting locations that enable customers to deploy, test, and publish data products. The processing runtimes within the customer environment are stable and reliable, providing a standardized and scalable approach to managing the deployment and operation of data products.
20 — Experimentation Session
An experimentation session enables teams to test new ideas using data assets. This testing is generally an iterative process of analysis and refinement that produces insights into optimizing business processes, products, services, and features. It involves collecting and analyzing data, executing experiments, and drawing conclusions to inform decision-making. This session is controlled by the central data platform, ensuring compliance and proper management of data assets and processes.
21 — Data Processing System
A deployable unit of code and infrastructure that can consume and produce data assets.
21.1 — Data-Producing System
A system that generates internal or consumable data assets. Examples include, but are not limited to, microservices and ETL pipelines.
21.2 — Data-Consuming System
A data-consuming system is a component that accesses data assets. These systems can consume internal data assets produced by the same system and data products generated by other systems.
This section provides a better understanding of the interrelations between the various concepts as they are typically arranged for various use cases.
Use Case 1— A microservice publishes events to a hosting location.
This data-producing system generates event data, typically describing specific business operations that have recently occurred. The microservice then publishes these events to designated hosting locations, making them available for other systems or teams to consume. The diagram refers to an “outbox service”, which is a recipe module that implements the transactional outbox pattern for reliably sending events in a distributed, eventually consistent microservices architecture.
Use Case 2— A frontend application publishes clickstream data to hosting locations.
This data-producing system captures user intent on a web application, such as clicks, scrolls, and navigation events. The application then publishes the collected clickstream data to hosting locations, providing valuable insights into user behavior and preferences.
Use Case 3— An ETL pipeline transforms a data product to create a new data product.
This data-producing system transforms existing data products into derivations tailored to specific consumption use cases. The pipeline then moves the transformed data to hosting locations, making it available for further analysis and consumption by other teams or systems.
The following diagram illustrates the C4 model system context of the data mesh domain (at Intuit, we refer to this officially as the “L0 Data” domain, hence the references to “L0 Data” in some of the following diagrams):
Entity Relationship Diagram
The following diagram depicts the representation of the key domain concepts and their relationships. This diagram helps in understanding how the different components of the domain interact with each other. This diagram serves as a guide for further exploration and refinement of the underlying structure and interdependencies within the data mesh domain:
Lockdown hosting locations — The platform must retain administrative control of hosting locations so that data access (or lack thereof) can be used to enforce compliance with standards. As we like to say at Intuit, “if you don’t follow the rules, you can’t touch the data.”
Develop data maturity standards — Develop and document a comprehensive data maturity framework that includes schema management, data models, access policies, data quality, and observability controls. This framework will serve as a guideline for data products to adhere to various levels of standardization based on the data maturity decision.
Facilitate data registration — Design and implement the architecture to facilitate the registration of domains, bounded contexts, data models, data assets, and data products, enabling data stewards to define and document their data. A system of checks should be in place to ensure that new domains and new data products do not conflict with or duplicate existing domains and data products. We’ve found that duplication of data products is a leading driver of confusion among the general data-consuming population, which is why it has been added as a criterion in the data maturity standard.
Facilitate the transition of existing data to data assets and data products — Streamline the process for existing data producers to adopt the data maturity framework by providing clear guidance and interfaces for creating data products from their existing data assets. Simplifying this transition will encourage adherence to data maturity standards that support durable data integration and collaboration across teams.
Facilitate data lineage registration — Streamline the process of registration data lineage for each data product so that data consumers can trace the data’s origin and stewards can understand data consumption and usage.
Facilitate data recipe interfaces to accelerate data development — To support adopting and using data recipes, provide intuitive and easy-to-use interfaces for creating, managing, and deploying data recipes. These interfaces should allow data stewards, service developers, data engineers, and analysts to define and configure data transformations, movements, and observability modules with minimal effort, embracing low-code or no-code principles.
Support internal data exchange and persistence — Empower data producers to create and store secure, high-quality internal data within hosting locations while promoting efficient data exchange within the same team.
The following diagram depicts the prototypical architecture of an internal data data-producing system:
Apply the principle of least access privileges — Ensure that credentials, for either machine or human access, are granted the minimum set of permissions necessary to perform their consumption tasks. Continuously review and update user roles and permissions to maintain adherence to the PLAP. Providing data stewards with easy-to-use interfaces and processes that enable them to revoke and grant access to data consumers easily is crucial to ensuring efficient and secure data access management.
Key Performance Metrics
The following set of metrics has been developed to measure the value delivered by the data mesh and its collection of capabilities.
Clean Data Adherence Rate (CDAR)
This metric measures the extent to which data producers within the company adhere to the clean data maturity model. It is calculated as the ratio of the total number of data product connections to the total number of data connections between different teams within the company during a given period. A higher CDAR indicates better adherence to the clean data maturity model, thereby improving the overall data quality and reducing the risk of internal data consumption by humans or machines of different teams. CDAR can help identify potential areas for improvement and inform the development teams of the data platform policies and training programs for data producers.
Schema Management Efficiency (SME)
This metric describes the effectiveness of the data modeling capability in reducing the time it takes for customers to manage their schemas while complementing their existing user journeys. This metric can be measured by monitoring the time spent on schema management tasks, the number of schema modifications, and the user satisfaction rate related to schema management.
Principle of Least Privileges Score (PLAP)
This metric assesses how closely an organization’s data access policies comply with the principle of least access privilege. This principle requires that humans and machines are granted only the minimum necessary permissions to perform their required tasks, thus reducing the risk of unauthorized access and enhancing data security. PLAP evaluates an organization’s adherence to this principle by measuring the total number of privileges granted to credentials against the actual access usage for these credentials.
Hosting Location Lockdown Score ( HLLS )
This metric quantifies the extent to which data access control policies are enforced. It is calculated by comparing the number of credentials that strictly adhere to the rule of allowing data-producing systems to write data only to the data assets associated with the same project as the credential and the data-producing system against the total number of credentials that can write data to the same data assets. This score provides insights into the effectiveness of the data access policy enforcement and helps identify areas where improvements may be needed to ensure proper data read and write activities.
The diagrams below show an example of illegal data reading and writing that would reduce the hosting location's lockdown score.
Developer Velocity ( DPV )
This metric measures the speed and efficiency at which developers can register data definitions, manage access requests, and release data-producing systems. This metric is essential for tracking the overall productivity of the development team and ensuring that the architecture supports agile development processes, enabling teams to quickly adapt to changing requirements or new challenges.
Data Access Request Time (DART)
A performance metric that measures the time it takes for humans to access requested data. It is calculated as the time taken to process a data access request, which starts when the request is submitted and ends when the user receives an access decision. A lower DART value indicates faster data access and a more efficient data management process.
Clean Data Maturity Model
The clean data maturity model describes a rating system for data assets and an accompanying policy about how the data can or cannot be used.
Criteria — Data exists, but there’s no information available to describe it.
Usage policy — Not discoverable or consumable.
Criteria — An owner is identified, and information exists to understand its security and compliance requirements.
Usage policy — Not discoverable or consumable.
Criteria — Same as above, plus documentation exists to describe the meaning of the data. Data is stored in a system compliant with the hosting location standard, which tightly controls access.
Usage policy — Discoverable, but only consumable with an exception.
Criteria — Same as above, plus the owner is more active and accountable (i.e., a ‘data steward’); modeling language standards are followed; data quality monitoring ensures operational reliability; lineage describes upstream and downstream dependencies; change control process ensures no breaking changes; data is broadly is ready for broad use across a large audience of consumers.
Usage policy — Consumable
Criteria — Same as above, plus the data product follows modeling design standards, making it more easily processed by any number of systems across the company.
Usage policy — Consumable
Criteria — Same as above, plus relationships to other data products are precisely defined; the data product is unique and non-conflicting among all data products at the company.
Usage policy — Consumable
Example Data Model
The following diagram describes how a data model is used to model the “Invoicing Workflows” domain.
The “Invoicing Workflows” bounded context consists of a domain entity schema and two events: an Invoice notification event, which is used to notify on Invoice state changes, and an event-carried state transfer event, which is used to carry the entire state that changed.