Introduction & Overview
Measuring Data Quality
Data quality measures how accurately a set of data point values describe the properties of real-world things or events. There are six dimensions used to measure data quality:
- Completeness - What percentage of data point values are null or empty?
- Timeliness - Did a data point value occur within a specified period?
- Uniqueness - Does a data point value's occurrence match the number of times it occurs in the real world?
- Consistency - Do differences exist between two or more data point values that describe the same real-world object or event?
- Accuracy - Do data point values represent the properties of real-world objects or events?
- Validity - Does the data conform to a specified format, a required range of values, or particular data type?
Data quality assessment is the processes and activities used to measure and tabulate applicable data quality dimensions to determine whether or not the data meets the requirements for a particular business need. A data quality assessment might involve these steps and activities:
- Based upon the business need, determine the required level of data quality.
- Select data points for evaluation and the data quality dimensions to apply.
- Define data quality rules by assigning a set of required values to each data quality dimension
- Evaluate each item of data and either accept or reject it.
- Divide the data into two sets: accepted and rejected.
- If appropriate, make corrections to the rejected data sets and reevaluate.
Data quality assessments are costly and time-consuming to design, implement, and monitor. For example, consider a relatively small data set with just 500 records, each containing six data points. Applying the six data quality dimensions to each value in the data set requires the tabulation and evaluation of 18,000 individual measurements. On top of this, we need to extract data from an ever growing number of data sources, transform and enrich it, then promptly provide it to decision makers.
Why Data Quality Assessments are Expensive and Time-Consuming
For decades companies have processed data using a schema-on-write approach. First, you identify the data; then you define a schema for that data. This data schema details the data points to collect, their data type, format, and other attributes. Next, you access the data and either store it for later retrieval or load directly into an application. Whether you save it or load it, the database or application demands and only understands the agreed upon schema. Schema-on-write is the cornerstone of modern databases, object oriented development, and application programming interfaces (APIs). It is required for many types of applications, especially those that involve processing transactional data.
But schema-on-write is inadequate for the data quality needs of contemporary analytics and business intelligence applications that are required to process an ever growing volume of diverse data sources. It forces business analysts and developers to anticipate the needs of decision makers in advance then define a schema that mostly, but not entirely, meets their need. Since this is almost always impossible, data scientists and analysts spend much of their time doing low-level data munging.
Enter schema-on-read, a set of technologies developed to load data "as-is" then allow users to apply a schema based on their needs. Schema-on-read includes technologies like NoSQL databases, i.e. MongoDB or Redis, and distributed file systems such as HDFS. These have certainly been a welcome advancement, but they are not a panacea. Often, schema-on-read technologies only push the problem further out to the end user of data and thereby incur a whole new set of issues and costs. Also, schema-read-does does little to redress data quality issues.
Generic Data - The Schema-In-Between
The issue is not the way data is written or read. Consider that each of us can process a constant stream of data without thinking about its underlying structure because we view it generically. We don't need to know if "one" refers to a person or thing or the lowest cardinal number. We automatically interpret the meaning of data based on the context or required action. Flow's generic data technology applies a similar approach. It focuses on the processing context, required operations, or expected results to dynamically select the appropriate schema.
Generic data allows Flow to become the schema-in-between diverse data sources or destinations and applications. If a data source provides data of a particular type or format, but the context, action, or result requires a different data type, Flow dynamically retypes the data as needed. This generic data approach neatly separates Flow's interfaces and processing logic from schema-on-write data sources. It also makes Flow a powerful complement to schema-on-read data sources such as MongoDB.
Data Quality and Flow
Flow is not a data quality application per se. However, Flow provides many lower level functions and operations that can be combined and configured to perform nearly any data quality task. Also, Flow incorporates many features designed specifically for data quality which in turn means more accurate analytics, higher quality reports, and dashboards with reliable information.
Of course, I believe the best way to show something is actually to do it (those who can't do it, market it). So using some sample name and address data, I'll walk through building a flow app that performs a basic data quality assessment.
A Basic Data Quality Assessment Flow
For most businesses, high-quality name and address data are essential for success. In this example, I'll build a Flow that performs a data quality assessment of some fictional name and address data. I will start by showing how to perform some initial data quality actions. First, I will demonstrate adding an action to identify and remove null or empty data elements. Then, I will present how to profile a data set. Next, I'll walk through how to configure and evaluate a set of data quality rules. Finally, I will show how to build a clean data set.
If you wish to try this Flow yourself, I have made the sample address data and the Flow application available. Instructions on how to access them are here.
Here are the basic steps I'll cover below.
- 1 Add the "Data Quality Assessment" Flow
- 2 Load the sample name and address data
- 4 Add an action to remove blank data items
- 5 Add an action to profile the data set
- 6 Create and evaluate a set of data quality items
- 7 Generate a set of clean name and address data
1. Add a New Workflow
To get started, we'll need to add a new Flow, so click the Add Flow button in the upper left toolbar. The Add New Workflow dialog will display, as shown on the right.
Enter a name, in this case, "Data Quality Check," then click OK. The new Flow will appear in the list on the left-side of your screen.
To open the Flow for editing, simply double click its name. Alternatively, you can right-click the name and then select Open from the context menu; its name will appear as shown to the right.
2. Load Data Action
A Flow always starts with a load data action. So click on the Actions menu then select the Load Dataset menu item. The Load Dataset Dialog will appear as shown to the right.
Select the "Addresses" data set from the drop-down list then click the OK button to add the action.
Note: If you the Addresses data is not in the list, you may need to add it to your account. Click here for more information.
After you click OK, the Load Data action will appear in the Action Editor.
To run this action, click the Run button in the Action Editor toolbar, then click Yes to the Run All prompt. The Flow run-time engine will execute the action and load the sample data which appears in the Working Data tab as shown.
3. Remove Blank Data Action
The first thing we want to do is remove any records containing blank or null values in required fields. In this case, we'll define "blank data" as any record where the telephone, address 1, first name, last name, and email fields are all null or empty. There are several ways to do this, but the easiest is the Remove Blank Data action. This action allows you to designate a set of fields to test for empty values. You can define blank to mean any one field is null or empty, or require all of them to be null or empty.
Add Remove Blank Data Action
Click on the Actions menu then select the Data Points => Remove Blank Values menu item. The Remove Blank Values dialog will appear. Follow the steps detailed below to add the Remove Blank Values action.
- Select the "Addresses" dataset from the Working Data drop-down.
- Check the "All selected data points must be blank or null" box.
- Select the Telephone data point then click the green plus button. Do the same thing for Address 1, First Name, Last Name, and Email.
- Click OK
After you click OK, the Remove Blank Data action is added to the Flow.
Rerun the actions by clicking the Run button in the top toolbar. When the Flow runs, it will remove any records meeting the null or blank criteria then it creates a new data set containing those records.
4. Profile Blank Data Action
When performing a data quality assessment, it is often useful to know what percentage of data points contain null or empty values. Once again, Flow provides several ways to do this, but the quickest and easiest is to use the Profile Blank Data action. This action counts the total number of null or blank values for each data point then computes a percentage for each using the count divided by the total number of data points.
Add Profile Blank Data Action
Click on the Actions menu then select the Summary Functions => Profile Blank Data menu item. The Describe Null and Blank Values dialog will appear. Follow the steps detailed below to add the Profile Blank Data action.
- Select "Addresses" from the Collection Name drop down list.
- Enter a result name in the Result Name text box, in this case, "Addresses Blank Data Profile."
- Click OK
After you click OK, the Profile Blank Data action is added to the Flow.
Rerun the actions by clicking the Run button in the top toolbar. The Profile Blank Data action will create a dataset called "Blank Data Profile" detailing the count of blank or null values and their percent for each data point.
5. Data Quality Evaluation Action
The "Remove Blank Data" and "Profile Blank Data" actions are an effective way to check your data for null or blank values. However, we need a fast and efficient way to apply the required combination of data quality dimensions to one or more data points.
Flow provides a Data Quality Check action which allows us to configure text, numeric, or date rules then apply them to individual data point values. For example, we can define a rule a rule to check each First Name value as follows:
- Require a value (completeness)
- Only allow alpha characters [a-zA-Z] (accuracy and validity)
- Require a minimum length of 2 and maximum length of 100 (validity)
- Check the value using a first name heuristic (validity, accuracy, and consistency)
Add a Data Quality Check Action
- Select Addresses from the Working Data drop-down
- Select First Name from Data Point drop-down
- Click the green plus button to open the Add Data Quality Rule dialog
- Follow the steps outlined below to add the data quality rule for First Name
Add a Data Quality Rule for "First Name"
- Select the Text / String Rules under Configure Rules
- Select Require Value in the Blank Values box
- Choose Alpa [a-zA-Z] in the Characters Allowed box
- Enter a minimum of 2 and maximum of 100 (this is probably too relaxed)
- Click OK to add the rule.
As shown above, there is a total of eight data quality rules that this action will evaluate when it runs. For the sake of brevity, I won't cover how to configure each these. If you are interested, this Flow is available in the Flow portal. See the Resources section below for more information.
Using only four actions, we've loaded our address data set, removed any empty records, created a data profile, and defined a set of data quality rules. Click Run to execute all the actions again and evaluate the data quality rules.
When we execute the Flow this time, the Data Quality Check will evaluate each data point then create a named data set containing the evaluation results. In this case, eighteen values failed one or more rules. The result data set includes a DQID (data quality id) which links each result to a record in the original data set. It also shows the rules applied to each data point along with which checks it failed.
Data Quality Rules Evaluation Result
Let's take a closer look at the "Date Entered" rule. The "Rules Applied" column for the "Date Entered" data point show application of "Data Type," "Minimum Date," and "Maximum Date" rules. The minimum date specified is 01/01/2017, and the maximum date is 06/30/2017. As you can see, three data point values failed the "Minimum Date" test. Note that although "Date Entered" has a string data type, Flow correctly uses the data type required by the action context.
6. Build a Set of Clean Addresses
To get a clean set of address records, we need to remove the all the Rejected Addresses from the Addresses data set. Flow provides a Denormalize action that joins two datasets using one or more common keys. The Denormalize action has three join types:
- Keep Matched and Unmatched
- Remove Matched
- Remove Unmatched
We will add a Denormalize action using DQID as the linking key and specify the join type as Remove Matched. This action will remove the rejected addresses from Addresses data set leaving the clean address records.
Add a Denormalize Action
From the Actions menu, select Working Data then Denormalize, the Denormalize dialog will display. Follow the steps shown below to add the Denormalize action.
- Select Addresses from the Parent Set drop-down
- Select Rejected Addresses from the Child Set drop-down
- Specify the Join Type as Remove Matched
- Select DQID as both the parent and child keys then click the green plus button
- Check the Save Result to New Data Set box then enter Accepted Addresses as the result set name. Click the OK button to add the action.
Click the Run button to execute the Flow.
When the Flow runs this time, it executes the Denormalize action which joins the Addresses and Rejected Addresses, removes the rejected addresses, and creates a new data set called "Accepted Addresses." The denormalize action also merges the data points contained in each data set so the Accepted Addresses data additional data points that we don't need. I will discuss how to remove these below.
7. Clean Up the Accepted Addresses Data
We want to remove the additional data points added to the Accepted Addresses data set by the denormalize action. The Remove Data Points action does just what we need. To add this action click Actions, select Data Points then Remove Data Points. The Remove Data Points dialog will load as shown to the right.
Finally, we perform some additional cleanup by applying proper case string expressions to the First Name, Last Name, City, and Address 1 data points.
Flow Actions and Final Results
In this article, I provide an introduction to measuring and evaluating data quality using Flow. I briefly discuss data quality dimensions and data quality assessment. Then I examine how a schema-on-write approach increases the time and cost required to assess data quality along with a brief discussion of schema-on-read technology. I then introduce Flow's "Generic Data" technology as a solution to the deficiencies of schema-on-write and schema-on-read for data quality. Finally, I provide a hands-on working example of doing data quality in Flow using some sample name and address data.
Get the Sample Data
To add the Addresses data to you account, log into Flow then click on the down arrow button in the top menu bar. From the drop-down menu, click on the icon to open the Add Sample Dataset dialog. Click the ADD link next to the "Addresses" entry. The dataset will be added to your account.