2017-10-05

How to Deduplicate Data Using Flow Analytics

How to Deduplicate a Dataset in Flow


When developing a data processing workflow, it is often crucial to identify and remove duplicate values.

Flow provides a deduplicate function that allows the workflow developer to identify and extract duplicates from a dataset using a single or compound key. By implementing the deduplicate function in a workflow, the user can isolate unique values from a dataset and develop powerful data transformation rules.

In this blog post, I will demonstrate how to configure and implement the deduplicate function.

The example we will explore focuses on a list of customer records stored as a delimited file. The file data will be loaded into Flow's working memory as generic data. We will then apply the deduplicate function to the customer records to eliminate repeated customer values.

The extracted duplicate values will be removed from the primary target dataset and stored in a separate dataset for further examination.
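For readers who want to see the underlying idea outside of Flow, here is a minimal Python sketch of key-based deduplication. It is not Flow's implementation. The function name `deduplicate`, the sample records, the column names, and the choice to keep the first occurrence of each key are all assumptions made for illustration.

```python
import csv
import io


def deduplicate(rows, key_fields):
    """Split rows into (unique, duplicates) using a single or compound key.

    Assumption: the first row seen for a given key is kept as the unique
    record, and every later row with the same key is treated as a duplicate.
    """
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        # Build the key from one field (single key) or several (compound key).
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates


# Hypothetical customer records standing in for the delimited file.
SAMPLE = """first_name,last_name,email
Ann,Lee,ann@example.com
Bob,Ray,bob@example.com
Ann,Lee,ann@example.com
Ann,Lee,ann.lee@example.com
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Compound key: a row is a duplicate only if all three fields match.
unique, duplicates = deduplicate(rows, ["first_name", "last_name", "email"])

# Single key: a row is a duplicate if the last name alone has been seen.
unique_by_name, duplicates_by_name = deduplicate(rows, ["last_name"])
```

The choice of key determines what counts as a repeat: the compound key flags only the exact repeat of Ann's record, while the single `last_name` key also flags her second email address as a duplicate.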

This technique can be applied to any dataset from any source. Once the data is deduplicated, Flow can be used to output the dataset in any format required.

If you do not have a Flow account, register here. Flow is free for personal use.

The video below shows a worked example of how to perform the deduplicate operation:
