Overview
From Wikipedia:
Benford's law, also called the first-digit law, is an observation about the frequency distribution of leading digits in many real-life sets of numerical data. The law states that in many naturally occurring collections of numbers, the leading significant digit is likely to be small. For example, in sets which obey the law, the number 1 appears as the most significant digit about 30% of the time, while 9 appears as the most significant digit less than 5% of the time. By contrast, if the digits were distributed uniformly, they would each occur about 11.1% of the time. Benford's law also makes predictions about the distribution of second digits, third digits, digit combinations, and so on.
Intuitively, many people assume that first digits are uniformly distributed, with each digit from 1 through 9 occurring approximately 11% of the time. The left-hand chart below shows the distribution of first digits for a randomly generated series of numbers. The chart on the right represents the expected distribution according to Benford's law.
Distribution of Random Numbers
Distribution According to Benford's Law
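To get a feel for the "random" case, here is a minimal Python sketch that draws uniformly distributed integers and tallies their first digits. The sample (100,000 six-digit integers with a fixed seed) is my own assumption for illustration and is not the data behind the chart above.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical sample: 100,000 six-digit integers drawn uniformly.
sample = [random.randint(100_000, 999_999) for _ in range(100_000)]

# Tally the leading digit of each number, then convert counts to shares.
leading = Counter(int(str(n)[0]) for n in sample)
shares = {d: leading[d] / len(sample) for d in range(1, 10)}

for d, share in shares.items():
    print(f"{d}: {share:.1%}")
```

Each digit should land close to 11.1%, which is the flat profile that Benford-conforming data does not show.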
The equation used to derive the distribution of first digits under Benford's law is
P(digit) = log10(1+1/digit)
This equation was used to generate the table shown below.
Digit | P(Digit) |
---|---|
1 | 30.1% |
2 | 17.6% |
3 | 12.5% |
4 | 9.7% |
5 | 7.9% |
6 | 6.7% |
7 | 5.8% |
8 | 5.1% |
9 | 4.6% |
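For readers who want to check the table, the equation can be evaluated in a few lines of Python. This is only a sketch; the function name `benford_expected` is hypothetical and is not part of Flow.

```python
import math

def benford_expected(digit: int) -> float:
    """Expected share of a leading digit: log10(1 + 1/digit)."""
    if not 1 <= digit <= 9:
        raise ValueError("digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)

expected = {d: benford_expected(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d} | {p:.1%}")
```

The nine probabilities sum to one, because the logarithms telescope to log10(10).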
Fraudsters who invent numbers for illicit purposes, such as faking invoices or hiding disbursements, may assume that the best way to conceal their activities is to enter randomly distributed numbers. Such numbers do not follow the distribution that genuine data typically has. This often-mistaken assumption has made Benford's law a primary tool of forensic accountants. Benford's law has broad application beyond fraud detection. I've included links to other resources at the end of this post for those who are interested.
Despite the popularity of Benford's law, there are relatively few tools available to non-technical analysts who wish to apply it. I was able to find numerous examples written in either Python or R, but these cannot be used without the necessary software (interpreters and editors) and the required coding skills. I also found a few spreadsheet templates, which were too complex for easy reuse and limited to relatively small datasets, in addition to the usual spreadsheet issues.
Given the dearth of practical tools available to analysts, I thought it might be useful to develop a basic, reusable Flow to do the following:
- Import a dataset
- Extract the first digits and count their frequency
- Calculate the percentage distribution for each digit
- Finally, compare the sample data's distribution to the expected distribution according to Benford's law
To make this example realistic, I use a sample dataset named, not surprisingly, "Benford Set." It consists of 425 observations drawn from a random sample of transactions. I will walk through the steps required to build this Flow so that you can try it yourself. I've also made it available in the public Flow library.
Build the Benford Analysis Flow
We'll start by adding a new Flow and loading the sample dataset. Then, we'll add an expression to extract the first digit from each observation. Next, I'll discuss how to add a hypercube dataset and use it to group, count, and compute the percentage frequency of each digit. Finally, we'll use the equation shown above to generate the expected frequencies according to Benford's law.
If you wish to try this Flow yourself, I have made the sample data and the Flow application available. Instructions on how to access them are here.
Here are the basic steps I'll cover below.
- 1 Add a new Flow
- 2 Load the Sample Dataset
- 3 Extract the first digit
- 4 Build Hypercube
- 5 Count Digits
- 6 Compute the Percentages for Each Digit
- 7 Slice the Hypercube to Get the Distribution Dataset
- 8 Expected Value and Difference
1. Add a New Workflow
To get started, we'll need to add a new Flow, so click the Add Flow button in the upper left toolbar. The Add New Workflow dialog will appear as shown on the right.
Enter a name, in this case, "Benford Analysis," then click OK. The new Flow will appear in the list on the left side of your screen.
To open the Flow for editing, simply double click its name. Alternatively, you can right-click the name and then select Open from the context menu; its name will appear as shown to the right.
2. Load the Sample Dataset
We start by loading our sample dataset, so add a Load Dataset action by clicking on the Actions menu, then selecting Load Dataset from the drop-down menu. The Load Dataset dialog will appear as shown to the right.
Select the "Benford Set" from the drop-down list then click the OK button to add the action.
After you click OK, the Load Data action will appear in the Action Editor.
To run this action, click the Run Button in the Action Editor toolbar, then click Yes to the Run All prompt. The Flow run-time engine will execute the action and load the sample data which appears in the Working Data tab as shown.
3. Extract the First Digit
We need to get the first digit of each observation. To do this, we'll add an expression to extract it from each data point.
Add Expression Steps
Click on the Actions menu then select the Expression Builder menu item. The Expression Builder dialog will appear. Follow the steps detailed below to add the FirstNCharacters expression.
- Select Benford Set from the Collection drop-down list
- Choose "String" as the expression type.
- Select FirstNCharacters as the operation.
- Select "Value" as the Input 1 datapoint
- Enter 1 under "Literal Value"
- Enter "Digit" in the Result text box
- Finally, click the Add Expression button, then click OK.
After you click OK, the expression will appear as the second step in the actions list.
Rerun the actions by clicking the Run Button in the Action Editor toolbar. When the Flow executes, it will extract the first digit from each Value data point, save it as Digit, then display the updated dataset in the Working Data tab.
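For comparison, the same idea can be sketched in Python. Note that this is not what FirstNCharacters does internally: the Flow step simply takes the first character of Value, while the hypothetical `first_digit` helper below also skips signs, decimal points, and leading zeros, which is an assumption on my part about how messier data should be handled.

```python
from typing import Optional

def first_digit(value) -> Optional[int]:
    """Return the first non-zero digit of value, or None if it has none."""
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    return None

# Hypothetical Value data points, not taken from the Benford Set.
values = [425, "0.0731", -912.5, 0]
digits = [first_digit(v) for v in values]
print(digits)
```

Zero has no leading significant digit, so it yields `None` and would be dropped before counting.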
4. Build Hypercube
Our sample data contains 425 observations. We need to group these by Digit then count the members of each group. We'll accomplish this by building a single dimension hypercube using Digit. Then we'll add a hypercube expression to count each group's members.
Build Hypercube Steps
Open the Build Hypercube dialog by clicking on the Actions menu, then selecting Hypercube => Build Hypercube. After the Build Hypercube dialog appears, follow the steps outlined below.
- Select "Benford Set" from the Working Data drop down list.
- Check the Digit box to select it.
- Enter "Benford Distribution Cube" in the Hypercube Name text box, then click OK to add the action
The Actions Editor will now display the Load Data, First Digit, and Hypercube actions.
Rerun the actions by clicking the Run Button. When the Flow completes, a Hypercube dataset appears in the Working Data tab.
Flow hypercubes are specialized data structures that handle the underlying complexity of organizing and managing multi-dimensional data. They also facilitate and optimize any computational operations performed on multi-dimensional datasets.
5. Count Digits
Our Benford Distribution Cube contains nine groups, one for each digit. We need to count the total members in each group. To do this, we'll add a hypercube Count expression using the Expression Builder.
Add Count Expression Steps
Open the Expression Builder dialog again by clicking on the Actions menu then selecting the Expression Builder menu item. Follow the steps detailed below to add the Count expression.
- Select Benford Distribution Cube from the Collection drop-down.
- Check the Hypercube box.
- Select the "Stat" expression type then the "Count" operation
- Select Digit from the Input 1 Datapoint drop-down list.
- Enter "Frequency" in the Result text box
- Click the Add Expression button to add the expression then click OK.
Rerun the actions by clicking the Run Button.
After the Flow runs, our cube will contain the Frequency data point with a count of observations for each group.
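Conceptually, grouping by Digit and counting each group's members is a frequency tally. A rough Python equivalent, using a made-up list of digits rather than the 425 observations in the sample, might look like this:

```python
from collections import Counter

# Hypothetical Digit values; the real dataset has 425 observations.
digits = [1, 1, 2, 3, 1, 9, 2, 1]

# Group by digit and count the members of each group.
counts = Counter(digits)

# Report every digit 1 through 9, including those that never occur.
frequency = {d: counts.get(d, 0) for d in range(1, 10)}
print(frequency)
```

Listing all nine digits explicitly keeps empty groups visible, which matters later when each group is compared with its expected share.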
6. Compute the Percentages for Each Digit
To compute a percentage for each group of digits, we'll use a Hypercube Groups Function.
Hypercube Groups Steps
From the Actions menu select Hypercube then the Hypercube Groups menu item. Follow the steps outlined below.
- Choose Member Percentage of Group from the Function drop-down list.
- Select "Benford Distribution Cube" from the Hypercube drop-down list.
- Choose the Frequency data point from the Group Datapoint drop-down list.
- Enter "Percent" in the Result Datapoint text box, then click OK to add the action.
Rerun the actions by clicking the Run Button.
After the Flow executes, the Benford Distribution Cube contains a Percent value for each digit group.
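As I read it, Member Percentage of Group divides each group's Frequency by the total across all groups. Under that assumption, and with hypothetical counts, the calculation reduces to:

```python
# Hypothetical Frequency values per digit, not the sample's real counts.
frequency = {1: 4, 2: 2, 3: 1, 9: 1}

total = sum(frequency.values())

# Each digit's share of all observations, as a fraction between 0 and 1.
percent = {d: frequency.get(d, 0) / total for d in range(1, 10)}

for d, share in percent.items():
    print(f"{d}: {share:.4f}")
```

The shares are kept as fractions rather than multiplied by 100 so they can be compared directly with the output of the Benford equation.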
7. Extract the "Benford Distribution" Dataset
To extract our distribution data from the hypercube, we'll use a filter action to select all the dimension "1" group values. This action will create a new two-dimensional dataset named Benford Distribution that contains each digit 1 through 9 along with its Frequency and Percent.
Filter Hypercube Steps
From the Actions menu, select Working Data, Filter, then Data Point Filter. The Filter dialog will display. Follow the steps outlined below to add the Filter Action.
- Select "Benford Distribution Cube" from the Filter Data drop-down
- Enter "Benford Distribution" into Match Data
- Select Dimensions from Input 1, Equal from Comparison, then enter "1" under Input 2
- Enter Percent in the Result Datapoint text box.
- Finally, click the green plus to add the filter operation then click OK
Rerun all six actions by clicking the Run Button.
When the Flow runs, it will execute the filter action which will extract the Benford Distribution dataset.
8. Expected Value and Difference
The final action will compute the expected frequency for each digit using the equation for Benford's law introduced above. It will then calculate the difference between the expected and observed frequency of first digits and, finally, round off each data point to four decimal places.
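I won't cover entering each expression here. In summary, the first three expressions build up the expected frequency:
- The first divides one by the value of Digit and stores the result as Expected.
- The second adds one to the result of the first.
- The third takes the base 10 log of the result of the second.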
The fourth expression calculates the difference between Percent and Expected and stores it as Difference. The remaining expressions simply round the results.
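The chain of expressions can be sketched in Python as follows. The observed shares are hypothetical, and taking the difference as observed minus expected is my assumption, since the sign convention is not spelled out here.

```python
import math

# Hypothetical Percent values per digit (fractions, not the real results).
observed = {1: 0.5, 2: 0.25, 3: 0.125, 9: 0.125}

rows = []
for digit in range(1, 10):
    expected = 1 / digit               # first expression: one divided by Digit
    expected = 1 + expected            # second expression: add one
    expected = math.log10(expected)    # third expression: base 10 log
    percent = observed.get(digit, 0.0)
    difference = percent - expected    # fourth expression (sign is an assumption)
    # Remaining expressions: round each data point to four decimal places.
    rows.append((digit, round(percent, 4), round(expected, 4), round(difference, 4)))

for row in rows:
    print(row)
```

Rounding comes last, after the difference is taken, so that rounding error does not compound.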
The complete Benford Analysis Flow, steps 1 through 8
Here is the final result of our Flow along with a chart showing the observed vs. expected distribution of the first digits in our sample dataset. I'll leave a discussion of the various tests we could employ to determine if the difference is statistically significant for later. I will, however, note that for many data sets, a difference greater than 0.015 is considered significant. The differences shown may be due to the relatively small size of our sample data set.
Summary
In this post, we built a reusable Flow that calculates the distribution of first digits in a sample data set. The Flow loads the sample data set, obtains the first digit from each observation, builds a hypercube and uses it to count the first digits, extracts a dataset containing the distribution, and finally computes the expected distribution and compares it to the observed distribution by taking the difference.
To use this Flow with another data set, delete the initial Load Data action and insert a new action that imports data from the required data source. Rename that dataset "Benford Set," then rename the data point you wish to test to "Value" and run the Flow.
Resources
Add the Sample Data
To add the Benford Set data to your account, log into Flow, then click on the down arrow button in the top menu bar. From the drop-down menu, click on the icon to open the Add Sample Dataset dialog. Click the ADD link next to the "Benford Set" entry. The dataset will be added to your account.