2018-03-15

How to Import and Analyze Common Types of File Data Sources

How to Import and Analyze Common Types of File Data Sources


In this blog post, I provide a worked example demonstrating how to import and analyze different types of file-based data sources.

File-based data sources are one of the most common means of storing data. File-based sources have various advantages compared to other methods of storing data such as their tendency to be lightweight, easily modifiable, and highly portable.

This blog post will take a look at how to work with and analyze data from three common types of file data sources.

The file-based data source types this post will focus on are:

  1. Delimited Files
  2. Excel Workbooks
  3. XML Documents

In the field of data analytics, it is essential to have a fundamental grasp of these three data sources. They appear almost every day in practice and have application in just about every functioning business.

The first two data sources, Delimited Files and Excel Workbooks, are typically used to store highly structured tabular data. That is - data organized in rows and columns.

XML Documents, on the other hand, are often used to store more complex data structures. XML takes a tree-based approach to the representation of data and allows for the persistence of nested hierarchical information.

XML often is used as a data-exchange and object serialization structure for software applications. Typically, it presents many challenges for data analysts when trying to convert from the tree-based structure to tabular datasets for analysis.

This is because a single XML document can be very complicated and has no real constraints on the number of different layers or branches that can nest inside it.

Despite the fact that XML is traditionally challenging to work with, it is very pervasive in business and the field of data analytics.

The reason for the ubiquity of XML is that businesses tend to rely on software - and software tends to rely on XML and thus as a result of this transitive relationship all types of business-critical data gets stuck in XML.

Flow eliminates the complexity of analyzing this type of data by providing an integration interface which can normalize any XML - no matter how complicated - into a set of correctly structured tables for analysis.

In the video below I provide multiple examples of how to consume these three different types of data sources as well as how to effortlessly overcome the challenges of loading and analyzing XML.

After I show how to import these three disparate data source types into generic data collections, I demonstrate how to build a hypercube and perform some basic analytics on the data.

Check out the worked example here: