Getting Your Data Frequency On
Data profiling is usually one of the first tasks performed on a data quality project. This is especially true when data is made available before business requirements have been documented and before subject matter experts are available to discuss usage, relevance, standards and metrics for measuring and improving data quality. All of this is necessary to progress from profiling your data to conducting a full data quality assessment; none of it, however, is an acceptable excuse for delaying data profiling.
When is data profiling important?
- when initiating a BI project – nipping problems in the bud saves time when delivering the BI solution
- when auditing previous BI deployments
- when conducting a data quality audit
One of the fundamental features of a data profiling tool is the ability to generate statistical summaries and frequency distributions for the unique values and formats found in data source fields.
Data profiling is often performed during a data quality assessment and involves much more than reviewing the results generated by a data profiling tool; likewise, a data quality assessment involves much more than just a data profile.
Data profiling can help you perform essential analyses. Here are some of them:
- Verifying that the data matches the metadata that describes it
- Identifying missing values
- Identifying potential default values
- Identifying potentially invalid values
- Checking data formats for inconsistencies
- Preparing meaningful questions to ask subject matter experts
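Several of these checks can be sketched in a few lines of code. The function below is illustrative, not a real profiling tool's API, and the dominance threshold used to flag a potential default value is an assumption:

```python
def basic_checks(values, valid_values=None):
    """Sketch of basic profiling checks: missing, default, and invalid values."""
    total = len(values)
    missing = [v for v in values if v is None or str(v).strip() == ""]
    present = [v for v in values if v is not None and str(v).strip() != ""]
    # The most frequent value; if it dominates the column, it may be a
    # default silently filled in rather than real data (threshold assumed).
    counts = {}
    for v in present:
        counts[v] = counts.get(v, 0) + 1
    top_value, top_count = max(counts.items(), key=lambda kv: kv[1]) if counts else (None, 0)
    suspected_default = top_value if top_count > total // 2 else None
    # Values outside a known reference list are potentially invalid
    invalid = sorted({v for v in present if valid_values and v not in valid_values})
    return {"missing": len(missing), "suspected_default": suspected_default,
            "invalid": invalid}

# Example: a state-abbreviation column checked against a reference list
report = basic_checks(["NY", "NY", "NY", "NY", "XX", None, "CA"],
                      valid_values={"NY", "CA", "TX"})
```

The output of such checks is exactly the raw material for the subject-matter-expert questions mentioned above: why does one value dominate, and what does the out-of-list value mean?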
Data profiling can also help you with many other aspects of domain, structural and relational integrity, as well as with determining functional dependencies, identifying redundant storage and other important data architecture considerations.
Data profiling metrics
Data profiling tools provide several metrics, computed for each data column, that summarize the characteristics of the column's content.
The most important ones are presented below:
NULL – quantifies the number of records with a null value
Missing data can be represented in several different ways, NULL being the most common for relational database columns. In some situations, values containing only spaces (for string fields) or all zeros (for numeric fields) may also be included in the null count. A constant and consistent representation of nulls is a common standard in data quality assessment.
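As described above, several representations may count as absent. A minimal sketch of such a null-like test follows; the exact rule set is an assumption and should be aligned with your own null-representation standard:

```python
def is_null_like(value):
    """Treat NULL, space-only strings, and all-zero numeric strings as absent."""
    if value is None:                  # relational NULL
        return True
    if isinstance(value, str):
        stripped = value.strip()
        if stripped == "":             # only spaces (string fields)
            return True
        if stripped.isdigit() and int(stripped) == 0:  # all zeros (numeric fields)
            return True
    return False

null_count = sum(map(is_null_like, [None, "   ", "000", "42", "Alice"]))
```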
Missing – quantifies the number of records with a missing value (i.e. absence of non-NULL data, e.g. character spaces)
Actual – quantifies the number of records with an actual value, identifying populated records as opposed to NULL (missing) records.
Completeness – percentage calculated as Actual divided by the total number of records.
Cardinality – count of the number of distinct actual values
Cardinality can play a major role in deciding whether to review values or formats, since it is much easier to review all values when there are not very many of them. For fields with high cardinality, the review can instead be limited to the most frequently occurring values.
Uniqueness – percentage calculated as Cardinality divided by total number of records
Distinctiveness – percentage calculated as Cardinality divided by Actual
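The metric definitions above translate almost directly into code. A sketch, assuming NULL is the only representation of absence (function and key names are illustrative):

```python
def column_metrics(values):
    """Compute the profiling metrics defined above for one column."""
    total = len(values)
    actual = sum(1 for v in values if v is not None)       # populated records
    cardinality = len({v for v in values if v is not None})  # distinct actual values
    return {
        "total": total,
        "null": total - actual,
        "actual": actual,
        "completeness": actual / total,            # Actual / total records
        "cardinality": cardinality,
        "uniqueness": cardinality / total,         # Cardinality / total records
        "distinctiveness": cardinality / actual if actual else 0.0,  # Cardinality / Actual
    }

m = column_metrics(["A", "B", "B", None])
```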
Distinctiveness can be useful in assessing the potential for duplicate records. For example, a tax identification field may be less than 100% complete (i.e. not every record has one) and therefore also less than 100% unique (i.e. it cannot be considered a potential unique primary key because it cannot be used to uniquely identify every record). If the tax identification field is also less than 100% distinct (i.e. some distinct actual values appear on more than one record), then this could indicate the presence of potential duplicate records.
Completeness and uniqueness are particularly useful in evaluating potential key fields and especially a unique primary key, which should be both 100% complete and 100% unique. Mandatory non-key fields can often be 100% complete, but low cardinality could indicate the presence of potential default values.
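Under this rule, candidate-key screening can be automated. A minimal sketch (function name and sample values are illustrative):

```python
def is_candidate_key(values):
    """A candidate unique primary key must be 100% complete and 100% unique."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    is_complete = len(non_null) == total       # 100% complete: no NULLs
    is_unique = len(set(non_null)) == total    # 100% unique: no repeated values
    return is_complete and is_unique

ok = is_candidate_key([101, 102, 103])    # complete and unique
bad = is_candidate_key([101, 102, 102])   # a duplicate value disqualifies it
```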
Data profiling tools can generate, in addition to the aforementioned metrics, many other useful summary statistics for each field, including: minimum/maximum values, minimum/maximum field sizes, and number of data types (based on direct analysis of values, not metadata).
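A sketch of how such per-field summary statistics can be derived directly from the values rather than from metadata; the type inference shown is deliberately simplistic and only an illustration:

```python
def summary_stats(values):
    """Min/max values, min/max field sizes, and data types inferred from values."""
    present = [v for v in values if v is not None]
    sizes = [len(v) for v in present]
    types = set()
    for v in present:
        # Infer the type from the value itself, not from declared metadata
        if v.lstrip("-").isdigit():
            types.add("integer")
        else:
            types.add("string")
    return {"min": min(present), "max": max(present),
            "min_size": min(sizes), "max_size": max(sizes),
            "types": types}

stats = summary_stats(["42", "apple", "7", None])
```

Note that min/max here compare the raw strings lexicographically; a real tool would compare typed values once the type has been inferred.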
Using histograms to represent the frequency distribution of the unique formats found in a field is also extremely useful (sometimes more useful than explicitly identifying unique values). Here are some examples of use:
- Fields with very low cardinality (i.e. indicating potential default values)
- Fields with relatively low cardinality (e.g. gender code and source code)
- Fields with a realistically low number of valid values (e.g. state abbreviation and country code)
- Fields expected to contain only one data type and/or length (e.g. an integer surrogate key or a ZIP+4 add-on code)
- Fields with a limited number of valid formats (e.g. telephone number and date of birth)
- Fields with free-form values and high cardinality (e.g. customer name and postal address)
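One common way to build such a format frequency distribution is to reduce each value to a mask and count the masks. In the sketch below, the mask alphabet (9 for digits, A for letters) is an assumed convention, not a universal standard:

```python
import re
from collections import Counter

def format_mask(value):
    """Reduce a value to its format: digits -> 9, letters -> A, rest kept as-is."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

# Example: a telephone-number field with inconsistent formats
phones = ["555-1234", "555-9876", "5551234", "(555) 123-4567"]
histogram = Counter(format_mask(p) for p in phones)
```

A field expected to hold a single format would produce a histogram with one dominant mask; every additional mask is a question to raise with the subject matter experts.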
Some fields can also be alternatively parsed using partial values (e.g. year of birth extracted from date of birth) or a combination of values and formats (e.g. account numbers expected to have a valid alpha prefix followed by all numbers).
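Both techniques can be sketched briefly. The account-number pattern below is the one described above (an alpha prefix followed by all numbers), while the field names and sample values are illustrative:

```python
import re
from collections import Counter

# Partial value: profile only the year portion of a date-of-birth field
birth_dates = ["1984-03-12", "1984-11-30", "1990-07-04"]
year_freq = Counter(d[:4] for d in birth_dates)

# Value/format combination: an alpha prefix followed by all numbers
account_pattern = re.compile(r"^[A-Za-z]+\d+$")
accounts = ["AB12345", "CD67890", "12345"]
valid_accounts = [a for a in accounts if account_pattern.match(a)]
```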
Free-form fields (e.g. personal name) are often easier to analyze as formats constructed by parsing and classifying the individual values within the field (e.g. salutation, first name, surname, title).
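A toy sketch of this approach for personal names, assuming a small lookup table of salutations; real profiling tools use far richer classification rules:

```python
TITLES = {"mr", "mrs", "ms", "dr"}  # illustrative salutation lookup table

def name_format(full_name):
    """Classify each token of a free-form name into a coarse format string."""
    parts = []
    tokens = full_name.replace(".", "").split()
    for i, token in enumerate(tokens):
        if token.lower() in TITLES:
            parts.append("Title")
        elif i == len(tokens) - 1:
            parts.append("Family")   # assume the last token is the family name
        else:
            parts.append("Given")
    return " ".join(parts)

fmt = name_format("Dr. Jane Smith")
```

Counting these constructed formats across a name field quickly reveals which records deviate from the expected structure.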
Understanding your data is essential for effective use and improving its quality. To achieve these goals, there is simply no substitute for data analysis.
A data profiling tool can help you automate some of the difficult work required to begin this analysis. However, it is important to note that the analysis itself cannot be automated all the way through – you will need to review the statistical summary and frequency distributions generated by the data profiling tool and, more importantly, translate the analysis into meaningful reports and questions to share with the rest of the project team. Data profiling done well is a highly interactive and iterative process.
Data profiling is a critical success factor when implementing BI solutions and, in the absence of effective data profiling tools, it can be time-consuming.
If the data profiling step is skipped, the reports built on that data are very likely to contain significant gaps, which can become the basis for wrong decisions, a scenario that should certainly be avoided.
Based on these needs and requirements, QQinfo has built a tool-set to take on the bulk of the data profiling effort – QQdata.profiler™. You will discover in QQdata.profiler™ a tool ready to scan data and extract comprehensive, complex and relevant profiling on huge datasets, leaving only the last step of profiling analysis to the human factor.