Data quality, building data trust and identifying bias are critical for organizations to confidently make decisions based on the data they collect.
Organizations can harness great benefits from data, but only by attending to data quality, building trust and avoiding bias can they turn that data into sound decisions and profit.
At a fundamental level, data trust is when an enterprise has confidence that the data it is using is accurate, usable, comprehensive, and relevant to its intended purposes. On a bigger-picture level, data trust has to do with context, ethics, and biases.
The narrow definition looks at how data is used in organizations to drive their mission. That narrow definition of data trust often gets support from tools that assess the quality of the data or automatically monitor it across key metrics; once the boxes are ticked, the organization can trust the data more. But you must look not just at the specifics of data quality but at who the data is for and who is involved in designing the systems, assessing the systems and using the data.
But that definition of data trust is limited because data is part of a broader context. Companies should consider other factors when evaluating data trust beyond the basic operational ones.
The bigger picture is harder to quantify and operationalize, but forgetting or ignoring it can lead to biases and failures.
The cost of bad data
Organizations’ bottom line reflects the importance of data quality. Poor data quality costs organizations, on average, $13 million a year, according to a Gartner report in July 2021. It’s not just the immediate effect on revenue that’s at stake. Poor data quality increases the complexity of data ecosystems and leads to poor decision-making.
There’s a rule of thumb called the “1-10-100” rule of data that dates to 1992: a dollar spent verifying data at the outset becomes a $10 cost to correct the bad data later, and a $100 cost to the business if the error is never fixed.
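The 1-10-100 rule is simple arithmetic, which makes it easy to turn into a back-of-the-envelope estimate. The sketch below assumes a hypothetical batch of 5,000 bad records; the per-record dollar figures come straight from the rule of thumb.

```python
# The 1-10-100 rule as arithmetic: cost per bad record at each stage.
# The record count is a hypothetical example; the rule is only a heuristic.
PREVENT, CORRECT, FAIL = 1, 10, 100  # dollars per record

bad_records = 5_000
print(f"verify up front: ${bad_records * PREVENT:,}")  # $5,000
print(f"correct later:   ${bad_records * CORRECT:,}")  # $50,000
print(f"never fixed:     ${bad_records * FAIL:,}")     # $500,000
```

Each stage multiplies the cost by ten, which is why the rule argues for spending on verification at the point of collection rather than remediation downstream.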
Eighty-two percent of senior data executives said data quality concerns represent a barrier to data integration projects, and 80% find it challenging to consistently enrich data with proper context at scale, according to a survey by Corinium Intelligence in June 2021.
Trust starts with the collection process. One mistake companies make is assuming data is good and safe just because it matches what the company wants to track or measure. It’s all about understanding who provided the data, where it came from, why it was collected and how it was collected. Diversification also helps: a single source of truth is a single point of failure. That is a big task, but it is essential to prevent an individual’s preferences from biasing the data the organization relies on.
It’s also important to follow the chain of custody of the data after collecting it to ensure that it’s not tampered with later. In addition, data may change over time, so quality control processes must be ongoing.
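One simple way to make chain-of-custody checks concrete is to fingerprint a data batch when it is received and re-verify that fingerprint before the data is used. The sketch below is illustrative, not a prescribed implementation; the sample payload and the choice of SHA-256 are assumptions.

```python
# Minimal sketch of a chain-of-custody check: hash a data batch at
# ingestion, then verify the hash again before downstream use.
import hashlib

def fingerprint(raw_bytes: bytes) -> str:
    """Compute a SHA-256 digest of a raw data payload."""
    return hashlib.sha256(raw_bytes).hexdigest()

# Record the digest when the batch is received...
received = b"patient_id,gender\n101,1\n102,2\n"
digest_at_ingestion = fingerprint(received)

# ...and verify it again before the data is used downstream.
assert fingerprint(received) == digest_at_ingestion      # unchanged: OK

tampered = received.replace(b"102,2", b"102,1")
assert fingerprint(tampered) != digest_at_ingestion      # tampering detected
```

In practice the recorded digests would live in a separate, append-only store so that whoever alters the data cannot also alter the fingerprint.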
For example, Abe Gong, CEO of data collaboration company Superconductive, once built an algorithm to predict health outcomes. One critical variable was gender, coded as 1 for male and 2 for female. The data came from a healthcare company. Then a new batch of data arrived using 1, 2, 4 and 9.
The reason? People were now able to select “nonbinary” or “prefer not to say.” The schema expected only ones and twos, so the algorithm would have treated a person with code 9 as nine times more female, with the associated health risks multiplied as well.
The model would have made predictions about disease and hospitalization risk that made absolutely no sense. Fortunately, the company had tests in place to catch the problem and update the algorithms for the new data.
In Superconductive’s open-source library, these checks are called data contracts or checkpoints. As new data comes in, a check raises an alert that the system was expecting only ones and twos, a heads-up that something has fundamentally changed in the data.
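The core of such a check is a value-set assertion on incoming data. The sketch below illustrates the idea in plain Python; it is not the actual Superconductive API, and the record format is a hypothetical simplification.

```python
# Minimal sketch of a value-set check in the spirit of a data contract.
# Illustrative only; not the actual Superconductive/Great Expectations API.

EXPECTED_GENDER_CODES = {1, 2}  # what the original schema assumed

def check_gender_codes(records):
    """Return the set of unexpected gender codes found in a new batch."""
    seen = {r["gender"] for r in records}
    return seen - EXPECTED_GENDER_CODES

# A new batch arrives with codes the schema never anticipated.
new_batch = [{"gender": 1}, {"gender": 2}, {"gender": 4}, {"gender": 9}]
unexpected = check_gender_codes(new_batch)
if unexpected:
    print(f"ALERT: unexpected gender codes {sorted(unexpected)}; "
          "something has changed upstream")
```

Running the check on the batch above flags codes 4 and 9 before they can silently distort the model’s predictions.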
Identifying biased data
It’s too simplistic to say that some data contains bias and some doesn’t.
There are no unbiased data stores. In truth, it’s a spectrum.
The best approach is to identify bias and then work to correct it.
There are many techniques that can be used to mitigate that bias. Many of these techniques are simple tweaks to sampling and representation, but in practice it’s important to remember that data can’t become unbiased in a vacuum.
It’s not enough to simply say “remove the bias from the data.” Companies must explicitly examine differential outcomes for protected classes, and may even need to look for new sources of data outside the ones that have traditionally been considered.
Other techniques companies can use to reduce bias include separating the people building the models from the fairness committee. Companies can also make sure that developers can’t see sensitive attributes so that they don’t accidentally use that data in their models. As with data quality checks, bias checks must also be continual.
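One common way to make a recurring check on differential outcomes concrete is to compare favorable-outcome rates across groups. The sketch below computes a disparate impact ratio; the 0.8 threshold follows the widely used “four-fifths” convention, and the data and group labels are hypothetical.

```python
# Minimal sketch of a recurring bias check: compare favorable-outcome
# rates across groups and flag large gaps. Data is hypothetical.
from collections import defaultdict

def outcome_rates(records):
    """Favorable-outcome rate per group from (group, outcome) pairs."""
    totals, favorable = defaultdict(int), defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        favorable[group] += outcome
    return {g: favorable[g] / totals[g] for g in totals}

def disparate_impact(records):
    """Ratio of the lowest group rate to the highest; < 0.8 is a red flag."""
    rates = outcome_rates(records)
    return min(rates.values()) / max(rates.values())

sample = [("A", 1), ("A", 1), ("A", 0), ("A", 1),   # group A: 75% favorable
          ("B", 1), ("B", 0), ("B", 0), ("B", 0)]   # group B: 25% favorable
ratio = disparate_impact(sample)
print(f"disparate impact ratio = {ratio:.2f}")
```

Like the data quality checks above, this kind of measurement only helps if it runs continually, on every new batch of outcomes, rather than once at model launch.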
5 key steps to optimize data operations
According to an Informatica-sponsored IDC study in December 2021, organizations optimizing their data operations have taken the following steps:
1. Acknowledge the problem, understand what improvements are required and commit to continuous improvement
2. Reduce technical debt by standardizing data management functions and adopting a complete enterprise data architecture
3. Allow self-serve access to data
4. Operationalize AI to automate functions, increase innovation and generate business value
5. Migrate data to the cloud
How to build data trust
One of the biggest trends this year when it comes to data is the move to data fabrics. This approach helps break down data silos and uses advanced analytics to optimize the data integration process and create a single, compliant view of data.
Data fabrics can reduce data management efforts by up to 70%. Gartner recommends using technology such as artificial intelligence to reduce human errors and decrease costs.
Seventy-nine percent of organizations have more than 100 data sources — and 30% have more than 1,000, according to a December 2021 IDC survey of global chief data officers. Meanwhile, most organizations haven’t standardized their data quality function and nearly two-thirds haven’t standardized data governance and privacy.
Organizations that optimize their data see numerous benefits. Operational efficiency was 117% higher, customer retention was 44% higher, profits were 36% higher and time to market was 33% faster, according to the IDC survey.
Article retrieval source: www.techtarget.com/searchdatamanagement.
For information about Qlik™, please visit this site: qlik.com.
For specific and specialized solutions from QQinfo, please visit this page: QQsolutions.
In order to be in touch with the latest news in the field, unique solutions explained, but also with our personal perspectives regarding the world of management, data and analytics, we recommend the QQblog!