Clean Data Essential for Reliable Data Analysis

How much confidence can you place in the conclusions of your data analysis? For regulatory compliance actions, the cost of errors can involve fines, lawsuits, and damage to the brand. When being wrong involves significant costs, it is important to consider ways in which errors in the underlying data could be undermining your results.

Many data analyses are intended to predict changes in outcomes based on changes in inputs. A set of measurements on inputs is used to make predictions about one or more measured outcomes. Varying an input leads to a predicted response in the outcome. Errors in either set of measurements – inputs or outcomes – make it harder to predict the outcome response accurately.

For example, in conducting a pay equity analysis, the predicted responses are the wages that employees would have been paid after removing apparent wage disparities. Here inputs include the employees’ genders as well as measures of the employees’ productivity. Outcomes are the wages received by the employees. Errors in measures of productivity or in the wages paid weaken the accuracy of predictions of the wage response.

This post focuses on some of the consequences of the simplest form of these measurement errors – random errors. Random errors can infect the measurements of outcomes, input factors, or both. Measurement errors are random if values are overstated about as often as they are understated. In the case of pay equity, random errors would imply that employee-level measures of wages, tenure, education, training, and/or hours worked are as likely to be overstated as understated. For classification variables, such as benefits eligibility, part-time status, and the presence or absence of hazardous working conditions, random errors imply that misclassification in either direction is equally likely.

Such “noisy” data tend to obscure the relationship between the true inputs and the true outcomes. Fortunately, statistical methods regularly report the accuracy of responses as “confidence intervals” – high and low values for the response. For example, a pay equity analysis may estimate that females are paid 10% less than males. With this should come a confidence interval – for example, providing a range estimate that the pay disparity is between 8% and 12%.

Confidence intervals should always be consulted in evaluating the conclusions of data analysis. When noise infects the measurement of the outcome variable (e.g., wages), confidence intervals expand appropriately to reflect the loss of precision arising from this additional measurement noise. If confidence intervals are too wide to be relied upon, it is worth assessing whether the outcomes may be affected by measurement error and, if so, evaluating the costs of cleaning those outcome measures.
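To make this concrete, here is a minimal simulation sketch in Python with NumPy. Everything in it is an assumption chosen for illustration, not taken from any real pay data: the sample size, the 5% true effect of education on log wages, the noise levels, and the helper name `ols_slope_with_ci`. It adds random error to the outcome only and compares the resulting confidence intervals.

```python
import numpy as np

def ols_slope_with_ci(x, y, z=1.96):
    """Return (slope, low, high) for a simple regression of y on x.

    Uses the textbook standard error and a normal approximation for the
    95% interval (z = 1.96), which is reasonable for large samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    sxx = np.sum(x_centered ** 2)
    slope = np.sum(x_centered * (y - y.mean())) / sxx
    residuals = (y - y.mean()) - slope * x_centered
    # Residual variance with n - 2 degrees of freedom
    sigma2 = np.sum(residuals ** 2) / (x.size - 2)
    se = np.sqrt(sigma2 / sxx)
    return slope, slope - z * se, slope + z * se

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: years of education and log wages, true effect 0.05
education = rng.normal(14.0, 2.0, n)
log_wage = 2.0 + 0.05 * education + rng.normal(0.0, 0.2, n)

# Random measurement error in the OUTCOME only
noisy_log_wage = log_wage + rng.normal(0.0, 0.4, n)

clean = ols_slope_with_ci(education, log_wage)
noisy = ols_slope_with_ci(education, noisy_log_wage)

print("clean wages: slope %.4f, CI [%.4f, %.4f]" % clean)
print("noisy wages: slope %.4f, CI [%.4f, %.4f]" % noisy)
```

Under these assumptions, both estimates should land near the true 0.05, while the interval from the noisy wages is noticeably wider. The interval is doing its job: it honestly reports the lost precision.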

However, depending on the type of measurement errors affecting your data, confidence intervals can be misleading as measures of accuracy. Specifically, when random errors affect the measurement of an input (e.g., years of education), confidence intervals also expand (i.e., are less precise), but the estimated response of the outcome to that input is also biased towards zero: it appears weaker than it really is.

For example, when education measurements are affected by noise, the measured response of wages to education will likely be smaller than it really is. The measured effect might be that average wages rise by three percent per additional year of education, but the true effect could be five percent. The confidence interval for the measured, three percent education effect might provide a range of two percent to four percent, systematically lower than the true effect of five percent. If correctly measured, education would play a larger role in explaining apparent wage disparities. This bias is well-known, and is referred to variously as “attenuation” or “dilution” bias. Standard confidence intervals do not correct for this bias.
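The three-versus-five percent example can be reproduced with a small simulation. The sketch below is illustrative only: the data are simulated, the helper `slope_and_ci` is a hypothetical name, and the error level is chosen so that 60% of the variance in recorded education is real signal. For a single input with classical random error, the estimated slope is expected to shrink to the true slope multiplied by that share (often called the reliability ratio), here 0.6 × 0.05 = 0.03.

```python
import numpy as np

def slope_and_ci(x, y, z=1.96):
    """Slope of y on x with an approximate 95% confidence interval."""
    xc = x - x.mean()
    sxx = np.sum(xc ** 2)
    slope = np.sum(xc * (y - y.mean())) / sxx
    resid = (y - y.mean()) - slope * xc
    se = np.sqrt(np.sum(resid ** 2) / (x.size - 2) / sxx)
    return slope, slope - z * se, slope + z * se

rng = np.random.default_rng(1)
n = 20000
true_effect = 0.05  # assumed: 5% higher wages per year of education

# Hypothetical true education and log wages
education = rng.normal(14.0, 2.0, n)
log_wage = 2.0 + true_effect * education + rng.normal(0.0, 0.2, n)

# Random error in the INPUT, sized so that
# reliability = var(true education) / var(recorded education) = 0.6
reliability = 0.6
error_sd = 2.0 * np.sqrt(1.0 / reliability - 1.0)
recorded_education = education + rng.normal(0.0, error_sd, n)

slope, low, high = slope_and_ci(recorded_education, log_wage)
expected = reliability * true_effect

print("true effect:        %.4f" % true_effect)
print("expected (shrunk):  %.4f" % expected)
print("estimated:          %.4f, CI [%.4f, %.4f]" % (slope, low, high))
```

In this setup the whole confidence interval should sit below the true effect. The interval is narrow around the wrong number, which is why it cannot be read as a measure of accuracy here.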

There is a second consequence when random measurement error affects an input. When the effects of multiple inputs on an outcome are measured, measurement error in one input will generally distort responses for all inputs, not just the one measured with error. In the pay equity example, measurement error in education can also distort measured wage responses to tenure, training, job type, and even the effect of gender itself on wages.

While the measurement error of the infected input makes its effect on the outcome seem smaller, in general, the direction of distortions affecting other inputs is unknown. For example, the measured average wage difference due to gender could be artificially magnified by measurement error in other inputs, such as education. As well, the measured effect of an accurately recorded factor such as job type on wages can be artificially large or small due to measurement error in another factor (e.g., education).
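A sketch of this spillover, again with simulated data and assumed numbers: gender is recorded without error, the true gender gap is 5%, and women in this hypothetical workforce average one year less education than men. Only education is mis-recorded. Because the noisy education measure under-explains wages, part of its effect is picked up by gender, which is correlated with it.

```python
import numpy as np

def ols_coefficients(columns, y):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(y.size)] + list(columns))
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs  # [intercept, coef for columns[0], coef for columns[1], ...]

rng = np.random.default_rng(2)
n = 50000

# Hypothetical workforce: gender is recorded accurately
female = rng.integers(0, 2, n).astype(float)

# Assumed: education correlates with gender (one year lower on average for women)
education = 14.0 - 1.0 * female + rng.normal(0.0, 2.0, n)

# Assumed true model: 5% per year of education, 5% gender gap
log_wage = 2.0 + 0.05 * education - 0.05 * female + rng.normal(0.0, 0.2, n)

# Random error in recorded education only
recorded_education = education + rng.normal(0.0, 2.0 * np.sqrt(2.0 / 3.0), n)

clean = ols_coefficients([education, female], log_wage)
noisy = ols_coefficients([recorded_education, female], log_wage)

print("accurate education: education %.4f, gender gap %.4f" % (clean[1], clean[2]))
print("noisy education:    education %.4f, gender gap %.4f" % (noisy[1], noisy[2]))
```

With these particular assumptions the estimated gender gap grows from about 5% to about 7%, even though gender itself is measured perfectly. If the simulated education difference between men and women is reversed, the distortion runs the other way, which illustrates why the direction cannot be known in advance.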

For these reasons, evaluating whether measurement error affects inputs in your analysis is an even higher priority than evaluating whether it affects outcomes. As the saying goes, “Garbage In, Garbage Out” or “GIGO”. Clean input data are foundational and essential to drawing reliable conclusions from your data analysis.
