Poor data quality is enemy number one to the widespread, profitable use of machine learning. While the caustic observation “garbage in, garbage out” has plagued analytics and decision-making for generations, it carries a special warning for machine learning. The quality demands of machine learning are steep, and bad data can rear its ugly head twice: first in the historical data used to train the predictive model, and second in the new data used by that model to make future decisions.
To properly train a predictive model, historical data must meet exceptionally broad and high standards of quality. First, the data must be right: It must be correct, properly labeled, de-duped, and so forth. But you must also have the right data — lots of unbiased data, spanning the entire range of inputs for which one aims to develop the predictive model. Most data quality work focuses on one criterion or the other, but for machine learning you must work on both simultaneously, as the sketch below illustrates.
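To make the two criteria concrete, here is a minimal sketch in Python with pandas of one check for each. Everything specific in it — the column names (`label`, `income`, `group`), the label set, and the deployment range — is an illustrative assumption, not something from the article.

```python
import pandas as pd

def check_data_are_right(df: pd.DataFrame) -> dict:
    """First criterion: are the records themselves correct?"""
    return {
        # Exact duplicate records distort the training signal.
        "duplicate_rows": int(df.duplicated().sum()),
        # Unlabeled records cannot be used for supervised training.
        "missing_labels": int(df["label"].isna().sum()),
        # Labels outside the expected set suggest entry or mapping errors.
        "invalid_labels": int(
            (df["label"].notna() & ~df["label"].isin(["approve", "deny"])).sum()
        ),
    }

def check_right_data(df: pd.DataFrame, deployment_range: tuple) -> dict:
    """Second criterion: is this the data the model actually needs?"""
    lo, hi = deployment_range
    return {
        # Training inputs should span the range the deployed model will see.
        "covers_deployment_range": bool(
            df["income"].min() <= lo and df["income"].max() >= hi
        ),
        # A heavily skewed group mix is one crude warning sign of bias.
        "group_shares": df["group"].value_counts(normalize=True).round(3).to_dict(),
    }
```

The two functions are deliberately independent: a dataset can pass every check in the first and still fail the second, which is the article’s point about working on both criteria at once.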
Yet today, most data fails to meet basic “data are right” standards. Reasons range from data creators not understanding what is expected, to poorly calibrated measurement gear, to overly complex processes, to human error. To compensate, data scientists cleanse the data before training the predictive model. It is time-consuming, tedious work (taking up to 80% of data scientists’ time), and it’s the problem data scientists complain about most. Even with such effort, cleansing neither detects nor corrects all of the errors, and, as yet, there is no way to understand their impact on the predictive model. What’s more, data does not always meet “the right data” standards either, as reports of bias in facial recognition and criminal justice attest.
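A hedged sketch of what that cleansing typically looks like, again in Python with pandas and with made-up column names, shows why it is both tedious and incomplete: each step catches only the errors someone anticipated.

```python
import pandas as pd

def cleanse(df: pd.DataFrame) -> pd.DataFrame:
    """Routine cleansing steps of the kind that consume so much of a
    data scientist's time; none of them is sufficient on its own."""
    out = df.copy()
    # Remove exact duplicate records.
    out = out.drop_duplicates()
    # Standardize free-text fields so trivial variants of the same
    # value ("Boston ", "boston") are treated as one.
    out["city"] = out["city"].str.strip().str.lower()
    # Drop rows whose label is missing; they cannot train the model.
    out = out.dropna(subset=["label"])
    # Flag implausible numeric values rather than silently "fixing"
    # them; the true value is usually unknowable after the fact.
    out["income_suspect"] = ~out["income"].between(0, 1_000_000)
    return out
```

Note the design choice in the last step: flagging suspect values instead of correcting them. Anything the checks do not anticipate — a mislabeled but plausible record, say — passes through untouched, which is why cleansing alone cannot guarantee quality.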
Increasingly complex problems demand not just more data, but more diverse, comprehensive data. And with this come more quality problems. For example, handwritten notes and local acronyms have complicated IBM’s efforts to apply machine learning (e.g., Watson) to cancer treatment.