Yuri Gubin on Why Data Quality Became an Afterthought
From Gradual Growth to Instant Access
Traditionally, organizations built AI capabilities in stages: establishing strong data foundations, moving into advanced analytics, and graduating to machine learning. This sequence allowed data quality practices to mature alongside the technology.
The generative AI revolution disrupted this flow. Suddenly, AI became highly accessible, and everyone jumped on the bandwagon without the preparation work that traditionally preceded advanced analytics projects.
This led to a false sense of simplicity. While AI models can handle natural language and unstructured data more easily than earlier tools, they still depend on solid data to deliver reliable results.
The Garbage In, Garbage Out Reality
The principle "garbage in, garbage out" is more important than ever in AI. Poor data can cause systems to reinforce bias, spread misinformation, and trigger regulatory scrutiny.
For example, medical research once linked ulcers to stress, simply because all patients in the dataset experienced stress. An AI model trained on that data would have reinforced the error and missed the true cause: bacterial infection. Without accurate, contextual data, AI draws confident but wrong conclusions.
The Human Element in Data Understanding
Addressing AI data quality requires more human involvement, not less. Organizations need data stewardship frameworks that include experts who understand technical data structures, business context, and implications. These data stewards can catch subtle but important differences that pure technical analysis can't. In educational technology, for example, combining parents, teachers, and students into a single "users" category for analysis would produce meaningless insights. Domain experts know these roles must be treated separately.
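The ed-tech example above can be made concrete with a small sketch. All records and numbers below are invented for illustration; the point is simply that collapsing roles into one "users" bucket yields an average that describes no real group, while segmenting by role preserves the signal a domain expert expects:

```python
from statistics import mean

# Hypothetical engagement records from an ed-tech platform (invented data).
records = [
    {"role": "student", "weekly_sessions": 12},
    {"role": "student", "weekly_sessions": 10},
    {"role": "teacher", "weekly_sessions": 25},
    {"role": "parent",  "weekly_sessions": 2},
]

# Collapsing everyone into a single "users" bucket hides the role differences.
overall = mean(r["weekly_sessions"] for r in records)

# Segmenting by role keeps the distinctions a domain expert knows matter.
by_role = {}
for r in records:
    by_role.setdefault(r["role"], []).append(r["weekly_sessions"])
per_role = {role: mean(vals) for role, vals in by_role.items()}

print(overall)   # the blended average describes no actual user group
print(per_role)  # per-role averages reveal very different behavior
```

The blended figure sits between wildly different populations, which is exactly the kind of "meaningless insight" a data steward would flag.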
Model and dataset analysis experts might not properly understand what the data means for the business. That's why data stewardship requires both technical and domain expertise.
This human oversight becomes especially critical as AI systems make decisions that affect real people, from hiring and lending to healthcare and criminal justice applications.
Regulatory Pressure Drives Change
Regulations, not internal initiatives, are now driving demand for data quality. States in the U.S. are passing laws on AI in hiring, licensing, and benefits, requiring organizations to explain their data use.
Clear data records, consent, and auditable processes are mandatory. Some data points simply can't be used if they present discrimination risks. The focus on explainable AI sets higher data quality standards.
Organizations must ensure data is not only correct and complete, but also organized for transparency in AI decisions.
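As a minimal sketch of what "clear data records, consent, and auditable processes" can mean in practice, the structure below attaches provenance and consent to each record and filters out anything lacking consent before a model sees it. The field names and consent model are assumptions for illustration, not a regulatory standard:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

# Hypothetical auditable data record: provenance and consent travel
# with the value itself (field names are invented).
@dataclass
class DataRecord:
    value: str
    source: str          # where the data came from
    consent_given: bool  # whether the subject consented to this use
    collected_at: str    # ISO timestamp for the audit trail

record = DataRecord(
    value="applicant resume text",
    source="careers_portal_upload",
    consent_given=True,
    collected_at=datetime.now(timezone.utc).isoformat(),
)

# Records without consent are excluded before any model ever sees them.
usable = [r for r in [record] if r.consent_given]
print(json.dumps(asdict(usable[0]), indent=2))
```

Because every record carries its own source and timestamp, an auditor can trace any AI decision back to the data that informed it.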
Subtle Biases in Training Data
Data bias extends beyond obvious demographic characteristics to subtle linguistic and cultural patterns that can reveal an AI system's training origins. The word "delve," for example, appears more in AI-generated text because it's more common in training data than in typical American or British business writing.
Because of reinforcement learning during training, certain words were amplified and now appear at statistically higher rates in text produced by specific models. Users can see that bias reflected in the outputs.
These linguistic fingerprints demonstrate how training data characteristics inevitably appear in AI outputs. Even neutral technical choices can introduce systematic bias affecting user trust and performance.
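A crude version of this fingerprinting can be sketched in a few lines: compare how often a marker word like "delve" appears per 10,000 tokens across two corpora. The snippets below are invented stand-ins for human-written and model-generated text, and the tokenizer is deliberately simplistic:

```python
import re
from collections import Counter

def rate_per_10k(text: str, word: str) -> float:
    """Occurrences of `word` per 10,000 tokens -- a crude frequency probe."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 10_000 * Counter(tokens)[word] / len(tokens)

# Invented snippets standing in for human vs. model-generated corpora.
human_text = "The report covers quarterly results and outlines next steps."
model_text = "Let us delve into the results, then delve deeper into the causes."

print(rate_per_10k(human_text, "delve"))
print(rate_per_10k(model_text, "delve"))
```

On real corpora this kind of frequency gap across many marker words, not a single one, is what makes a model's training data statistically detectable in its outputs.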
Quality Over Quantity Strategy
Despite the industry's excitement about new AI model releases, a more disciplined approach focused on clearly defined use cases rather than maximum data exposure proves more effective.
Instead of simply sharing more data with AI, sticking to the basics and thinking in terms of product concepts produces better results. You can't just throw a lot of good ingredients into a pot and assume something good will come out.
This philosophy contradicts the common assumption that more data automatically improves AI performance. In practice, carefully curated, high-quality datasets often produce better results than massive, unfiltered collections.
The Actionable AI Future
Looking ahead, "actionable AI" systems will reliably perform complex tasks without hallucinations or errors, handling multi-step processes like booking movie tickets at an unfamiliar theatre: figuring out the interface and completing the transaction autonomously.
Imagine asking your AI assistant to book a ticket for you. Although that AI engine has never worked with that provider, it will figure out how to do it. You will receive a confirmation email in your inbox without any manual intervention.
Achieving this demands robust data quality, automated annotation, and security built directly into data processes.
Built-in Data Security
Future AI systems will need "data entitlement" capabilities that automatically understand and respect access controls and privacy requirements. This goes beyond current approaches that require manual configuration of data permissions for each AI application.
Breaking down data silos should not create new, more complex problems by accidentally leaking data. This represents a fundamental shift from treating data security as an external constraint to making it a natural characteristic of AI systems.
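The "data entitlement" idea above can be illustrated with a minimal access-check sketch, in which the entitlement rules travel with the data rather than being configured per application. The roles, fields, and rule table here are all hypothetical:

```python
# Hypothetical entitlement table: which roles may read which fields.
# In a real system these rules would come from governance metadata.
ENTITLEMENTS = {
    "salary": {"hr_analyst"},
    "email":  {"hr_analyst", "support_agent"},
}

def fetch_field(record: dict, field: str, caller_role: str):
    """Return a field only if the caller's role is entitled to it."""
    allowed = ENTITLEMENTS.get(field, set())
    if caller_role not in allowed:
        raise PermissionError(f"{caller_role!r} is not entitled to {field!r}")
    return record[field]

employee = {"email": "a@example.com", "salary": 90_000}

print(fetch_field(employee, "email", "support_agent"))
# fetch_field(employee, "salary", "support_agent") would raise PermissionError.
```

Because the check lives at the data-access layer, every AI application that reads through it inherits the same controls, which is what keeps breaking down silos from becoming a new leak vector.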
Strategic Implications
The data quality challenge mirrors the broader gap between technology's potential and organizational readiness. Companies investing now in stewardship, bias checks, and robust controls will have a clear edge.
Success depends on laying the groundwork: sound architecture, governance, expert oversight, and a culture that values data quality over rushing to market.
As rules tighten and AI takes on greater responsibility, neglecting data quality will carry bigger risks. Those with strong foundations will not only comply, but also build the trust and growth needed for long-term success.
To realize AI's promise, organizations must treat data quality as a strategic imperative, not a technical afterthought. Those who do will lead; those who don't will keep struggling to make AI work as intended.