
The Ticking Time Bomb You Should Know About Datasets

Dheeraj Jalali | August 1, 2024

In the age of AI, data is king.

In this article

  1. The Problem: Data Without Consent
  2. The Fallout: Why Unethical Data Can Cripple Your Business
  3. The Culprits: How Data Providers Can Fumble
  4. Unintentional Mistakes:
  5. Intentional Misconduct:
  6. The Buyer’s Blind Spots: How You Can Be Exposed
  7. Unintentional Exposure:
  8. Intentional Blindness:
  9. The Solution: Prioritizing Ethical AI
  10. The Bottom Line: Don’t Gamble on Dataset Sourcing

Datasets are the critical data that tech companies need to train their AI models.

However, several major tech companies’ methods of obtaining essential datasets have landed them in hot water—and massive lawsuits. 

Lawsuits are mounting over suspect data sourcing, raising a critical question for any business looking to leverage AI: Is your AI training data built on a foundation of sand? 

This article explores the hidden risks of unethical data sources, highlighting how companies obtaining datasets can unwittingly (and in some cases knowingly) expose themselves to various legal, financial, and reputational issues.

More importantly, we want to inform you about the short- and long-term benefits of using ethically sourced datasets from the outset of your projects.

The Problem: Data Without Consent

Imagine this: You purchase a dataset and later discover it contains information scraped from the web without user consent.

This scenario isn’t hypothetical. 

Many companies are facing massive lawsuits for training models on data with questionable origins. Some of these major companies have openly acknowledged that they obtained enormous amounts of AI training data by scraping the web without user consent, and have taken the position that the fair use doctrine permits them to access and use data on the internet for training models.

Legal precedents in AI data sourcing are evolving rapidly.  

As lawsuits have mounted, most prominent tech companies are modifying their practices for obtaining AI training data and increasingly embracing ethical AI principles.

The Fallout: Why Unethical Data Can Cripple Your Business

Here’s a breakdown of the potential consequences of using unethically sourced data:

  • Legal Trouble: Like those major tech companies, you and your company could be sued for violating privacy laws and regulations.
  • Reputational Damage: News of your involvement with unethical data can lead to bad press and consumer distrust.
  • Operational Delays: If your model is built on any data to which you do not hold full legal rights, you may have to remove that data after the fact and retrain your AI models, causing delays and wasted resources. When it comes to datasets, even a single unpermitted file can spoil the entire bunch.
  • Financial Strain: Lawsuits, retraining costs, and reputational damage can all take a significant financial toll.
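To illustrate the operational point, here is a minimal Python sketch of how a team might flag files without documented rights before training, rather than after. Everything in it is an assumption made for illustration: the `FileRecord` fields, the license labels, and the file paths are hypothetical, and real clearance decisions depend on your contracts and legal advice.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical license labels; a real project would derive these from its contracts.
CLEARED_LICENSES = {"licensed", "public-domain"}


@dataclass(frozen=True)
class FileRecord:
    path: str
    license: Optional[str]    # None means the license is unknown
    consent_documented: bool  # True only if consent evidence is on file


def split_by_rights(
    records: List[FileRecord],
) -> Tuple[List[FileRecord], List[FileRecord]]:
    """Separate files cleared for training from files that need review."""
    cleared, flagged = [], []
    for record in records:
        if record.license in CLEARED_LICENSES and record.consent_documented:
            cleared.append(record)
        else:
            flagged.append(record)
    return cleared, flagged


# Illustrative manifest; the paths and values are made up.
manifest = [
    FileRecord("audio/0001.wav", "licensed", True),
    FileRecord("audio/0002.wav", None, False),
    FileRecord("audio/0003.wav", "public-domain", False),
]
cleared, flagged = split_by_rights(manifest)
print([r.path for r in flagged])  # ['audio/0002.wav', 'audio/0003.wav']
```

Running a check like this before training means a flagged file costs a review, not a retraining cycle.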

The Culprits: How Data Providers Can Fumble

Unethical data can come from both intentional and unintentional mistakes made by providers. 

Here’s a closer look at both sides:

Unintentional Mistakes:

  • Lax Data Collection Practices: Data providers may lack adequate procedures for verifying consent or rely on outdated methods that no longer meet current regulations.
  • Using ‘Open Source’ Training Datasets: Some major datasets are available at no cost and are described as ‘open source’. However, despite the open source label, these datasets may not have the necessary legal consent from all the owners and sources of those data files. Don’t make the mistake of assuming ‘open source’ datasets give you the legal rights to use them however you want.

Intentional Misconduct:

  • Scraping Data Without Consent: Some providers knowingly scrape data from the Internet without user knowledge or permission.
  • Misrepresenting Data Origin: Providers may intentionally mislead buyers about the source and collection methods used for the data.

The Buyer’s Blind Spots: How You Can Be Exposed

While data providers shoulder most of the responsibility, buyers can also contribute to the problem, whether unknowingly or knowingly:

Unintentional Exposure:

  • Lack of Due Diligence: Failing to research a provider’s data-sourcing practices can leave you vulnerable to buying unethical datasets.
  • Focus on Price Over Ethics: Prioritizing low-cost datasets over ethical sourcing can lead to enormous hidden costs down the line.
  • Misunderstanding Data Provenance: Not fully understanding the data’s origin and collection methods can make it challenging to identify potential problems.

Intentional Blindness:

  • Turning a Blind Eye: Ignoring red flags or warning signs about a provider’s practices to secure a desired dataset.
  • Prioritizing Speed Over Scrutiny: Rushing the buying process can lead to overlooking crucial details about the data’s ethics.
  • Pressures to Meet Internal Goals: Prioritizing internal deadlines or cost-cutting measures over ethical considerations.

The Solution: Prioritizing Ethical AI

The good news is that you can avoid these pitfalls by prioritizing Ethical AI practices. 

Here’s what to look for when buying a dataset:

  • Transparency: Demand clear information about the data’s origin, collection methods, and consent procedures.
  • Reputation: Choose established data providers with a proven track record of ethical sourcing.
  • Guarantees: Look for providers who offer guarantees around data provenance and user consent.
  • Thorough Contracts: Look for contracts that spell out, almost exhaustively, the terms, origins, and rights of what you’re sourcing.
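The checklist above can be turned into a simple pre-purchase check. The Python sketch below is only an illustration under stated assumptions: the `REQUIRED_DISCLOSURES` field names and the shape of the provider "datasheet" are hypothetical, not an industry standard, and passing the check does not replace a legal review of the contract.

```python
from typing import Dict, List

# Hypothetical field names that mirror the checklist: transparency about origin,
# collection methods, and consent, plus guarantees and contractual rights.
REQUIRED_DISCLOSURES = (
    "origin",
    "collection_method",
    "consent_procedure",
    "provenance_guarantee",
    "usage_rights",
)


def missing_disclosures(datasheet: Dict[str, object]) -> List[str]:
    """Return the checklist fields the provider left absent or blank."""
    missing = []
    for field in REQUIRED_DISCLOSURES:
        value = datasheet.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative datasheet; the values are made up.
provider_datasheet = {
    "origin": "Recorded in studio by contracted voice actors",
    "collection_method": "Commissioned recordings",
    "consent_procedure": "",
}
print(missing_disclosures(provider_datasheet))
# ['consent_procedure', 'provenance_guarantee', 'usage_rights']
```

Any field the function returns is a question to put to the provider before signing.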

Voices takes Ethical AI practices very seriously, and we adhere to NAVA’s 3 Cs: Consent, Control and Compensation.

Consent: Actors should have the right to decide whether or not they want their voice used for any purpose, including AI. 

Control: Once a synthetic voice is made, the actor should be assured that it stays where it is supposed to and isn’t used beyond the scope of any agreement made between the actor, AI company, and end client. Any usage beyond the original agreement must be cleared with the voice actor whose voice is being used.

Compensation: Actors should be paid fairly for the use of their voice print and the licensing of their voice and/or likeness.

The Bottom Line: Don’t Gamble on Dataset Sourcing

Unethically sourced data is a gamble you simply can’t afford to take. 

By understanding the risks posed by data providers and by your own blind spots, you can protect your business and build a foundation for success in the age of AI.

Remember, doing it right the first time is not just good business; it’s the responsible choice.

In the fast-paced world of AI, cutting corners on data sourcing may be tempting. 

But the potential consequences (lawsuits, reputational damage, and wasted resources) far outweigh any short-term gains.

By prioritizing Ethical AI, you’re not just protecting your business; you’re also taking a stand for responsible innovation. Choosing ethically sourced data is an investment in your future, ensuring your AI models are built on trust and transparency.
