I’m sure that, like me, you’re keeping an open mind about how Generative AI (GenAI) is transforming companies. It’s not only revolutionizing the way industries operate; GenAI is also training on every byte and bit of available information as it builds itself into a critical component of business operations. However, this change comes with an often-overlooked risk: the quiet leak of organizational data into AI models.
What most people don’t know is that the heart of this data leak is Internet crawlers, programs similar to search engines that scour the Internet for content. Crawlers collect huge amounts of data from social media, proprietary leaks, and public repositories, and the collected information feeds the massive datasets used to train AI models. One dataset in particular is Common Crawl, an open repository that has been collecting data since 2008; the historical record goes back even further, into the 1990s, with The Internet Archive’s Wayback Machine.
Common Crawl has collected, and continues to collect, vast portions of the public Internet every month, amassing petabytes of web content and providing AI models with extensive training material. If that’s not enough to worry about, companies often fail to recognize that their data may be included in these datasets without their explicit consent. And would you also like to know that Common Crawl can’t distinguish between data that should be public and data that should be private?
I’m guessing you’re starting to feel concerned, since Common Crawl’s dataset is publicly available and immutable, meaning once data is scraped, it remains accessible indefinitely. What does indefinitely look like? Here’s a great example! Do you remember the Netscape website, where we had to actually buy and download the Netscape Navigator browser? The Wayback Machine does! It’s a reminder that if an organization’s website has been made publicly available, its content has likely been captured forever.
If you’re concerned about what to do next, start by verifying whether your company’s data has been collected.
- Utilize tools like the Wayback Machine at web.archive.org to review historical web snapshots.
- Perform advanced searches of the Common Crawl datasets directly at index.commoncrawl.org.
- Employ custom scripts to scan these datasets for proprietary content from your publicly facing Internet assets (you know, the stuff that should be behind an authentication wall); see the sketch below for a starting point.
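If you want a starting point for such a script, here’s a minimal Python sketch. It uses the public CDX query endpoints that the Wayback Machine and Common Crawl expose; `your-company.com` is a placeholder, and the Common Crawl index name changes with each crawl, so confirm the current one at index.commoncrawl.org before running.

```python
# Minimal sketch: check whether a domain appears in the Wayback Machine
# and in one Common Crawl index. "your-company.com" is a placeholder.
import requests

DOMAIN = "your-company.com"  # placeholder: replace with your own domain

# Wayback Machine CDX API: returns a JSON array with a header row,
# then one row per captured snapshot.
wayback = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={"url": f"{DOMAIN}/*", "output": "json", "limit": "10"},
    timeout=30,
)
rows = wayback.json() if wayback.text.strip() else []
print(f"Wayback Machine snapshots (showing up to 10): {max(len(rows) - 1, 0)}")

# Common Crawl index API: one JSON object per line for each matching
# record. The index name below is an example; check index.commoncrawl.org
# for the name of the current crawl.
cc = requests.get(
    "https://index.commoncrawl.org/CC-MAIN-2024-51-index",
    params={"url": f"{DOMAIN}/*", "output": "json", "limit": "10"},
    timeout=60,
)
if cc.ok:
    hits = [line for line in cc.text.splitlines() if line.strip()]
    print(f"Common Crawl records (showing up to 10): {len(hits)}")
else:
    print("No Common Crawl records found (or the index name is out of date).")
```

Any hits mean your content is already in the archives; from there, you can widen the query to specific paths you expected to stay private.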
Want some more fun facts? Once trained, AI models compress these gigantic amounts of data into significantly smaller instances. For example, two petabytes of training data can be distilled into a model as small as five terabytes. That’s a 400:1 compression ratio! So protect these valuable critical assets like the crown jewels they are, because data thieves scour your company’s network looking for these treasured models.
Starting today, there are two types of data in this world: Stored and Trained. Stored data is the unaltered retention of information such as databases, documents, and logs. Trained data is AI-generated knowledge inferred from patterns, relationships, and statistical modeling.
I bet you’re a bit like me, also wondering about the legal and ethical implications of training GenAI on these massive datasets. A prime example of AI’s data exposure risk is the American Medical Association’s (AMA) Current Procedural Terminology (CPT) codes, which make up Level I of the Healthcare Common Procedure Coding System (HCPCS). These medical codes are copyrighted, yet AI models trained on public datasets can generate and infer them without a paid license. Some organizations, like The New York Times and groups of authors, have already filed lawsuits over the use of copyrighted content. So for now, we have to wait and see how these arguments get tested in the courts.
And this is why I say that GenAI is capable of quietly leaking your company’s data. All you have to know is the right “prompt” (that is, the right question to ask GenAI), and, as with the CPT codes, it provides the best response it can come up with based on generalization and inference from the patterns and relationships it learned during training. Now ask yourself: is that Trained GenAI as good as Stored data?
I will say, though, there is some “good” news if you want to prevent your organization’s data from being collected into these large datasets and ultimately protect yourself from quiet leaks through GenAI.
- Ethical crawlers that respect the rules can be kept out by implementing a robots.txt file, which tells dataset scrapers not to index your content (see the sample file after this list).
- Common Crawl will exclude your data when requested, but past records remain untouched.
- Security audits can help identify what data is publicly accessible on the Internet and whether it should be moved behind authentication walls.
- Implement data classification policies and train employees on best practices for managing data, to prevent unauthorized content from becoming publicly available to these crawlers.
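To make the first item concrete, here’s what such a robots.txt might look like. The user-agent strings below are the publicly documented ones for Common Crawl (CCBot), OpenAI (GPTBot), and Google’s AI-training control (Google-Extended); keep in mind this only deters crawlers that choose to honor the file.

```
# robots.txt - ask known AI/dataset crawlers not to index this site.
# New crawlers appear regularly, so review this list periodically.

User-agent: CCBot            # Common Crawl
Disallow: /

User-agent: GPTBot           # OpenAI's training crawler
Disallow: /

User-agent: Google-Extended  # Google's AI-training product token
Disallow: /
```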
Is the quiet data leak going to stop GenAI adoption? No! Is it going to require more Risk Management? Yes!
AI is going to reshape industries in ways we can’t even predict. We are just beginning to see regulations like California’s SB 892, starting in 2027, and the EU’s AI Act, which is already in effect. These regulations, along with the GenAI legal challenges, make it even more important that organizations strike a balance between innovation and data security. Just imagine your organization failing to manage AI-related risks and ending up with legal liabilities from unauthorized use cases, regulatory penalties for non-compliance, and reputational damage from AI-generated misinformation.
Want to stay far away from these problems? Here are some recommendations for what you can do.
- Clarity – Structured & Accountable AI Governance
Use AI-specific risk and compliance frameworks for responsible usage
- Collaboration – Integrated Risk & Business Strategy
Embed AI governance within core processes for proactive risk management
- Controls – Scalable & Adaptable Security Framework
Align AI policies and security controls to meet business objectives
- Continuity – Proactive, Continuous Risk & Compliance Monitoring
Adapt to the evolution of AI using ongoing compliance validation
- Culture – Cyber Risk Ownership & AI Ethics Mindset
Promote a security-first culture to embed AI ethics, security, and risk awareness
I’m not sure if you noticed, but each of these recommendations starts with the letter C, so from now on we can call them the “Five Cs of GenAI Risk Management”.
What happens next is that organizations need to take proactive steps to protect their intellectual property and sensitive information from unauthorized AI training datasets. After all, we all know that AI-powered innovation will continue to evolve, and data security cannot be an afterthought.
So if you haven’t gotten around to defining risk management policies for GenAI, validating alignment with regulatory and compliance standards, and managing the risks using the Five Cs, don’t worry; most people haven’t either. But it’s time to get serious about protecting your company’s data from the quiet data leak by GenAI.