This Week in AI: Fighting AI Scrapers

Cutting through the noise in AI

Welcome to the Women Who AI Newsletter, your weekly update on what actually matters in AI when you’re focused on building and scaling startups.

Was this forwarded to you? We send out newsletters every Monday morning. Click here to subscribe and join our community of founders shaping the future of AI.

AI Scraper Traps: How Developers Are Fighting Back Against Unauthorized Data Mining

As AI companies race to train ever-larger models on more data, a new battlefront has emerged: website owners and developers are increasingly deploying countermeasures against aggressive AI web crawlers that ignore standard "do not scrape" protocols.

Below, we break down the tactics emerging on this front and why they matter for startup founders.

The Problem: AI Crawlers Behaving Badly

TechCrunch recently called AI crawlers the "cockroaches of the internet." These bots scour the web to collect training data for large language models, but many don't play by established rules:

  • They ignore robots.txt files (the standard file telling crawlers which pages not to crawl)

  • They hammer servers with requests, sometimes causing denial-of-service conditions
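For context, robots.txt remains the polite first line of defense, even if the worst offenders ignore it. A minimal file that asks AI crawlers to stay away might look like the following (GPTBot and CCBot are publicly documented user agents for OpenAI and Common Crawl; any given site would list the crawlers relevant to it):

```txt
# robots.txt — served at the site root
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else may crawl normally
User-agent: *
Allow: /
```

Compliance is entirely voluntary, which is exactly why the tactics below exist.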

The impact can be especially devastating for open-source projects and smaller websites. Some developers report spending "20-100% of their time in any given week mitigating hyper-aggressive LLM crawlers," with sites experiencing dozens of brief outages weekly.

Tactics to Block AI Crawlers

Website owners are fighting back with increasingly sophisticated methods:

1. Proof-of-Work Challenges (Anubis)

Developer Xe Iaso created Anubis, a tool that presents a computational proof-of-work challenge: an ordinary browser solves it automatically in the background, but the cost adds up fast for a bot hitting millions of pages. Named after the Egyptian god who weighed souls, Anubis effectively judges whether a visitor is worth admitting before granting access.
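The core idea can be sketched as a hashcash-style proof of work: the server hands out a random challenge, and the client must find a nonce whose hash starts with a required number of zero bits before it gets the page. This is an illustrative sketch, not Anubis's actual code; the function names and difficulty value are our own:

```python
import hashlib
import secrets

DIFFICULTY_BITS = 12  # leading zero bits required; illustrative value


def issue_challenge() -> str:
    """Server side: hand the visitor a random challenge string."""
    return secrets.token_hex(16)


def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash checks the submitted nonce."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return value >> (256 - DIFFICULTY_BITS) == 0


def solve(challenge: str) -> int:
    """Client side: brute-force nonces until the hash meets the target.
    Cheap for one page view, expensive across millions of crawled pages."""
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce


challenge = issue_challenge()
nonce = solve(challenge)
assert verify(challenge, nonce)
```

The asymmetry is the point: verification costs the server one hash, while solving costs the visitor thousands of attempts on average, a price a human browsing one page never notices but a crawler at scale cannot ignore.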

2. Honeypot Mazes (AI Labyrinth, Nepenthes)

Rather than simply blocking bots, some developers are creating elaborate traps:

  • Cloudflare's AI Labyrinth feeds crawlers AI-generated content that appears legitimate but is irrelevant, wasting their computational resources and poisoning their datasets.

  • Nepenthes (named after a carnivorous plant) traps crawlers in an endless maze of fake content, deliberately attempting to poison AI training data.

As demonstrated in Alan Smith's prototype, developers can:

  • Serve pages with links pointing to "honeypot" content filled with Markov chain-generated text

  • Use JavaScript to detect human interactions (clicks, taps) and dynamically update links to their actual valid locations

This approach allows humans to navigate normally while bots get stuck in meaningless content. Smith explains the rationale: "I have no problem putting poison in prohibited places. If you run a system that ignores the prohibitions, then ingesting the poison is what you designed the system to do."
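The Markov-chain side of this trap is straightforward to sketch: record which word follows which in some seed text, then emit statistically plausible but meaningless pages for any bot that follows a honeypot link. The seed text and function names below are illustrative, not taken from Smith's prototype:

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain


def babble(chain, length=50, seed=None):
    """Generate locally plausible, globally meaningless honeypot text."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        # Restart from a random word if we hit a dead end
        word = rng.choice(followers) if followers else rng.choice(list(chain))
        out.append(word)
    return " ".join(out)


SEED_TEXT = (
    "the crawler follows every link the server offers and the server "
    "offers every link the crawler follows until the maze never ends"
)

page_text = babble(build_chain(SEED_TEXT), length=30)
```

In a real deployment, each generated page would also embed links to further generated pages, so a non-compliant crawler descends indefinitely, while human visitors, whose clicks are rewritten by JavaScript to the real URLs, never see any of it.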

Implications for Startups

This emerging defensive landscape has several implications for companies:

For AI Service Providers

  1. Infrastructure Challenges: Companies deploying AI crawlers may face increasing resistance and technical barriers, potentially increasing operational costs.

  2. Reputation Management: As aggressive crawling becomes associated with negative impacts on the web ecosystem, companies may face backlash for their data collection practices.

  3. Quality Control Challenges: If poisoning tactics succeed, services that rely on web data may receive contaminated information, potentially leading to less reliable AI outputs or requiring more intensive data cleaning.

For Content Creators and Site Owners

  1. New Protection Options: These tools offer new ways to protect valuable content from unauthorized use, potentially preserving competitive advantages.

  2. Infrastructure Protection: Beyond content concerns, these tools help protect the stability and performance of your digital infrastructure from being overwhelmed by aggressive crawlers.

  3. Technical Considerations: Implementing these protections requires careful design to ensure they don't impact legitimate users, especially those with disabilities or using assistive technologies.

  4. License Considerations: Some developers are exploring new license variations that explicitly forbid using the code in AI training datasets. While these may not be fully legally tested, they signal an intent that could influence future legal interpretations.

Hackathons

Ready to build that product you've been dreaming about? Check out these upcoming hackathons!

If you'd like to find a Women Who AI team for any event, reply to this email, and we'll connect everyone interested.

AI Agents Hackathon by Microsoft | Virtual | April 8-30 | $50K in prizes | Event Link

Robot Arm Hackathon | NYC | Sat & Sun April 19-20 | Event Link

P.S. Yesterday's Women Who AI Build Jam in Montreal and SF was incredible! We're excited to host more of these community-building events in the coming months. Stay tuned for announcements about our next locations and dates.

Reply With Questions

We want this newsletter to address the real challenges you're facing. Is there a specific AI development you'd like explained? Jargon we included but didn't properly explain? A business problem you're trying to solve with AI? Reply directly to this email with your questions, and we'll tackle them in next week's edition.

Share the Knowledge

If you found value in today's newsletter, please consider forwarding it to other women in your network who are building, or thinking about building, in the AI space. The more we grow this community, the stronger our collective impact becomes.

Here's to building the future of AI, together.

Lea & Daniela