How to confound AI crawlers? Feed them on a diet of their own work
Cloudflare's CEO has unveiled an ingenious tool to help publishers keep their sites AI crawler-free.
An unlikely ally is coming to the aid of publishing. Unlikely, because it has felt for ages that much of the consumer-facing tech industry has succeeded at the cost of our own publishing industry, and has not much concern for us, even while enabling easily-reined smaller content creators to thrive under its dubious umbrella.
Enter Matthew Prince, CEO of Cloudflare. It was Mr Prince who recently supplied us with the interesting metric of pages scraped by a search engine versus the number of visitors that the scraped site receives in return. Last week Prince updated us with the latest such data. Take a deep breath, because it's rough.
Six months ago the ratio of pages crawled by OpenAI to visitors it sent a publisher was 250:1. We thought that was bad. A little more than 24 weeks later and the figure is 1,500:1.
Twenty-four weeks.
It's not going to get better until OpenAI et al hear the slurping sound at the bottom of publishing's cup, is it? And it's not just them. Being so large makes them answerable, but once you get past the foolish and arrogant "we are the future" waffle there are countless smaller GenAI actors in operation too, and in darker places.
Prince has now announced that Cloudflare have produced a tool that will confound the most determined of AI crawlers, and the way it operates is actually quite delicious. Rather than blocking a crawler - an action which alerts it to the block and allows it to recalibrate its attack - the Cloudflare tool instead leads the crawler into an endless maze of (and here's the delicious part) GenAI content. Meaningless GenAI content. A technological moron mirror, if you will. Hoist with their own petard.
It's also dual use, acting as a honeypot for malign actors, as Cloudflare themselves point out: "No real human would go four links deep into a maze of AI-generated nonsense. Any visitor that does is very likely to be a bot, so this gives us a brand-new tool to identify and fingerprint bad bots."
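For the technically curious, the "four links deep" idea can be sketched in a few lines of Python. This is only an illustration of the heuristic in the quote, not Cloudflare's implementation: the `/maze/` path, the `flag_likely_bots` function and the sample log are all made up for the example, and counting decoy pages per visitor is a simplification of tracking real link depth.

```python
from collections import defaultdict

MAZE_PREFIX = "/maze/"  # hypothetical path under which decoy pages live
DEPTH_THRESHOLD = 4     # "four links deep", per the Cloudflare quote


def flag_likely_bots(requests):
    """Return visitor IDs that requested at least DEPTH_THRESHOLD decoy pages.

    `requests` is an iterable of (visitor_id, path) pairs in arrival order.
    """
    depth = defaultdict(int)
    flagged = set()
    for visitor_id, path in requests:
        if path.startswith(MAZE_PREFIX):
            depth[visitor_id] += 1
            if depth[visitor_id] >= DEPTH_THRESHOLD:
                flagged.add(visitor_id)
    return flagged


# Made-up request log: one human who strays into the maze once, one crawler.
log = [
    ("reader", "/news/story"),
    ("crawler", "/maze/a"),
    ("reader", "/maze/a"),
    ("crawler", "/maze/b"),
    ("crawler", "/maze/c"),
    ("crawler", "/maze/d"),
]
likely_bots = flag_likely_bots(log)
```

In this toy log only the crawler crosses the threshold; the reader who clicks a single decoy link is left alone, which is the point of setting the bar at several links rather than one.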
It's hard not to sound like an advert for Cloudflare writing this, yet Prince's intervention is not entirely altruistic, of course. A successful business such as Cloudflare wants a healthy, thriving web in which to transact more business. It's likely that Prince sees a future in which the living, creative web has been shrivelled on the vine by the ravages of GenAI aphids.
He's also suitably pugnacious about the matter, in a way in which few in publishing are willing or able. He didn't quite say "I'm going to clock those AI cowboys so hard they'll be saying IA for a month!" (No offence to our Spanish readers! - Ed) but he kinda did in saying, as reported by Axios, that "I go to war every single day with the Chinese government, the Russian government, the Iranians, the North Koreans, probably Americans, the Israelis, all of them who are trying to hack into our customer sites. And you're telling me, I can't stop some nerd with a C-corporation in Palo Alto?"
It's only a single statement of defiance, but it's indubitably defiance, and from someone who can do something about it. More defiance please. Stop the malign nerds.
It's not the first good idea Cloudflare have presented in reaction to the AI-scraping age. The company also still aims to create a marketplace for site owners to sell AI model providers access to their site’s content, and Prince may have been alluding to this when he mentioned another tool Cloudflare is working on that will stop content scraping.
"That's the easy step, and that's coming very, very soon, and every publisher you have ever heard of is on board," he said at Cannes.
We're likely all reading some of the many SEO takes on getting your brand and content featured in AI search results at the moment. Every single one I read, no matter how skilled and insightful the SEO practitioner is, seems to me to be an account of how to thread a beautiful strand of content through the eye of a particularly jagged and uncooperative needle. Who would want to do that?
It's begging for crumbs, and begging for crumbs will make you a beggar.
A rebalancing is required if the web which has grown so wonderfully in our lifetimes is not to become creatively barren. Tools such as that built by Cloudflare supply some of the treatment required. The law, and specifically copyright law, must do much of the rest.
We'll likely see more of these kinds of "AI baffling" tools become available in the near future. Prince can't be alone in seeing the grim prospect ahead and wanting to stop it.
Nothing is inevitable.
Calling all media and publishing content leaders, product leads, data strategists and tech innovators - we want to hear from you!
We’ve teamed up with State of Digital Publishing to launch a benchmark study assessing first-party personalisation, CDP adoption, and audience engagement in digital media and publishing.
This report will:
Benchmark CDP adoption and performance
Reveal emerging personalisation trends
Identify challenges and ‘Aha’ moments from across the industry
Deliver actionable best practices
Give us 5 minutes to complete the survey and we’ll give you early access to the findings to see how you compare.
Harry Potter and the Den of Thieves
Researchers have managed to get Meta's Llama 3.1 70B LLM to reproduce 42% of Harry Potter and the Sorcerer's Stone as a verbatim block. The line big AI is sticking to is that such occurrences are merely "fringe behavior". Well, in this test and in others it wasn't fringe at all. As we argue over copyright, what is "transformative", and what is theft, research like this is invaluable. However, observers point out it may lead to AI companies dropping open models in favour of closed alternatives: Cornell and Stanford researchers could only do their work because they had access to the underlying model.
BBC vs Perplexity: the scrape debate
The BBC is turning up the heat on AI firm Perplexity, threatening legal action over copyright and claiming it scraped content without so much as a how-do-you-do. As expected, Perplexity denies any wrongdoing, but the BBC wants its articles deleted and a cheque in the mail. This is yet another notch in the belt of the growing struggle over AI's appetite for publisher content, and this one's going to get loud, it seems. As it should.
Midjourney "cracks down"
Midjourney's flashy new video tool has wowed users while also dropping red flags. Disney and Universal no doubt loved that it churns out characters such as Toothless and Shrek without much effort, which might be why Midjourney found enlightenment via guardrails which give you a firm "nope" when you ask for copyrighted characters for videos. Err, but you can still get stills of copyrighted characters. While the legal firestorm brews around cartoon dragons and trolls, it will doubtless be asked why it can block one bunch of copyrighted work but not others. Disney lawyers will chomp hard on that, and the rest of us will be waiting for the answer.
Meta won the fight, not the war
Meta scored a legal win as a federal judge tossed out a lawsuit from 13 authors, including Sarah Silverman, who accused Meta of training its AI on their books without permission. The reason: the authors couldn't prove real market harm. Although Meta won, the judge indicated that training on copyrighted material without permission can be unlawful, just not on the evidence in this case. This decision is a boost for AI firms betting on fair use as their legal shield, and a wake-up call for authors and publishers who'll need stronger arguments (and receipts) if they want to win in court.
Future of Media Awards: the time to vote is now
Heads up! Tonight's the final curtain to get your entries for great work in digital media submitted in Press Gazette's Future of Media Awards. Don't miss your chance - get those submissions in before the clock runs out! The deadline is 11:59 PM BST.
Anthropic wins fair use, but faces piracy trial
Anthropic scored a legal win after a US judge ruled that using legally purchased books to train its AI falls under "fair use". It would appear that transforming text into machine smarts counts as creativity now, and that "transformative" question is likely one that the law will return to at a later date. It's certainly too early for Anthropic to celebrate, though, since Judge Alsup also pointed out that stashing seven million pirated books in a "central library" isn't exactly fair play and damages could be awarded. That part's heading to trial in December. Is AI rewriting the rules of copyright, or just testing how far it can bend them? It's not far off to say "both".
SEO is dead, long live SEO!
Some rush to claim SEO is dead, but it's far from it. While AI search is considered a shiny new toy, it's still playing on the same old instruments: structured content, authority, and good indexing. LLMs might answer differently, but they are still pulling from the SEO playbook. SEO detective Lily Ray shares her insight.
Non-profit leaders spill the beans on crisis management
This month at The Institute for Nonprofit News Day (INN), a packed house tuned in for a panel on "Leading Organizations Through Change (And Sometimes Crisis)". Top non-profit news executives from The Texas Tribune, The Intercept, and Resolve Philly spilled hard-earned wisdom on handling leadership shake-ups and controversy. Boardrooms and newsrooms took notes, and Nieman Lab shares what they learned.
UK targets Google with new search regulations
UK competition watchdog the CMA is gearing up to put Google on a tighter leash with "strategic market status." That means users might soon get a menu to pick their favourite search engine, publishers could get a say in how their content shows up in AI answers, and fair play rules might shake up the rankings. The CMA's just getting started - more rules on ads and competition could land by 2026. As expected, Google's already grumbling that all this fuss might slow down new UK product launches.
AI Mode lands in Search Console
Google AI Mode data has finally shown up in Search Console, counting clicks and impressions a bit differently than the norm to keep things interesting. It seems to be a whole new game, and Brodie Clark has the scoop on the rules.
Musk, FTC, and Media Matters: a retaliation showdown
Liberal watchdog Media Matters has filed a lawsuit against the Federal Trade Commission, claiming it's under investigation for reporting on extremist content on X. According to the suit, Elon Musk didn't take kindly to the bad press. Now the watchdog is facing a probe which it says is more about political payback than policy. Grab your popcorn.
Meta under fire for AI labelling failures
Meta's Oversight Board has fired another barrage, calling out the social giant for its patchy approach to labelling AI-manipulated media, especially audio and video. With the company leaning on third parties and stumbling over consistency, questions over transparency are piling up. Meta has 60 days to respond, and its answer might give a hint of the approach we can expect it to take towards deepfakes and doctored content.