The rise of the robot gangsters

Content-harvesting robo-bandits are demolishing long-held standards on the open internet.

Jun 27, 2024

During the westward expansion of the United States, the various governmental authorities in charge were at times simply reactive in their decision-making. Settlers often created the facts on the ground before anyone in authority even knew of them, and land ended up being recategorised retrospectively. If there was a gold rush, then all bets were off.

Something similar seems to be occurring with greater frequency, as data-hungry Gen-AI businesses violate a basic protocol, the Robots Exclusion Protocol, which has helped keep the internet reasonably honest for years.
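For anyone who has never looked inside one, a robots.txt file is just a short list of rules that a well-behaved crawler reads before fetching anything. The sketch below shows roughly what that check looks like, using Python's standard-library urllib.robotparser. The file contents, the bot name "ExampleAIBot" and the example.com URLs are all made up for illustration, not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shuts out one named AI crawler entirely,
# and asks every other bot to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text supplied as a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch this URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# The named AI bot is asked to stay away from everything.
print(may_fetch(parser, "ExampleAIBot", "https://example.com/news/story"))

# Any other bot may read the story, but not the /private/ area.
print(may_fetch(parser, "SomeOtherBot", "https://example.com/news/story"))
print(may_fetch(parser, "SomeOtherBot", "https://example.com/private/notes"))
```

Nothing here enforces anything. The check only matters if the crawler bothers to run it and then respects the answer, which is exactly the point at issue.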

After I wrote previously about the suspect crawling activities of "AI answers" company Perplexity, as revealed by Wired, Reuters soon reported that, according to research from a business aiming to insert itself between publishers and AI companies, plenty of others are ignoring specific crawling exclusions in the robots.txt file in order to harvest data for LLM training purposes.

A caveat here: the company that says it has proof of this, TollBit, is "positioning itself as a matchmaker between content-hungry AI companies and publishers open to striking licensing deals with them". So it has a dog in the fight, as it were, but I'm willing to entertain that it might be a dog publishers can harness.

So, like the Wild West, we're seeing a land grab. Except the land is occupied by original content producers such as publishers. Are we to expect some retrospective legal rulings about content theft that change nothing after the fact?

“It’s complicated…”

Perplexity's CEO Aravind Srinivas has given an interview with Fast Company in which he attempted to explain away the concerns about content harvesting, telling the publication that the "mysterious web crawler that Wired identified was not owned by Perplexity, but by a third-party provider of web crawling and indexing services".

So that's ok then? It's not our burglar giving us the stuff. "It's complicated," said Srinivas. You bet it is. Perplexity have this week announced investment from SoftBank that values them at $3 billion.

Srinivas then took the position of a Formula One race team boss who has been discovered using a triple-baffle unobtainium centrifugal redistributor not specifically banned by the rulebook, telling Fast Company that the Robots Exclusion Protocol is "not a legal framework" and suggesting that the emergence of AI requires a new kind of working relationship between content creators, or publishers, and sites like his.

Reddit have moved in the past few days to update their own robots.txt file. A spokesman told TechCrunch that "bots and crawlers will be rate-limited or blocked if they don’t abide by Reddit’s Public Content Policy and don’t have an agreement with the platform".

As a reminder, Reddit has cut a $60 million deal with Google to allow it to train its AI models on Reddit's user generated content. So it's extra motivated to turn things legal should anyone violate the terms of its Public Content Policy. (How embarrassing it would be if Google is found to have done the same thing itself prior to the agreement, or elsewhere!)

Again, most publishers don't have such legal recourse, so maybe there is a slot for specialist outfits like TollBit to cut through the crap and get us all a better deal?

Scraping the bottom line
Revelations that Perplexity.ai ignored robots.txt file instructions to scrape content have flushed out allegations that plenty of other AI firms have been up to it too when pinching your stories, articles and insight. Reputationally bad, and perhaps catastrophic for future court judgements. It is also likely to embolden discussion over assigning legal validity to robots.txt files.

Reddit's edit
With perfect timing, Reddit amends its robots.txt to kindly ask those sneaky AI bots to shut their eyes. Given its recent exclusive $60m+ deal with Google to allow AI training on Reddit content, and nobody being better placed than Google to spot breaches, we're keen to see what actions are taken against miscreants and how quickly formal lobbying to legislate against robbing robots begins, if it hasn't already.

Corbidge comments... on bad robots and wild west tactics
Ignoring long-held open web protocols looks like another case of move fast and steal things, says our resident content computer Rob. Will the law ever catch up?

Suited and looted
Press Gazette's latest report into who is suing AI companies, and who is taking the cash in deals.

AP's $100m local news push
AP expands its valiant efforts to bolster local media and news initiatives, establishing a non-profit sister organisation tasked with raising $100m for local news ventures.

Is creativity a problem to be solved?
OpenAI's CTO Mira Murati likely has to watch her every word, given the company's beacon role in any discussion around the wider impact of AI. However, comments about AI replacing creative roles did little to make friends in the sector. Rather than join in a predictable dogpile over a slipped comment, it's worth watching the whole interview before making up your mind that OpenAI doesn't really know, or seem overly concerned about, what impact its tech will have.

State of panic
In the related news box: AI start-ups fear a California Bill tabled for an August vote could simultaneously throttle sector development in the state - arguably Ground Zero for global AI dev right now - as well as reinforce the lead of well-heeled incumbents such as OpenAI and Google.

Look the other way
Facebook feeds for many now resemble unmoderated wastelands of AI slop, dodgy pics, and scams, prompting the inevitable question: has the platform given up on moderation? This deep dive suggests it has, and that other major platforms have too.

Game of clicks
A good read on what can happen at sites that all simultaneously realise a phenomenon is happening in front of their eyes. Both a cautionary tale of algorithm-inspired content uniformity, and a lesson in the power of spotting a trend and keeping ahead of it.

Clone rangers
On a similar theme to robots.txt being ignored, here is the danger of assuming the text in a T&C checkbox is anything more binding than a child's note explaining they are excused from homework for the rest of the year. Behold the startlingly low bar to having your voice cloned.

Sounds familiar
Musicians and music companies take a lead from publishers in doing deals with AI heavyweights to stem the tide of theft and/or make a quick buck. As with publishers, only time will tell if it is a deal with the devil, or a clever hedge against the competition in their own sector.

No Apple for EU
Apple Intelligence features might be caught at customs for EU users, with a number of new iPhone gimmicks unlikely to make it to users in the bloc due to Digital Markets Act smallprint.

PS… come check out our new site… you know you want to. How else will you discover the new hue “Pigeon Blue”?