Cloudflare offers one-click solution to block AI bots

Alfonso Maruccia

Why it matters: There is a growing consensus that generative AI has the potential to make the open web much worse than it was before. Currently, all big tech corporations and AI startups rely on scraping as much original content as they can off the web to train their AI models. The problem is that an overwhelming majority of websites aren't cool with that, nor have they given permission for it. But hey, just ask Microsoft's AI CEO, who believes content on the open web is "freeware."

Just this past week, a report from Akamai reconfirmed that bots make up an enormous share of overall web traffic, and that AI is making things much easier for cybercriminals and dishonest ventures.

Websites and content creators using content delivery and firewall services provided by Cloudflare now have an additional, easy-to-use solution to curb Big Tech's ability to unleash their bots and scrape web content without explicit authorization.

Most popular AI companies, like OpenAI, have started to provide a way to block their crawling bots through custom rules that can be added to a robots.txt file on the server. However, these rules only work when the bot has been designed to actually follow them. The problem is that 1) not all companies are willing to honor robots.txt directives, and 2) many AI companies had already scraped everything they could before offering this "opt out." Cloudflare says that an overwhelming majority of its customers, as much as 85 percent, have already opted to block AI bots this way.
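To make the robots.txt opt-out concrete, here is a minimal Python sketch using the standard library's robots.txt parser. The rules block OpenAI's GPTBot from the whole site and leave everyone else alone. The example.com URL and the `is_allowed` helper are assumptions made up for illustration, and the check only models what a well-behaved crawler would do: a bot that ignores robots.txt is not stopped by any of this.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: block GPTBot everywhere, allow all other agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a rule-following crawler with this user-agent token
    would consider itself permitted to fetch the URL (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"  # placeholder URL
    print("GPTBot allowed:", is_allowed("GPTBot", url))
    print("Generic browser allowed:", is_allowed("Mozilla/5.0", url))
```

The point of the sketch is that the decision is made by the crawler itself, which is exactly why the opt-out depends on the bot's good faith.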

The new one-click solution offered by Cloudflare is available to both free and paying customers, and it can seemingly put up an effective fight against AI bots that don't follow robots.txt rules. Cloudflare can identify bots and create individual fingerprints for each one, and it vows to automatically update its fingerprint database over time.

As one of the largest CDNs on the internet, Cloudflare can extrapolate data from over 57 million network requests per second on average.

The company put together a list of the most active AI bots pillaging today's web, with Bytespider, GPTBot, and ClaudeBot being the three largest ones by share of websites accessed. Bytespider is operated by Chinese company and TikTok owner ByteDance, and is likely using content scraped from 40% of Cloudflare-protected websites to train its large language models.

GPTBot is accessing 35 percent of websites and is collecting data to train ChatGPT and other generative AI services offered by OpenAI. ClaudeBot has recently increased its request volume up to 11%, Cloudflare says, and is used to train the namesake family of LLMs developed by Anthropic.

While these well-known bots should be easier to identify through a static analysis effort, Cloudflare can also detect bots pretending to be real people browsing the web.
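As a rough illustration of why self-declared bots are the easy case, the Python sketch below does a simple static check of the User-Agent header against the three crawler names mentioned above. This is not Cloudflare's method, just an assumed minimal approach; the `match_ai_bot` helper and the sample header strings are hypothetical.

```python
import re
from typing import Optional

# Crawler names taken from the article; a real list would be much longer.
KNOWN_AI_BOTS = ("Bytespider", "GPTBot", "ClaudeBot")

# Case-insensitive pattern that matches any of the known names.
_BOT_PATTERN = re.compile(
    "|".join(re.escape(name) for name in KNOWN_AI_BOTS), re.IGNORECASE
)

# Map lowercase matches back to the canonical spelling.
_CANONICAL = {name.lower(): name for name in KNOWN_AI_BOTS}


def match_ai_bot(user_agent: str) -> Optional[str]:
    """Return the canonical bot name if the header declares one, else None."""
    found = _BOT_PATTERN.search(user_agent)
    if found is None:
        return None
    return _CANONICAL[found.group(0).lower()]


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",            # declares itself
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/126.0",  # looks like a browser
    ]
    for ua in samples:
        print(ua, "->", match_ai_bot(ua))
```

A scraper that simply sends a browser-like User-Agent passes this check untouched, which is the gap that behavior-based detection is meant to close.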

The company developed its own global machine learning model and is essentially using AI technology to recognize AI bots pretending to be something else. Cloudflare said its model was able to "appropriately flag traffic" coming from evasive AI bots, and it will be used to detect new scraping tools and fake bots in the future without needing to generate a new bot fingerprint first.

 
Meanwhile, the silence from our government authorities is deafening. POS companies like MS, Google, Meta, OpenAI, Musk (I don't know what his AI company is called), Adobe to name a few operate with impunity it seems. AI is running amok and our clown governments that dropped the ball on allowing tech companies to dominate the world, are allowing those same scumbags to dominate AI and write all the rules.
 
I personally think this just makes any kind of piracy against these companies fair game. An eye for an eye and all that.
 
Cloudflare is a garbage service that often blocks legit users from accessing sites "protected" with it. Should be avoided at all costs.
 