What just happened? The use of copyrighted material to train AI has become a hot-button issue, with experts divided on whether it constitutes theft or a legitimate form of study akin to artistic training. Microsoft's top AI executive thought it would be a good idea to add fuel to the fire by making some bold claims about what companies can legally do with online content when training their AI systems.
Mustafa Suleyman, who's been heading Microsoft's AI efforts since March, told CNBC in an interview that material published openly on the web essentially becomes "freeware" that anyone can copy and use as they please.
"I think that with respect to content that's already on the open web, the social contract of that content since the '90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it," he stated. "That has been 'freeware,' if you like, that's been the understanding."
That's certainly a spicy take, and an inaccurate one. You only need to look at the FAQ page from the US Copyright Office, where one answer states that "your work is under copyright protection the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device."
The same FAQ adds that you do not even need to register "to be protected." The only time registration is needed is when you wish to file a lawsuit for infringement. In other words, content on the open web is copyrighted by default, and fair use is a legal doctrine that courts apply case by case. It doesn't come from any "social contract" as Suleyman suggests.
Suleyman did seemingly acknowledge the importance of the robots.txt file, stating that a website explicitly saying "do not scrape or crawl" might make scraping a "grey area." But honoring this basic protocol, which tells web crawlers which parts of a site to stay out of, is a matter of courtesy, not something that needs to "work its way through the courts," as he suggested.
Not surprisingly, even robots.txt is being ignored by various AI companies including Anthropic, Perplexity, and OpenAI.
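For readers unfamiliar with how the protocol works, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleAIBot` user agent, the `example.com` URLs, and the `may_fetch` helper are all hypothetical and chosen purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shuts out one named AI crawler entirely,
# and asks every other crawler to stay out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("SomeOtherBot", "https://example.com/articles/1"),
        ("SomeOtherBot", "https://example.com/private/notes"),
    ]:
        verdict = "allowed" if may_fetch(ROBOTS_TXT, agent, url) else "disallowed"
        print(f"{agent} -> {url}: {verdict}")
```

Note that nothing here stops a crawler from skipping the check altogether. The file only states the site owner's wishes, and compliance is left to whoever runs the bot.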
This isn't the first time an executive working on AI advancement has made controversial claims. A big reason such statements keep surfacing is likely that, more than a year after ChatGPT's launch, the legal ground around training data and copyright is still being mapped out.
Microsoft and partner OpenAI are indeed facing multiple lawsuits from publishers over allegations of using copyrighted online articles to train their powerful language models without permission. However, these cases have yet to reach final resolutions that could provide more legal clarity.
Suleyman's statements reflect the view that AI scraping the internet is similar to how artists have always studied great works while learning their craft. "What are we, collectively, as an organism of humans, other than a knowledge and intellectual production engine?" he mused in the same interview.
However, the difference between AI and artists is that only one is capable of ingesting and regurgitating the world's content into profitable AI products and services on an unprecedented scale.