Accio lawyers! Microsoft manager trained AI on pirated Potter books
Summary created by Smart Answers AI
In summary:
- PCWorld reports that a Microsoft manager promoted training Azure AI on pirated Harry Potter books through a developer blog post that has since been removed.
- The incident highlights growing legal concerns as authors increasingly sue tech companies for using copyrighted works without permission to train AI systems.
- This case underscores significant ethical challenges in AI development when copyrighted material is improperly used for machine learning training purposes.
Oh, my. With “AI” systems causing a lot of problems pretty much everywhere, it’s a bad look for one of the world’s most important tech companies to actively promote piracy. But that appears to be just what happened: a post hosted on Microsoft’s developer blog used an apparently pirated set of Harry Potter novels to train an Azure-based “AI” system.
“The Harry Potter series, written by J.K. Rowling, is a globally beloved collection of seven books that follow the journey of a young wizard, Harry Potter, and his friends as they battle the dark forces led by the evil Voldemort,” wrote Pooja Kamath, a Microsoft Senior Product Manager. The blog post then pointed to a Kaggle dataset link that contained seven TXT files, apparently encompassing the entire published novel series.
The blog post was a guide to adding generative “AI” to applications via Azure. The manager said that it could be used to create a Q&A system or to auto-generate Harry Potter fan fiction. “This feature is sure to delight Potterheads, allowing them to explore new adventures and create their own magical stories.” The post closes with an LLM-generated image of two children on a train, clearly caricatures of Harry Potter and Ron Weasley, with a Microsoft logo between them.
This is, in technical legalistic terms, a big frickin’ no-no. All of the Harry Potter novels are, of course, held under copyright by various entities around the world, including the author. A quick browse on Amazon shows that a complete collection costs $70 USD in ebook format at the time of writing. Hosting or downloading the files for free without paying any kind of royalty is a crime basically everywhere. Yes, that includes downloading them even if all you intend to do is plug them into a large language model.
The original Microsoft how-to post was published in late 2024 and has been removed from the site (though it’s still available via the Internet Archive). Ditto for the Kaggle dataset, which was mistakenly marked as “public domain” and only downloaded about 10,000 times, according to a report from Ars Technica. Both the blog post and the pirated data set seem to have flown under the radar for a year and a half, until a Hacker News thread yesterday brought new attention to them.
It’s shocking that a Microsoft manager would be so casual about book piracy in a public post on a Microsoft blog (though Kamath may not understand how the public domain system works and may have assumed the files were marked correctly). But the most popular large language models have been trained on millions of ebooks, many (possibly even a majority) of which were downloaded via illegal piracy.
Authors have filed lawsuits against Meta/Facebook, OpenAI, Nvidia, Alphabet/Google, Anthropic, Microsoft, and others, aiming to stop training on copyrighted works and/or seek remuneration for books already incorporated into LLM training without permission. Preliminary results in the courts have been mixed: some rulings have found the results of training models “transformative” and thus substantively different from the core data, i.e., fair use, while others have found that the initial acts of piracy must still be prosecuted.

