Wikipedia Is Making a Dataset for Training AI Because It’s Overwhelmed by Bots

It seems that AI developers have essentially blackmailed Wikipedia into offering up its data for training. On Wednesday, the Wikimedia Foundation announced it is partnering with Google-owned Kaggle, a popular data science community platform, to release a version of Wikipedia optimized for training AI models. Starting with English and French, the foundation will offer stripped-down versions of raw Wikipedia text, excluding any references or markdown code.

Being a non-profit, volunteer-led platform, Wikipedia monetizes through donations and does not own the content it hosts, allowing anyone to use and remix content from the platform. It is fine with other organizations using its vast corpus of knowledge for all sorts of cases. Kiwix, for example, is an offline version of Wikipedia that has been used to smuggle information into North Korea.

But a flood of bots constantly trawling its website for AI training needs has led to a surge in non-human traffic to Wikipedia, something it was interested in addressing. Earlier this month, the foundation said bandwidth consumption has increased 50% since January 2024. Offering a standard, JSON-formatted version of Wikipedia articles should dissuade AI developers from bombarding its website.

“As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data,” Kaggle partnerships lead Brenda Flynn told The Verge. “Kaggle is excited to play a role in keeping this data accessible, available, and useful.”

It is no secret that there is a rising school of thought in the AI industry that all content should be free, and that taking it from anywhere is fair game if it is turned into something entirely new.

But someone has to create the content in the first place, which is not cheap, and AI startups have been all too willing to ignore previously accepted norms. Language models that produce human-like text outputs need to be trained on vast amounts of material, and training data has become something akin to oil in the AI boom. It is well known that the leading models are trained using copyrighted works, and several AI companies remain in litigation over the issue.

Some contributors to Wikipedia may dislike their content being made available for AI training. All writing on the website is licensed under the Creative Commons Attribution-ShareAlike license, which allows anyone to freely share, adapt, and build upon a work, even commercially, as long as they credit the original creator and license their derivative works under the same terms. It is unclear how Wikimedia would ensure AI companies respect these requirements, but Gizmodo has reached out for comment.
