 

cross-posted from: https://lemmy.world/post/1330512

Below are direct quotes from the filings.

OpenAI

As noted in Paragraph 32, supra, the OpenAI Books2 dataset can be estimated to contain about 294,000 titles. The only “internet-based books corpora” that have ever offered that much material are notorious “shadow library” websites like Library Genesis (aka LibGen), Z-Library (aka B-4ok), Sci-Hub, and Bibliotik. The books aggregated by these websites have also been available in bulk via torrent systems. These flagrantly illegal shadow libraries have long been of interest to the AI-training community: for instance, an AI training dataset published in December 2020 by EleutherAI called “Books3” includes a recreation of the Bibliotik collection and contains nearly 200,000 books. On information and belief, the OpenAI Books2 dataset includes books copied from these “shadow libraries,” because those are the most sources of trainable books most similar in nature and size to OpenAI’s description of Books2.

Meta

Bibliotik is one of a number of notorious “shadow library” websites that also includes Library Genesis (aka LibGen), Z-Library (aka B-ok), and Sci-Hub. The books and other materials aggregated by these websites have also been available in bulk via torrent systems. These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host. For that reason, these shadow libraries are also flagrantly illegal.

This article from Ars Technica covers a few more details. Filings are viewable at the law firm's site here.

 

submitted 2 years ago* (last edited 2 years ago) by ancuuiqter@lemmy.world to c/leftpiracy@lemmygrad.ml
 

cross-posted from: https://lemmy.world/post/274818

tldr: a huge torrent of books straight from the academic publisher De Gruyter

cross-posted from: https://teddit.net/r/DataHoarder/comments/1463ah3/degruyter_collection/

Trying again because reddit filters, let's see if I get it right this time.

After months of scraping I finally finished downloading almost every single degruyter book to which I have access, which are many. And so I created a torrent.

magnet:?xt=urn:btih:76f573241a0126fb1ab0aa5540cc7493c045ae74&dn=Degruyter%20Imprints%20v2%20%5b09-06-23%5d&tr=http%3a%2f%2fatrack.pow7.com%2fannounce&tr=udp%3a%2f%2fopen.stealth.si%3a80%2fannounce&tr=udp%3a%2f%2ftracker.cyberia.is%3a6969%2fannounce&tr=udp%3a%2f%2fretracker.lanta-net.ru%3a2710%2fannounce&tr=udp%3a%2f%2ftracker.moeking.me%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.tiny-vps.com%3a6969%2fannounce&tr=udp%3a%2f%2ftracker.torrent.eu.org%3a451%2fannounce&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=http%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=udp%3a%2f%2fopentra

The content of the torrent is pretty much these https://www.degruyter.com/search?query=*&startItem=0&pageSize=10&sortBy=mostrecent&documentVisibility=all&documentTypeFacet=book&publisherFacet=De+Gruyter%7EDe+Gruyter+Oldenbourg%7EDe+Gruyter+Saur%7EDe+Gruyter+Mouton

Neither the files nor the series are renamed. They all keep the original file names, which is much better since each one can easily be keyed to the book's unique page.

Note: this torrent includes only the above-mentioned degruyter imprints. In the upcoming weeks I will create a second torrent with the degruyter partner publishers: about 100k books from these publishers

https://i.imgur.com/mSKrLto.png And in a few months the last torrent, which will include the degruyter journals.

This endeavour would not have been possible were it not for all the people who granted me academic access, wrote scripts for me, helped me ensure the integrity of the files, and so on. Sadly many files, especially epubs, are corrupted or downright missing at the source. There are also some dupes that, being part of multiple series and subseries, were downloaded twice. Total torrent size is about 2 TB.

https://i.imgur.com/BjqsUqJ.png Of course everything is actively being shared with nexus/annas/libgen as well as private groups and friends and the classicist discord channel. I did not waste months so I could jerk off on the big numbah, therefore I want the files to be shared and reshared as much as possible in order to grant indirect academic access to all those students (me being one) whose universities cannot afford degruyter subscriptions.

Last note: all the files are retail and untouched (according to BIB standards, if anyone here is a member). So if some epub has shitty formatting, blame the publisher.

: fixed magnet url