this post was submitted on 24 Jun 2025
622 points (98.9% liked)

Technology

[–] booly@sh.itjust.works 9 points 1 hour ago

It took me a few days to get the time to read the actual court ruling but here's the basics of what it ruled (and what it didn't rule on):

  • It's legal to scan physical books you already own and keep a digital library of those scanned books, even if the copyright holder didn't give permission. And even if you bought the books used, for very cheap, in bulk.
  • It's legal to keep all the book data in an internal database for use within the company, as a central library of works accessible only within the company.
  • It's legal to prepare those digital copies for potential use as training material for LLMs, including recognizing the text, performing cleanup on scanning/recognition errors, categorizing and cataloguing them to make editorial decisions on which works to include in which training sets, tokenizing them for the actual LLM technology, etc. This remains legal even for the copies that are excluded from training for whatever reason, as the entire bulk process may involve text that ends up not being used, but the process itself is fair use.
  • It's legal to use that book text to create large language models that power services that are commercially sold to the public, as long as there are safeguards that prevent the LLMs from publishing large portions of a single copyrighted work without the copyright holder's permission.
  • It's illegal to download unauthorized copies of copyrighted books from the internet, without the copyright holder's permission.

Here's what it didn't rule on:

  • Is it legal to distribute large chunks of copyrighted text through one of these LLMs, such as when a user asks a chatbot to recite an entire copyrighted work that is in its training set? (The opinion suggests that it probably isn't legal, and relies heavily on the dividing line of how Google Books does it, by scanning and analyzing an entire copyrighted work but blocking users from retrieving more than a few snippets from those works).
  • Is it legal to give anyone outside the company access to the digitized central library assembled by the company from printed copies?
  • Is it legal to crawl publicly available digital data to build a library from text already digitized by someone else? (The answer may matter depending on whether there is an authorized method for obtaining that data, or whether the copyright holder refuses to license that copying).

So it's a pretty important ruling, in my opinion. It's a clear green light to the idea of digitizing and archiving copyrighted works without the copyright holder's permission, as long as you own a legal copy in the first place. And it's a green light to using copyrighted works for training AI models, as long as you compiled that database of copyrighted works in a legal way.

[–] Fizz@lemmy.nz 21 points 14 hours ago

Judge, I'm pirating them to train AI, not to consume for my own personal use.

[–] Randomgal@lemmy.ca 35 points 18 hours ago (1 children)

You're poor? Fuck you you have to pay to breathe.

Millionaire? Whatever you want daddy uwu

[–] eestileib@lemmy.blahaj.zone 2 points 1 hour ago

That's kind of how I read it too.

But as a side effect it means you're still allowed to photograph your own books at home as a private citizen if you own them.

Prepare to never legally own another piece of media in your life. 😄

[–] DFX4509B_2@lemmy.org 9 points 16 hours ago* (last edited 16 hours ago) (2 children)

Good luck breaking down people's doors for scanning their own physical books for their personal use when analog media has no DRM and can't phone home, and paper books are an analog medium.

That would be like kicking down people's doors for needle-dropping their LPs to FLAC for their own use and to preserve the physical records as vinyl wears down every time it's played back.

[–] Bob_Robertson_IX@discuss.tchncs.de 4 points 12 hours ago (1 children)

It sounds like transferring an owned print book to digital and using it to train AI was deemed permissible. But downloading a book from the Internet and using it as training data is not allowed, even if you later purchase the pirated book. So, no one will be knocking down your door for scanning your books.

This does raise an interesting case where libraries could end up training and distributing public domain AI models.

[–] restingboredface@sh.itjust.works 1 points 8 minutes ago

I would actually be okay with libraries having those AI services. Even if they were available only for a fee it would be absurdly low and still waived for people with low or no income.

[–] booly@sh.itjust.works 1 points 12 hours ago (1 children)

The ruling explicitly says that scanning books and keeping/using those digital copies is legal.

The piracy found to be illegal was downloading unauthorized copies of books from the internet for free.

[–] deltapi@lemmy.world 1 points 10 hours ago (1 children)

I wonder if the archive.org cases had any bearing on the decision.

[–] booly@sh.itjust.works 2 points 48 minutes ago

Archive.org was distributing the books themselves to users. Anthropic argued (and the authors suing them weren't able to show otherwise) that their software prevents users from actually retrieving books out of the LLM, and that it only will produce snippets of text from copyrighted works. And producing snippets in the context of something else is fair use, like commentary or criticism.

[–] MTK@lemmy.world 19 points 18 hours ago (2 children)

Check out my new site TheAIBay, you search for content and an LLM that was trained on reproducing it gives it to you, a small hash check is used to validate accuracy. It is now legal.

[–] booly@sh.itjust.works 1 points 1 hour ago

The court's ruling explicitly depended on the fact that Anthropic does not allow users to retrieve significant chunks of copyrighted text. It used the entire copyrighted work to train the weights of the LLMs, but is configured not to actually copy those works out to the public user. The ruling says that if the copyright holders later develop evidence that it is possible to retrieve entire copyrighted works, or significant portions of a work, then they will have the right to sue over those facts.

But the facts before the court were that Anthropic's LLMs have safeguards against distributing copies of identifiable copyrighted works to its users.

[–] nodiratime@lemmy.world 4 points 18 hours ago* (last edited 18 hours ago) (2 children)

Does it "generate" a 1:1 copy?

[–] MTK@lemmy.world 2 points 9 hours ago

You can train an LLM to generate 1:1 copies

[–] y0kai@lemmy.dbzer0.com 12 points 20 hours ago (1 children)

Sure, if you purchase your training material, it's not a copyright infringement to read it.

We needed a judge for this?

[–] excral@feddit.org 14 points 19 hours ago

Yes, because just because you bought a book you don't own its content. You're not allowed to print and/or sell additional copies or publicly post the entire text. Generally it's difficult to say where the limit is of what's allowed. Citing a single sentence in a public posting is most likely fine, citing an entire paragraph is probably fine, too, but an entire chapter would probably be pushing it too far. And when in doubt a judge must decide how far you can go before infringing copyright. There are good arguments to be made that just buying a book doesn't grant the right to train commercial AI models with it.

[–] SaharaMaleikuhm@feddit.org 36 points 1 day ago (3 children)

But I thought they admitted to torrenting terabytes of ebooks?

[–] FaceDeer@fedia.io 15 points 21 hours ago

That part is not what this preliminary judgment is about. The torrenting part is going to go to an actual trial. This part was about the Authors' claim that the act of training AI itself violated copyright, and this is what the judge has found to be incorrect.

[–] antonim@lemmy.dbzer0.com 13 points 1 day ago (1 children)

Facebook (Meta) torrented TBs from Libgen, and their internal chats leaked so we know about that, and IIRC they've been sued. Maybe you're thinking of that case?

[–] ScoffingLizard@lemmy.dbzer0.com 2 points 4 hours ago

Billions of dollars, and they can't afford to buy ebooks?

[–] isVeryLoud@lemmy.ca 38 points 1 day ago* (last edited 1 day ago) (25 children)

Gist:

What’s new: The Northern District of California has granted a summary judgment for Anthropic that the training use of the copyrighted books and the print-to-digital format change were both “fair use” (full order below box). However, the court also found that the pirated library copies that Anthropic collected could not be deemed as training copies, and therefore, the use of this material was not “fair”. The court also announced that it will have a trial on the pirated copies and any resulting damages, adding:

“That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages.”

[–] yournamehere@lemm.ee 8 points 22 hours ago (5 children)

i will train my jailbroken kindle too...display and storage training... i'll just libgen them...no worries...it is not piracy

[–] minorkeys@lemmy.world 5 points 20 hours ago* (last edited 20 hours ago)

Of course we have to have a way to manually check the training data, in detail, as well. Not reading the book, I'm just verifying training data.

[–] vane@lemmy.world 22 points 1 day ago* (last edited 1 day ago) (12 children)

OK, so you can buy books and scan them, or buy ebooks, and use them for AI training, but you can't just download pirated books from the internet to train AI. Did I understand that correctly?
