this post was submitted on 29 Apr 2025
557 points (97.4% liked)

Technology


The one-liner:

dd if=/dev/zero bs=1G count=10 | gzip -c > 10GB.gz

This is brilliant.
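
For scale, here is the same pipeline shrunk to 10 MB so it runs in about a second; the output filename is just illustrative:

```shell
# Scaled-down version of the one-liner above: 10 MB of zeroes instead
# of 10 GB, piped through gzip. Zeroes compress at roughly 1000:1.
dd if=/dev/zero bs=1M count=10 2>/dev/null | gzip -c > 10MB.gz

# gzip -l reports compressed vs. uncompressed sizes; expect the
# compressed file to be on the order of 10 KB.
gzip -l 10MB.gz
```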

[–] Treczoks@lemmy.world 19 points 18 hours ago

Have you ever heard of sparse files, and how Linux and Windows handle zipping them? You'll love this.
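
For anyone who hasn't met them: a sparse file records its length in metadata without allocating blocks for the holes, so a "huge" file can occupy almost no disk. A quick sketch (the filename is made up):

```shell
# Create a 1 GB sparse file: the size is recorded in metadata, but no
# data blocks are allocated for the hole.
truncate -s 1G sparse-demo.img

# Apparent size is 1 GB...
ls -lh sparse-demo.img

# ...but actual disk usage is (nearly) zero.
du -h sparse-demo.img
```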

[–] fmstrat@lemmy.nowsci.com 37 points 1 day ago (3 children)

I've been thinking about making an nginx plugin that randomizes words on a page to poison AI scrapers.

[–] owsei@programming.dev 16 points 15 hours ago (1 children)

There are "AI mazes" that do that.

I remember reading an article about this but haven't found it yet.

[–] corsicanguppy@lemmy.ca 3 points 7 hours ago

The one below, named Anubis, is the one I heard about. Come back to the thread and check the link.

[–] delusion@lemmy.myserv.one 13 points 17 hours ago (1 children)
[–] fmstrat@lemmy.nowsci.com 5 points 13 hours ago

That is a very interesting git repo. Is this just a web view into the actual git folder?

[–] some_guy@lemmy.sdf.org 5 points 1 day ago

If you have the time, I think it's a great idea.

[–] dwt@feddit.org 72 points 1 day ago (3 children)

Sadly about the only thing that reliably helps against malicious crawlers is Anubis

https://anubis.techaro.lol/

[–] spicehoarder@lemm.ee 2 points 6 hours ago

I don't really like this approach, not just because I was flagged as a bot, but because I don't really like captchas. I swear I'm not a bot guys!

[–] alehel@lemmy.zip 28 points 1 day ago (7 children)

That URL is telling me "Invalid response". Am I a bot?

[–] doorknob88@lemmy.world 87 points 1 day ago

I’m sorry you had to find out this way.

[–] moopet@sh.itjust.works 33 points 1 day ago (2 children)

I'd be amazed if this works, since these sorts of tricks have been around since dinosaurs ruled the Earth, and most bots use pretty modern zip libraries which will just return "nope" or throw an exception. That gets treated exactly the same way as any other corrupt file - for example, a site saying it's serving a zip file when the contents are a generic 404 HTML page, which is not uncommon.

Also, be careful because you could destroy your own device? What the hell? No. Unless you're running dd backwards (if/of swapped) and as root, you can't do anything bad, and even then it's the drive contents you overwrite, not the device you "destroy".

[–] namingthingsiseasy@programming.dev 10 points 11 hours ago (1 children)

On the other hand, there are lots of bots scraping Wikipedia even though it's easy to download the entire website as a single archive.

So they're not really that smart....

[–] Lucien@mander.xyz 4 points 14 hours ago

Yeah, this article came across as if written by a complete beginner. They mention having their WordPress hacked, but failed to admit it was because they didn't upgrade the install.

[–] Bishma@discuss.tchncs.de 102 points 2 days ago (1 children)

When I was serving high volume sites (that were targeted by scrapers) I had a collection of files in CDN that contained nothing but the word "no" over and over. Scrapers who barely hit our detection thresholds saw all their requests go to the 50M version. Super aggressive scrapers got the 10G version. And the scripts that just wouldn't stop got the 50G version.

It didn't move the needle on budget, but hopefully it cost them.
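
The decoy files described above are trivial to reproduce at a small scale; the size and name here are made up, not the originals:

```shell
# A file that is nothing but "no" repeated, gzipped so the CDN serves a
# few kilobytes while the scraper inflates it back to the full size.
yes no | head -c 1048576 | gzip -c > no-1M.gz   # 1 MB of "no\n"

# Compressed size vs. what the scraper has to expand:
gzip -l no-1M.gz
```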

[–] sugar_in_your_tea@sh.itjust.works 31 points 1 day ago (1 children)

How do you tell scrapers from regular traffic?

[–] Bishma@discuss.tchncs.de 62 points 1 day ago (4 children)

Most often because they don't download any of the CSS or external JS files from the pages they scrape. But there are a lot of other patterns you can detect once you have their traffic logs loaded in a time series database. I used an ELK stack back in the day.
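
That heuristic is easy to sketch over an access log with awk; the log format and addresses below are invented for illustration:

```shell
# Toy access log: client IP and request path (real logs have more fields).
cat > access.log <<'EOF'
203.0.113.5 GET /index.html
203.0.113.5 GET /style.css
203.0.113.5 GET /app.js
198.51.100.9 GET /index.html
198.51.100.9 GET /about.html
198.51.100.9 GET /posts/1.html
EOF

# Flag IPs that fetch pages but never a .css or .js asset: browsers
# request both, naive scrapers only request the HTML.
awk '{
    if ($3 ~ /\.(css|js)$/) asset[$1] = 1
    else pages[$1]++
}
END {
    for (ip in pages) if (!(ip in asset)) print ip, "looks like a scraper"
}' access.log
```

Only 198.51.100.9 gets flagged here, since it never touched an asset.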

[–] arc@lemm.ee 26 points 1 day ago (5 children)

Probably only works for dumb bots and I'm guessing the big ones are resilient to this sort of thing.

Judging from recent stories the big threat is bots scraping for AIs and I wonder if there is a way to poison content so any AI ingesting it becomes dumber. e.g. text which is nonsensical or filled with counter information, trap phrases that reveal any AIs that ingested it, garbage pictures that purport to show something they don't etc.

[–] frezik@midwest.social 35 points 1 day ago (1 children)

When it comes to attacks on the Internet, doing simple things to get rid of the stupid bots means kicking 90% of attacks out. No, it won't work against a determined foe, but it does something useful.

Same goes for setting SSH to a random port. Logs are so much cleaner after doing that.

[–] airgapped@piefed.social 14 points 20 hours ago (1 children)

Setting a random SSH port and limiting it to 3/min saw failed login attempts fall by 99% and jailed IPs fall to 0.
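
A setup along those lines can be expressed as a fail2ban jail; the port and limits below are illustrative, not the commenter's actual values:

```
# fail2ban jail.local sketch: ban an IP for an hour after 3 failed
# attempts within one minute. Port and durations are illustrative.
[sshd]
enabled  = true
port     = 2222
maxretry = 3
findtime = 60
bantime  = 3600
```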

[–] WFloyd@lemmy.world 4 points 8 hours ago

I've found great success using a hardened ssh config with a limited set of supported Ciphers/MACs/KexAlgorithms. Nothing ever gets far enough to even trigger fail2ban. Then of course it's key-only login from there.
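
For reference, a hardened sshd_config along those lines might look like this; the algorithm lists are a plausible modern subset, not the commenter's actual config, so check `ssh -Q cipher`, `ssh -Q mac`, and `ssh -Q kex` against your own OpenSSH version:

```
# Key-only login
PasswordAuthentication no
KbdInteractiveAuthentication no
PubkeyAuthentication yes

# Restrict negotiable algorithms to a small modern set
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
KexAlgorithms sntrup761x25519-sha512@openssh.com,curve25519-sha256
```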

[–] echodot@feddit.uk 8 points 1 day ago* (last edited 1 day ago)

I don't know as to poisoning AI, but one thing that I used to do was to redirect any suspicious bots or ones that were hitting their server too much to a simple html page with no JS or CSS or forward links. Then they used to go away.

[–] mostlikelyaperson@lemmy.world 11 points 1 day ago (1 children)

There have been some attempts in that regard, I don’t remember the names of the projects, but there were one or two that’d basically generate a crapton of nonsense to do just that. No idea how well that works.

[–] palordrolap@fedia.io 113 points 2 days ago (4 children)

The article writer kind of complains that they're having to serve a 10MB file, which is the result of the gzip compression. If that's a problem, they could switch to bzip2. It's available pretty much everywhere that gzip is available and it packs the 10GB down to 7506 bytes.

That's not a typo. bzip2 is way better with highly redundant data.
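
The difference is easy to check at a smaller scale (10 MB of zeroes instead of 10 GB; the bzip2 step is guarded in case it isn't installed):

```shell
# 10 MB of zeroes as test input.
dd if=/dev/zero bs=1M count=10 2>/dev/null > zeroes.bin

# gzip: roughly 10 KB (about 1000:1 on zeroes).
gzip -c zeroes.bin | wc -c

# bzip2: well under a kilobyte, thanks to its run-length front end.
if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -c zeroes.bin | wc -c
fi
```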

[–] Xanza@lemm.ee 3 points 9 hours ago* (last edited 9 hours ago) (1 children)

zstd is a significantly better option than anything else available unless you need something specific for a specific reason: https://github.com/facebook/zstd?tab=readme-ov-file#benchmarks

LZ4 is likely better than zstd, but it doesn't have wide usability yet.

[–] palordrolap@fedia.io 1 points 8 hours ago (1 children)

You might be thinking of lzip rather than lz4. Both compress, but the former is meant for high compression whereas the latter is meant for speed. Neither are particularly good at dealing with highly redundant data though, if my testing is anything to go by.

Either way, none of those are installed as standard in my distro. xz (which is lzma based) is installed as standard but, like lzip, is slow, and zstd is still pretty new to some distros, so the recipient could conceivably not have that installed either.

bzip2 is ancient and almost always available at this point, which is why I figured it would be the best option to stand in for gzip.

As it turns out, the question was one of data streams not files, and as at least one other person pointed out, brotli is often available for streams where bzip2 isn't. That's also not installed by default as a command line tool, but it may well be that the recipient, while attempting to emulate a browser, might have actually installed it.

[–] Xanza@lemm.ee 1 points 5 hours ago

No. https://github.com/lz4/lz4

LZ4 already has a caddy layer which interprets and compresses data streams for caddy: https://github.com/mholt/caddy-l4

It's also very impressive.

[–] just_another_person@lemmy.world 95 points 2 days ago* (last edited 1 day ago) (2 children)

I believe he's returning a gzip HTTP response stream, not just a file payload that the requester then downloads and decompresses.

Bzip isn't used in HTTP compression.
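
That's the key point: with a Content-Encoding header, the client inflates the payload itself as part of handling the response. A hedged nginx sketch of serving a pre-compressed file that way (the location name and path are invented):

```
location = /trap {
    gzip off;                          # don't re-compress
    default_type text/html;
    add_header Content-Encoding gzip;  # client inflates on receipt
    alias /var/www/traps/10GB.gz;
}
```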

[–] bss03@infosec.pub 3 points 7 hours ago

For scrapers that are not just speaking HTTP but are also trying to extract zip files, you can possibly drive them insane with zip quines: https://github.com/ruvmello/zip-quine-generator - or other compressed files that contain themselves at some level of nesting, possibly with other data, so that they recursively expand to an unbounded ("infinite") size.

[–] sugar_in_your_tea@sh.itjust.works 30 points 1 day ago* (last edited 1 day ago)

Brotli is an option, and it's comparable to Bzip. Brotli works in most browsers, so hopefully these bots would support it.

I just tested it, and a 10G file full of zeroes is only 8.3K compressed. That's pretty good, though a little bigger than BZip.

[–] sugar_in_your_tea@sh.itjust.works 27 points 1 day ago (3 children)

Brotli gets it to 8.3K, and is supported in most browsers, so there's a chance scrapers also support it.

[–] lemmylommy@lemmy.world 71 points 2 days ago (6 children)

Before I tell you how to create a zip bomb, I do have to warn you that you can potentially crash and destroy your own device.

LOL. Destroy your device, kill the cat, what else?

[–] archonet@lemy.lol 45 points 2 days ago

destroy your device by... having to reboot it. the horror! The pain! The financial loss of downtime!

[–] aesthelete@lemmy.world 25 points 1 day ago* (last edited 1 day ago) (1 children)

This reminds me of shitty FTP sites with ratios when I was on dial-up. I used to push them files full of null characters with filenames that looked like actual content. The modem would compress the upload as it transmitted it which allowed me to upload the junk files at several times the rate of a normal file.
