this post was submitted on 26 Nov 2025
403 points (96.8% liked)

Selfhosted

A place to share alternatives to popular online services that can be self-hosted without giving up privacy or locking you into a service you don't control.


Got a warning for my blog going over 100GB in bandwidth this month... which sounded incredibly unusual. My blog is text and a couple images and I haven't posted anything to it in ages... like how would that even be possible?

Turns out it's possible when you have crawlers going apeshit on your server. Am I even reading this right? 12,181 with 181 zeros at the end for 'Unknown robot'? This is actually bonkers.

Edit: As Thunraz points out below, there's a footnote that reads "Numbers after + are successful hits on 'robots.txt' files", so it's not scientific notation.

Edit 2: After doing more digging, the culprit is a post where I shared a few wallpapers for download. The bots have been downloading these wallpapers over and over, using 100GB of bandwidth in the first 12 days of November. That's when my account was suspended for exceeding bandwidth (an artificial limit I put on there a while back and forgot about...); that's also why the 'last visit' for all the bots is November 12th.
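Since the traffic is a handful of crawlers re-downloading the same static files, one common mitigation is to flag known bot user agents and rate-limit the heavy path at the web server. A hedged nginx sketch; the /wallpapers/ location and the bot names are placeholders for whatever the logs actually show, not taken from this post:

```nginx
# In the http{} block: shared per-IP rate-limit zone (10 MB of state, 1 req/s).
limit_req_zone $binary_remote_addr zone=wallpapers:10m rate=1r/s;

# Flag known crawler user agents (names are examples only).
map $http_user_agent $is_bot {
    default                                0;
    ~*(GPTBot|CCBot|Bytespider|AhrefsBot)  1;
}

server {
    location /wallpapers/ {
        # Refuse flagged bots outright, throttle everyone else.
        if ($is_bot) { return 403; }
        limit_req zone=wallpapers burst=5;
    }
}
```

Well-behaved crawlers can also be asked off via robots.txt, but as several commenters note, the worst offenders ignore it.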

42 comments
[–] irmadlad@lemmy.world 14 points 3 days ago

Unknown Robot is your biggest fan.

[–] WolfLink@sh.itjust.works 8 points 2 days ago (1 children)

This is why I use Cloudflare. They block the worst offenders and cache the rest to reduce the load on my server. It’s not 100% but it does help.

[–] irmadlad@lemmy.world 4 points 2 days ago

LOL Someone took exception to your use of Cloudflare. Hilarious. Anyways, yeah, what Cloudflare doesn't get, pfSense does.

[–] hdsrob@lemmy.world 3 points 2 days ago

Had the same thing happen on one of my servers. Got up one day a few weeks ago and the server was suspended (luckily the hosting provider unsuspended it for me quickly).

It's mostly business sites, but we do have an old personal blog on there with a lot of travel pictures on it, and 4 or 5 AI bots were just pounding it. Went from 300GB per month average to 5TB in August, and 10/11 TB in September and October.

[–] Eyekaytee@aussie.zone 8 points 3 days ago

Does your blog have a black hole in it somewhere you forgot about? 😄

[–] Vorpal@programming.dev 5 points 2 days ago (1 children)

What is that log analysis tool you are using in the picture? Looks pretty neat.

[–] benagain@lemmy.ml 4 points 2 days ago (1 children)

It's a mix; I put two screenshots together. On the left is my monthly bandwidth usage from cPanel; on the right is AWStats (though I hid some sections so the Robots/Spiders section was closer to the top).

[–] SlurpingPus@lemmy.world 1 points 2 days ago (1 children)

Awstats

I thought I recognized it. Hell of a blast from the past, haven't seen it in fifteen years at least.

[–] benagain@lemmy.ml 1 points 2 days ago (1 children)

I think they're winding down the project unfortunately, so I might have to get with the times...

[–] SlurpingPus@lemmy.world 1 points 2 days ago* (last edited 2 days ago)

I mean, I thought it was long dead. It's twenty-five years old, and the web has changed quite a bit in that time. No one uses Perl anymore, for starters. I'd moved to Open Web Analytics, Webalizer, or some such by 2008 or so. I remember Webalizer being snappy as heck.

I tinkered with log analysis myself back then, peeping into the source of AWStats and others. Learned that a humongous regexp with like two hundred alternative matches for the user-agent string was way faster than trying to match them individually — which makes sense, seeing as regexps work as state machines in a sort of very specialized VM. My first attempts, in comparison, were laughably naive and slow. Ah, what a time.
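The alternation trick described above can be sketched in Python: one compiled pattern with many alternatives is scanned once per user-agent string, instead of looping over individual patterns. The bot names below are common examples, not AWStats's actual list:

```python
import re

# One compiled pattern with many alternatives: the regex engine's
# state machine scans each user-agent string in a single pass.
BOT_RE = re.compile(
    r"Googlebot|bingbot|YandexBot|Baiduspider|DuckDuckBot|"
    r"AhrefsBot|SemrushBot|GPTBot|CCBot|facebookexternalhit",
    re.IGNORECASE,
)

def classify(user_agent: str) -> str:
    """Return the matched bot name, or 'human/unknown' if none match."""
    m = BOT_RE.search(user_agent)
    return m.group(0) if m else "human/unknown"

print(classify("Mozilla/5.0 (compatible; Googlebot/2.1)"))       # Googlebot
print(classify("Mozilla/5.0 (Windows NT 10.0) Firefox/115.0"))   # human/unknown
```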

Sure enough, working on a high-traffic site taught me that it's far more efficient to prepare data for reading at the moment of change rather than at read time — which translates to analyzing visits on the fly and writing to an optimized store like Elasticsearch.
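The write-time-aggregation idea can be illustrated with a toy stream folder; the field positions assume Apache/nginx combined log format, and a real setup would flush totals to something like Elasticsearch instead of keeping a Counter in memory:

```python
from collections import Counter

def aggregate(log_lines):
    """Fold each combined-log-format line into per-agent byte totals
    as it arrives, so reports read the totals instead of raw logs."""
    bytes_by_agent = Counter()
    for line in log_lines:
        # Naive parse: bytes sent is whitespace field 9, and the user
        # agent is the last double-quoted string on the line.
        parts = line.split()
        try:
            sent = int(parts[9])
        except (IndexError, ValueError):
            continue  # skip malformed lines (or lines logging "-" bytes)
        agent = line.rsplit('"', 2)[-2]
        bytes_by_agent[agent] += sent
    return bytes_by_agent
```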

[–] ohshit604@sh.itjust.works 5 points 2 days ago

I just geo-restrict my server to my country; for certain services I'll run an IP blacklist and only whitelist the few known networks.

Works okay I suppose, kills the need for a WAF, haven’t had any issues with it.
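The allowlist half of this can be sketched with Python's stdlib ipaddress module; the networks below are documentation-reserved placeholders, not anyone's real ranges, and country-level geo-restriction would additionally need a GeoIP database or firewall rules:

```python
import ipaddress

# Hypothetical "known few networks" allowlist (RFC 5737 example ranges).
ALLOWED = [
    ipaddress.ip_network("192.0.2.0/24"),     # e.g. home ISP range
    ipaddress.ip_network("198.51.100.0/24"),  # e.g. work VPN egress
]

def is_allowed(addr: str) -> bool:
    """True if addr falls inside any allowlisted network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in ALLOWED)

print(is_allowed("192.0.2.50"))    # True
print(is_allowed("203.0.113.9"))   # False
```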

[–] Bazell@lemmy.zip 2 points 2 days ago

Looks to me like the work of AI agents.

[–] drkt@scribe.disroot.org 1 points 2 days ago

You have to grow spikes and make it painful for bots to crawl your site. It sucks, and it costs a lot of extra bandwidth for a few months, but eventually they all blacklist your site and leave you alone.
