this post was submitted on 19 May 2026
329 points (96.3% liked)

Technology

[–] Xaphanos@lemmy.world 31 points 1 day ago (5 children)

I'd really like to know how they handle all the small-scale HW issues. As a DC tech, I'm kept quite busy with those.

[–] username_1@programming.dev 20 points 1 day ago (1 children)

I bet they duplicate everything and just switch off faulty units. Every year or so, they would bring the whole thing to the surface and replace what they need at a large scale.

[–] frongt@lemmy.zip 7 points 1 day ago (1 children)

Sounds expensive. I'm betting they just abandon it and sink a new one with new, faster hardware.

I'm thinking they replace the module every so often when it fails.

[–] clay_pidgin@sh.itjust.works 12 points 1 day ago* (last edited 1 day ago) (1 children)

There was an Intel experiment a while back where they left a bunch of racks in the parking lot. They found that the failure rate wasn't much higher than inside, and not needing a data center building saved money. Maybe this project just accepts the eventual failure of components.

[–] snooggums@piefed.world 10 points 1 day ago (2 children)

Sure, if there is zero weather a building wouldn't be needed.

[–] rob_t_firefly@lemmy.world 18 points 1 day ago* (last edited 1 day ago)

Computers can't get wet from the rain if they're underwater in the ocean.

👉😏

[–] clay_pidgin@sh.itjust.works 6 points 1 day ago

They did get rained on. I am having trouble finding an article about it now.

[–] Pieisawesome@lemmy.dbzer0.com 4 points 1 day ago (1 children)

Larger DCs don’t replace individual components; they wait until a certain percentage of the servers in a rack have failed, then replace the rack or the failed servers.

They will likely adopt this same model.
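A rough sketch of that threshold idea in Python, in case it helps picture it. Everything here is made up for illustration: the `Rack` class, the `racks_to_replace` helper, and the 25% threshold are assumptions, not numbers from any real operator.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    """Hypothetical record of one rack's health."""
    name: str
    total_servers: int
    failed_servers: int = 0

    def failure_ratio(self) -> float:
        # Fraction of servers in this rack that are dead.
        if self.total_servers == 0:
            return 0.0
        return self.failed_servers / self.total_servers


def racks_to_replace(racks, threshold=0.25):
    """Return names of racks whose failed fraction is at or above the threshold.

    The 0.25 default is an assumed value, chosen only for the example.
    """
    return [r.name for r in racks if r.failure_ratio() >= threshold]


racks = [
    Rack("rack-a", total_servers=40, failed_servers=3),
    Rack("rack-b", total_servers=40, failed_servers=10),
    Rack("rack-c", total_servers=40, failed_servers=18),
]
flagged = racks_to_replace(racks)
print(flagged)
```

Dead servers below the threshold just stay powered off in place, which is what makes this cheaper than sending a tech out for every single failure.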

[–] frightful5680@lemmy.world 9 points 1 day ago

They probably have diver IT techs... boy, that's kind of cool.

[–] brucethemoose@lemmy.world 7 points 1 day ago* (last edited 1 day ago)

They’re probably stacks of 8x NPU Huawei servers all cooperatively serving the same few models.

As an older example, I believe DeepSeek V3 was served most efficiently with ~384 GPUs in a single cluster, before they switched to Chinese NPUs. So they’d have some software that ties all these together as one “server” and maybe multiple of those all serving API requests for one endpoint.

But it doesn’t actually need all 384 in each server. Many models will fit in a single 8-GPU/NPU server, but the software pools more just to try and utilize the hardware better.

If one server fails, the system would return a few requests as empty and have to restart the serving software, but… that’s fine. All the data is ephemeral. Even if the whole 24MW unit fails, they can just route API requests somewhere else, and a few failed generations aren’t a big deal.
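The "route somewhere else" part could look something like this minimal Python sketch. It's only an assumption about the general shape, not how any real serving stack does it: `route_request`, `NoHealthyCluster`, and the cluster names are all hypothetical, and a down cluster is simulated with a plain `ConnectionError`.

```python
class NoHealthyCluster(Exception):
    """Raised when every cluster in the list failed (hypothetical error type)."""


def route_request(prompt, clusters):
    """Try each (name, serve_fn) pair in order and return the first success.

    Nothing is retried on the failed cluster and no state is recovered,
    because the requests are ephemeral: the caller just gets an answer
    from whichever cluster is still up.
    """
    for name, serve in clusters:
        try:
            return name, serve(prompt)
        except ConnectionError:
            # This cluster is unreachable, so fall through to the next one.
            continue
    raise NoHealthyCluster("all clusters failed")


def underwater_unit(prompt):
    # Simulates the whole unit being offline.
    raise ConnectionError("unit offline")


def land_datacenter(prompt):
    # Stand-in for a real model call.
    return f"generated text for: {prompt}"


served_by, output = route_request(
    "hello",
    [("underwater-1", underwater_unit), ("land-1", land_datacenter)],
)
print(served_by, output)
```

Any generation that was in flight on the dead unit is simply lost and the client asks again, which matches the "few failed generations" point above.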