LLMs are an obvious dead end when it comes to actual "intelligence" or understanding how the world works.
But, this sounds like a "draw the rest of the owl" situation.
"JEPA learns abstract representations of how the world works, ignoring unpredictable surface detail."
Oh, it's that simple is it? Just have it "learn abstract representations of how the world works". Amazing how nobody thought to do that before!
I think I understand the distinction they're trying to draw. Current models are trained on billions of pictures of cats and billions of pictures of dogs. You feed one an image of Fido, it maps the image to a point in, say, 2500-dimensional space, and checks whether that point lands in "cat space" or "dog space". It can be very accurate, but it doesn't have any "understanding" of what makes something a cat vs. a dog. Humans, OTOH, aren't trained on billions of images. But, they learn about things like "teeth" and "whiskers" and "snouts" and "eyes". Within their knowledge of eyes, they spot that vertical slit pupils are unusual and distinctive, and part of what makes something "catlike". AFAIK, nobody has ever managed to create a system that learns abstract features like that without intensive human supervision.
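To make the "point in space" framing concrete, here's a toy sketch. Everything in it is made up for illustration: the embeddings are 3-dimensional instead of ~2500, and the "learned" class centroids are invented numbers. The point is just that the classifier picks whichever labeled region a point falls nearest to; nowhere in it is there a concept of whiskers or pupils.

```python
import math

# Hypothetical class centroids, standing in for what a model
# "learned" from billions of labeled images. Real embeddings
# would be high-dimensional; 3 dims keeps the sketch readable.
CENTROIDS = {
    "cat": (0.9, 0.1, 0.4),
    "dog": (0.2, 0.8, 0.5),
}

def classify(embedding):
    """Return the label whose centroid is nearest to the embedding.

    This is all the "decision" amounts to: a distance comparison
    in embedding space, with no notion of what the dimensions mean.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(CENTROIDS, key=lambda label: dist(embedding, CENTROIDS[label]))

print(classify((0.85, 0.15, 0.42)))  # a point deep in "cat space" -> cat
```

A real model's decision boundary is far more complicated than a nearest-centroid rule, but the structure of the argument is the same: the output is a region membership test, not a decomposition into features like "eyes" or "snouts".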
I like that they're trying something new. But, are they counting on a massive breakthrough on a problem that has existed since people first started theorizing about AI? Or, is it just a matter of refining a known process?