[D] Why do BYOL/JEPA-like models work? How does EMA prevent model collapse?

I am curious about your takes on BYOL/JEPA-like training methods and the intuitions/mathematics behind why the hell they work.

From an optimization perspective, without the EMA parameterization of the teacher model the objective would be trivial to minimize, and it would lead to model collapse (both networks mapping every input to the same constant embedding). However, EMA seems to avoid this. Why?
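
For concreteness, this is the kind of teacher update I mean (a minimal PyTorch-style sketch; module and function names are illustrative, not from any particular JEPA codebase): the teacher is just a slow exponential moving average of the student and never receives gradients.

```python
import copy
import torch

@torch.no_grad()
def ema_update(student: torch.nn.Module, teacher: torch.nn.Module, momentum: float = 0.996):
    """teacher <- momentum * teacher + (1 - momentum) * student"""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

# The teacher starts as a copy of the student and is only ever changed by ema_update.
student = torch.nn.Linear(128, 64)   # stand-in for the real encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

ema_update(student, teacher)          # typically called once per training step
```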

Specifically:

How can a network learn semantic embeddings without reconstructing the targets in input space? Where is the learning signal coming from? Why are these embeddings so good?
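
To make that concrete, here is roughly where the loss lives in these setups (a hedged sketch; the encoder/predictor names are placeholders): the student predicts the teacher's embedding of the masked targets, so nothing is ever decoded back to the input.

```python
import torch
import torch.nn.functional as F

def jepa_loss(student_enc, teacher_enc, predictor, context, targets):
    """Latent-prediction loss: match the teacher's embeddings, never the raw targets."""
    with torch.no_grad():                     # the teacher only provides targets, no gradient
        z_target = teacher_enc(targets)
    z_pred = predictor(student_enc(context))  # student + predictor try to match those embeddings
    return F.mse_loss(z_pred, z_target)       # the entire loss is computed in embedding space

# Stand-in modules just to show the shapes involved:
enc_s, enc_t = torch.nn.Linear(16, 8), torch.nn.Linear(16, 8)
pred = torch.nn.Linear(8, 8)
loss = jepa_loss(enc_s, enc_t, pred, torch.randn(4, 16), torch.randn(4, 16))
```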

I have had great success applying JEPA-like architectures to diverse domains, and I keep seeing that model collapse can be avoided by tuning the LR schedule / EMA schedule / masking ratio. I have no idea why this avoids the collapse, though.
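
(By "EMA schedule" I mean something like BYOL's cosine ramp of the teacher momentum toward 1.0; a sketch of that schedule, with my own step/total_steps naming:)

```python
import math

def ema_momentum(step: int, total_steps: int, tau_base: float = 0.996) -> float:
    """Cosine ramp of the teacher momentum from tau_base at step 0 toward 1.0 at the end,
    as in BYOL: the teacher moves fastest early in training and is nearly frozen late."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```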

