Supervised Contrastive Learning for Low-Resource Language Identification

By Negar Foroutan and 3 other authors

Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, LID models continue to perform poorly on low-resource languages, whose training data is often limited to a single domain, such as the Bible. To address these data imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach that learns domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining performance on high-resource languages.
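To make the core mechanism concrete: supervised contrastive learning pulls embeddings with the same label (here, the same language) together while pushing other samples apart, regardless of which domain each sample came from. Below is a minimal NumPy sketch of the standard SupCon loss (Khosla et al., 2020) applied to language labels; the function name, the toy embeddings, and the temperature value are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive (SupCon) loss over a batch.

    features: (N, D) embeddings (L2-normalized internally)
    labels:   (N,) integer class labels, e.g. language IDs
    Note: illustrative sketch; not the paper's exact implementation.
    """
    features = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = features @ features.T / temperature        # (N, N) scaled cosine similarities
    n = sim.shape[0]
    logits_mask = ~np.eye(n, dtype=bool)             # exclude self-comparisons
    # positives: other samples sharing the anchor's label
    pos_mask = (labels[:, None] == labels[None, :]) & logits_mask
    # numerically stabilized log-softmax over all non-self samples
    sim_max = np.max(np.where(logits_mask, sim, -np.inf), axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * logits_mask
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))
    # average log-probability over each anchor's positives
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0                           # skip anchors with no positive
    mean_log_prob_pos = (pos_mask * log_prob).sum(axis=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```

Because positives are defined purely by language label, sentences from different domains (e.g. Bible text and web text) of the same language are pulled together, which is one way to encourage the domain-invariant representations the abstract describes.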

Submission history

From: Negar Foroutan
[v1] Wed, 18 Jun 2025 09:35:33 UTC (9,317 KB)
[v2] Mon, 9 Mar 2026 20:16:21 UTC (9,311 KB)
