Current large language models fall short in understanding low-resource languages, particularly the minority languages in China, largely because pre-training data for these languages is scarce.
To address this accessibility gap, we present MC2, a Multilingual Corpus of Minority Languages in China, the largest open-source corpus of its kind to date. It covers four underrepresented languages: Tibetan, Uyghur, Kazakh (in the Kazakh Arabic script), and Mongolian (in the traditional Mongolian script).
When auditing previous multilingual web corpora for low-resource languages, we find critical quality issues. These defects pose a significant threat to effective model training and might undermine the credibility of research findings.
Misidentification of a Kazakh page as Uyghur in CulturaX
Insufficient data cleaning (residual text shown in gray) of a Tibetan web page in CulturaX
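As a rough illustration of how under-cleaned pages like the Tibetan example above can be flagged, one can measure the fraction of characters that belong to the expected script. This is a generic heuristic sketched for exposition, not the auditing procedure used for MC2; the block ranges follow the Unicode standard, while the 0.8 threshold is an arbitrary assumption.

import unicodedata

# Unicode blocks for the scripts covered by MC2.
SCRIPT_RANGES = {
    "tibetan": [(0x0F00, 0x0FFF)],
    "arabic": [(0x0600, 0x06FF), (0x0750, 0x077F)],   # used by Uyghur and Arabic-script Kazakh
    "mongolian": [(0x1800, 0x18AF)],
    "cyrillic": [(0x0400, 0x04FF)],
}

def script_ratio(text, script):
    """Fraction of letter characters that fall inside the target script's blocks."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    if not letters:
        return 0.0
    hits = sum(any(lo <= ord(ch) <= hi for lo, hi in SCRIPT_RANGES[script]) for ch in letters)
    return hits / len(letters)

def looks_under_cleaned(page_text, script, threshold=0.8):
    """Flag pages where much of the text lies outside the expected script,
    e.g. residual navigation bars or boilerplate in another language."""
    return script_ratio(page_text, script) < threshold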
We propose a quality-centric solution for collecting data in low-resource languages, which aims to ensure accuracy while improving comprehensiveness and coverage. We hope it lays a reliable groundwork for subsequent language model training and linguistic research.
Our collection procedure for MC2 consists of three main steps:
We compare the size of MC2 with other corpora in the table below.
MC2 (crawl) denotes the subset of newly collected web crawls. MC2 (full) is the complete corpus, which additionally contains texts gathered from existing resources.
†For the Uyghur split of OSCAR and CulturaX, we report the data sizes after manual language re-identification.
Many languages adopt distinct writing systems across various regions. For instance, in China, minority languages such as Kazakh and Mongolian employ scripts that differ from the Cyrillic scripts used in Kazakhstan and Mongolia. Unfortunately, existing datasets predominantly concentrate on the more prevalent writing systems, neglecting the less common ones. In response to this issue, MC2 is the first effort to collect native corpora for the two underrepresented writing systems, i.e., the Kazakh Arabic script and the traditional Mongolian script.
Comparison between the different writing systems of Kazakh (kk) and Mongolian (mn).
The sample texts mean 'hello'. The reported data sizes are those available for each script in CulturaX.
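Telling the script variants apart is mechanically simple, since the Cyrillic, Arabic, traditional Mongolian, and Tibetan blocks do not overlap in Unicode. The sketch below is an illustration of this idea, not the MC2 pipeline:

def dominant_script(text):
    """Return the writing system whose Unicode block covers most characters of the text."""
    counts = {"cyrillic": 0, "arabic": 0, "mongolian": 0, "tibetan": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0400 <= cp <= 0x04FF:
            counts["cyrillic"] += 1
        elif 0x0600 <= cp <= 0x06FF or 0x0750 <= cp <= 0x077F:
            counts["arabic"] += 1
        elif 0x1800 <= cp <= 0x18AF:
            counts["mongolian"] += 1
        elif 0x0F00 <= cp <= 0x0FFF:
            counts["tibetan"] += 1
    return max(counts, key=counts.get) if any(counts.values()) else "unknown"

print(dominant_script("Сәлем"))   # Cyrillic Kazakh "hello" -> "cyrillic"

Note that this only separates writing systems; it cannot tell Arabic-script Kazakh from Uyghur, which share the Arabic block and require proper language identification, as the CulturaX example above illustrates.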
To obtain a model for a low-resource script, it is intuitive to transliterate a corpus in the high-resource script into the low-resource one for training. However, for languages such as Mongolian there are no one-to-one conversion rules between scripts: transliteration between traditional and Cyrillic Mongolian is context-dependent, and current open-source tools are far from perfect. Training on noisy data transliterated from the high-resource writing system would greatly hinder learning of the low-resource one.
For some languages, such as Kazakh, transliteration between writing systems can be done exactly with pre-defined rules. Nevertheless, the language variants written in different scripts differ in their cultural backgrounds.
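Such rule-based transliteration is, at its core, a character mapping table applied to the text. The sketch below shows only the structure; the handful of mappings listed is an illustrative, incomplete subset rather than a verified Kazakh Cyrillic-to-Arabic conversion table, and real rules also handle casing and digraphs.

# Tiny illustrative subset of a Cyrillic -> Arabic-script Kazakh mapping.
# A full table covers the whole alphabet and should come from a verified source.
CYRILLIC_TO_ARABIC_KK = {
    "б": "ب", "д": "د", "л": "ل", "м": "م",
    "н": "ن", "р": "ر", "с": "س", "т": "ت",
}

def transliterate(text, table=CYRILLIC_TO_ARABIC_KK):
    # Characters without a rule are kept unchanged, which makes gaps easy to spot.
    return "".join(table.get(ch.lower(), ch) for ch in text)

In practice one would take the full mapping from an established converter rather than hand-writing it.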
Using probing, we investigate whether training data collected from different writing systems leads to distinct cultural knowledge in the resulting models.
We take the Kazakh language as our research target. The Kazakh community in China uses the Arabic script while the Cyrillic script is adopted in Kazakhstan.
We train two distinct Kazakh language models based on XLM-RoBERTa-large, each tailored to one writing system: one is trained on 900M of authentic Cyrillic Kazakh text from CulturaX, and the other on an equivalent volume of Arabic-script Kazakh text from our MC2 corpus.
We then pose cultural probing questions, which reflect the cultural differences between the two Kazakh communities, to the two models. The Arabic-script Kazakh model is queried with questions written in the Arabic script, and the Cyrillic-script Kazakh model with questions written in the Cyrillic script.
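Concretely, each probe is a cloze-style query answered by the masked language model. The sketch below shows what this looks like with the Hugging Face transformers fill-mask pipeline; the checkpoint paths and the probe sentence are placeholders, not the released models or the actual probe set.

from transformers import pipeline

# Placeholder paths to the two continually pretrained checkpoints, one per script.
ARABIC_KK_MODEL = "path/to/xlmr-large-kk-arabic"
CYRILLIC_KK_MODEL = "path/to/xlmr-large-kk-cyrillic"

def probe(model_path, cloze, top_k=5):
    """Return the top-k fillers the masked LM predicts for the <mask> slot."""
    fill = pipeline("fill-mask", model=model_path, top_k=top_k)
    return [(p["token_str"], round(p["score"], 3)) for p in fill(cloze)]

# Each model is probed with the same question written in its own script,
# e.g. (English gloss) "The currency used in everyday life is <mask>."
# probe(ARABIC_KK_MODEL, arabic_script_cloze)
# probe(CYRILLIC_KK_MODEL, cyrillic_script_cloze)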
As shown in the following examples, the two models exhibit distinct cultural knowledge. The Arabic Kazakh model is more familiar with the Kazakh community in China, while the Cyrillic Kazakh model is more knowledgeable about the Kazakh community in Kazakhstan.
Example 1: Holiday
Example 2: Currency
Example 3: Geography
To demonstrate the practical value of our corpus, we train two models with MC2 and compare their performance with competitive counterparts.
We compare our models with the following baselines:
We mainly test on text classification (WCM-v2) and question answering (TibetanQA). For encoder models, we adopt the zero-shot cross-lingual transfer setting, i.e., fine-tuning on English data and testing on the target languages. For decoder-only models, we adopt in-context learning.
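For reference, here is a minimal sketch of an in-context learning evaluation for classification: the model sees a few demonstrations, and the label with the highest next-token score is taken as its prediction. The checkpoint path, prompt template, and label handling are illustrative assumptions, not the exact protocol of the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/mc2-llama-13b"   # placeholder checkpoint path

def build_prompt(demos, query, labels):
    """Few-shot prompt: (text, label) demonstrations followed by the query."""
    lines = [f"Text: {t}\nCategory ({' / '.join(labels)}): {y}" for t, y in demos]
    lines.append(f"Text: {query}\nCategory ({' / '.join(labels)}):")
    return "\n\n".join(lines)

def classify(prompt, labels, tokenizer, model):
    """Pick the label whose first token gets the highest next-token score."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    label_ids = [tokenizer(" " + l, add_special_tokens=False).input_ids[0] for l in labels]
    return labels[int(torch.argmax(next_token_logits[label_ids]))]

# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)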
Performance of different models under the zero-shot transfer setting.
Performance of different models under the in-context learning setting.
The results show that our models achieve competitive performance compared with the baselines. Notably, MC2XLMR-large performs comparably to CINO, which is trained on a closed-source corpus three times the size of MC2. This indicates that continual pretraining on MC2 effectively enhances model performance on low-resource languages.
@article{zhang2024mc,
title={MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China},
author={Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong},
journal={arXiv preprint arXiv:2311.08348},
year={2024}
}