CONTRIBUTING.md: Linguistic Contributors Guide
🌱 Welcome
Thank you for contributing to Zomi‑Syl, a community‑driven project to build the first high‑quality, reproducible NLP tools for the Zomi language.
This guide is for linguistic contributors — speakers, annotators, and linguists who help with:
- syllabification
- tone classification
- morphological segmentation
- lexicon entries
- example sentences
- variant checking
- dataset corrections
You do not need programming experience to contribute.
🧭 What You Can Contribute
Linguistic contributors can help with:
- Correcting syllabification
- Providing tone‑accurate examples
- Identifying ambiguous or irregular words
- Adding lexicon entries
- Providing example sentences
- Correcting morphological segmentation
- Reporting inconsistencies
- Validating golden regression outputs
If you want to explore specific areas:
- syllable variants
- tone classes
- derivational suffixes
- lexicon entry structure
📦 Project Structure (Linguist‑Relevant)
data/ → small linguistic metadata files
training/data/ → cleaned syllabification datasets
tests/golden/ → golden syllabification outputs
src/zomi_syl/lexicon/ → lexicon entries (future)
You will mostly interact with:
- Google Sheets (syllabification dataset)
- TSV files (syllable variants, tone lists)
- golden regression outputs (for validation)
📝 How to Contribute Linguistic Data
1. Syllabification Corrections
If you find a word that is syllabified incorrectly:
- provide the correct syllabification
- include tone if known
- include a short example sentence (optional)
Example:
Word: themthum
Correct: them-thum
Tone: H-L
Example: A themthum a om.
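For contributors who are comfortable running a script, a correction like the one above can be sanity-checked mechanically before it is submitted. The following is a minimal sketch, not part of the Zomi‑Syl codebase: the function name `check_correction` and the convention of one tone label per syllable are assumptions made for illustration.

```python
from typing import List, Optional


def check_correction(word: str, syllabified: str, tone: Optional[str] = None) -> List[str]:
    """Return a list of mechanical problems in a proposed correction (empty if none)."""
    problems = []
    syllables = syllabified.split("-")
    # Removing the hyphens must give back the original word exactly.
    if "".join(syllables) != word:
        problems.append("syllables do not spell the word")
    # A doubled, leading, or trailing hyphen produces an empty syllable.
    if "" in syllables:
        problems.append("empty syllable")
    # Assumption: tone is written as one label per syllable, joined by hyphens.
    if tone is not None and len(tone.split("-")) != len(syllables):
        problems.append("tone count does not match syllable count")
    return problems


print(check_correction("themthum", "them-thum", "H-L"))  # prints []
print(check_correction("themthum", "them-tum"))  # prints ['syllables do not spell the word']
```

A check like this only catches typing slips. Whether the syllable boundary itself is right remains a linguistic judgment.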
2. Tone Corrections
Tone is essential for:
- rule‑based syllabification
- CRF training
- lexicon entries
- morphological analysis
If you correct tone:
- specify the tone class (H, L, F, R, etc.)
- provide minimal pairs if possible
- note dialect differences if relevant
3. Morphological Segmentation
If you contribute segmentation:
- segment using hyphens
- identify derivational suffixes
- note tone changes
- provide glosses if possible
Example:
khaang-vui
khaang = dry
-vui = become (inchoative)
4. Lexicon Entries
When adding or correcting lexicon entries:
- provide lemma
- provide definition
- provide example sentence
- provide tone
- provide morphological notes if relevant
If you want to see the full contributor template:
- lexicon entry template
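As a rough illustration of the checklist above, a draft entry can be represented as a simple key/value record and checked for missing fields. This is a sketch under assumptions: the lexicon format is not yet defined by the project, and the key names and the helper `missing_fields` are hypothetical. Morphological notes are treated as optional, since they are only requested when relevant.

```python
from typing import Dict, List

# Assumption: these key names are illustrative, taken from the checklist above.
REQUIRED_FIELDS = ("lemma", "definition", "example", "tone")


def missing_fields(entry: Dict[str, str]) -> List[str]:
    """List the required fields that are absent or blank in a draft entry."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name, "").strip()]


# A deliberately incomplete draft, using the gloss from the segmentation example.
draft = {"lemma": "khaang", "definition": "dry"}
print(missing_fields(draft))  # prints ['example', 'tone']
```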
5. Identifying Ambiguous Words
Ambiguous words are words with multiple valid syllabifications.
These must not appear in the golden regression set.
If you find one:
- list all valid variants
- provide examples for each
- note dialect differences
Example:
Word: suahin
Variants:
1. sua-hin
2. suah-in
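The rule that ambiguous words stay out of the golden set can be checked with a few lines of code. The sketch below assumes that reported syllabifications are available as (word, syllabification) pairs. The function names and the sample golden word list are hypothetical and are not taken from the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_variants(reports: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect words that were reported with more than one syllabification."""
    variants = defaultdict(set)
    for word, syllabified in reports:
        variants[word].add(syllabified)
    return {word: sorted(forms) for word, forms in variants.items() if len(forms) > 1}


def ambiguous_in_golden(golden_words: Iterable[str], ambiguous: Dict[str, List[str]]) -> List[str]:
    """Return golden-set words that are known to be ambiguous."""
    return sorted(set(golden_words) & set(ambiguous))


# Illustrative reports only; "suahin" is the ambiguous example from this guide.
reports = [("suahin", "sua-hin"), ("suahin", "suah-in"), ("upa", "u-pa")]
ambiguous = group_variants(reports)
print(ambiguous)  # prints {'suahin': ['sua-hin', 'suah-in']}
print(ambiguous_in_golden(["upa", "suahin"], ambiguous))  # prints ['suahin']
```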
🧪 Validating Golden Regression
Linguistic contributors help ensure the golden set is:
- correct
- consistent
- tone‑accurate
- free of ambiguous words
To validate:
- Look at tests/golden/crf_golden.tsv
- Check each word’s syllabification
- Report any incorrect or ambiguous entries
You can also request a CRF prediction check:
- check CRF output
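Part of the validation pass can be automated before the manual review. The sketch below assumes a two-column layout (word, then hyphenated syllabification, separated by a tab). The actual column layout of `crf_golden.tsv` is not described in this guide, so treat the layout, the sample rows, and the function name `scan_golden` as assumptions.

```python
import csv
import io

# Illustrative rows only, NOT the real file. Assumed layout: word<TAB>syllabification.
SAMPLE_TSV = "themthum\tthem-thum\nupa\tup-a\nsuahin\tsua-hin\nsuahin\tsuah-in\n"


def scan_golden(handle):
    """Return (line_number, word, message) tuples for rows worth a closer look."""
    findings = []
    first_seen = {}  # word -> (line_number, syllabification)
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if len(row) < 2:
            findings.append((line_number, row[0] if row else "", "missing column"))
            continue
        word, syllabified = row[0], row[1]
        # The syllables, with hyphens removed, must spell the word.
        if syllabified.replace("-", "") != word:
            findings.append((line_number, word, "syllables do not spell the word"))
        # The same word with a different syllabification suggests ambiguity.
        if word in first_seen and first_seen[word][1] != syllabified:
            findings.append((line_number, word, "conflicts with line %d" % first_seen[word][0]))
        first_seen.setdefault(word, (line_number, syllabified))
    return findings


print(scan_golden(io.StringIO(SAMPLE_TSV)))  # prints [(4, 'suahin', 'conflicts with line 3')]
```

Note that the row `up-a` passes this scan even though the expected form is `u-pa`: a script can flag mechanical problems and conflicting entries, but a wrong boundary still needs a speaker or linguist to catch it. To run the scan on the real file, the in-memory sample could be replaced with `open("tests/golden/crf_golden.tsv", newline="", encoding="utf-8")`, provided the layout assumption holds.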
📣 Reporting Issues
When reporting a linguistic issue, include:
- the word
- expected syllabification
- tone
- dialect (if relevant)
- example sentence
- explanation of the issue
Example:
Issue: Incorrect syllabification
Word: upa
Expected: u-pa
Actual: up-a
Dialect: Tedim
Notes: The vowel boundary is misidentified.
🤝 Collaboration Principles
- Respect dialect diversity
- Provide examples when possible
- Explain reasoning clearly
- Avoid prescriptive judgments
- Focus on linguistic evidence
- Be patient with model limitations
❤️ Thank You
Your contributions help build the first comprehensive, community‑driven NLP toolkit for the Zomi language.
Every correction, example, and insight strengthens the linguistic foundation of the project.