CONTRIBUTING.md — Infra Maintainers Guide¶

🔧 Purpose¶

This document defines the responsibilities, workflows, and rules for infra maintainers of the Zomi‑Syl project.
Infra maintainers are responsible for:

reproducible CRF training
golden regression stability
model freeze + packaging
Makefile orchestration
dataset ingestion + cleaning
release engineering
CI stability
preventing model drift

This guide ensures the entire pipeline remains deterministic and auditable.

📁 Infra‑Relevant Repository Structure¶

training/                    → CRF training pipeline
training/data/               → cleaned datasets
training/model/crf/          → temporary training outputs

scripts/                     → golden, freeze, validation scripts
scripts/release_crf_freeze.sh
scripts/get_golden_crf_frozen_data.py

src/zomi_syl/models/         → packaged inference models (only these go to PyPI)

tests/golden/                → golden regression snapshots
tests/                       → regression + backend tests

Makefile                     → orchestration for training, testing, release
MANIFEST.in                  → PyPI exclusion rules
.pre-commit-config.yaml      → formatting + safety hooks

🧱 Core Responsibilities¶

1. Maintain Reproducibility¶

Infra maintainers must ensure:

deterministic CRF training
stable feature extraction
stable golden regression
no accidental model drift
no stale artifacts in Git
no training models committed

Reproducibility is the highest priority.

2. Maintain Golden Regression¶

Golden regression is the contract between:

the packaged inference model
the test suite
the release pipeline

Infra maintainers must:

ensure golden files reflect the frozen model
ensure no ambiguous words appear in golden
regenerate golden after any model change
validate golden diffs before merging

Golden regeneration:

python scripts/get_golden_crf_frozen_data.py

Golden inspection:

make check-crf WORDS="amah upa zawlai"

3. Maintain the CRF Training Pipeline¶

Infra maintainers own:

dataset ingestion
dataset cleaning
feature extraction
training configuration
evaluation metrics
training reproducibility

Key commands:

make get-zomi-syllabified-human
make clean-dataset
make train-crf

Training outputs must never be committed.

4. Maintain the Release Freeze Workflow¶

The release freeze script:

scripts/release_crf_freeze.sh

Must always:

fetch dataset
clean dataset
train CRF
freeze model
package model into wheel
regenerate golden
validate tests
remove temporary artifacts

Infra maintainers must ensure:

freeze script works on clean machines
wheel contains correct model
no training artifacts leak into Git
versioning is correct

5. Maintain Makefile Orchestration¶

The Makefile is the single source of truth for:

training
testing
golden regeneration
release freeze
linting
dataset ingestion

Infra maintainers must:

keep targets deterministic
avoid side effects
ensure targets work on Linux + macOS
ensure targets do not require secrets

6. Maintain Pre‑Commit Hooks¶

Infra maintainers must ensure:

Black formatting
Ruff linting
YAML/JSON/TOML validation
model‑file blocking
golden‑ambiguity blocking

Run:

pre-commit install
pre-commit run --all-files

7. Maintain CI Stability¶

Infra maintainers must ensure:

CI runs make test
CI enforces pre‑commit
CI validates wheel build
CI validates golden regression

CI must never allow:

model drift
golden drift
missing packaged model
stale artifacts

🧪 Testing Requirements¶

Infra maintainers must ensure:

all tests pass before merging
golden regression is stable
CRFBackend loads packaged model
no test depends on local state

Run:

make test

🚫 What Infra Maintainers Must Never Do¶

never commit training models
never commit large datasets
never modify golden without explanation
never change CRF features without retraining
never bypass pre‑commit
never merge without full test pass
never break reproducibility

📦 Release Responsibilities¶

Infra maintainers own the release pipeline:

version bump
golden regeneration
freeze script execution
wheel validation
PyPI upload
post‑release verification

Release dry‑run:

make test
pip install dist/zomi_syl-*.whl --force-reinstall
python -m zomi_syl syllabify "themthum"

🧭 How to Propose Infra Changes¶

Open an issue describing:
motivation
reproducibility impact
golden impact
release impact
Create a feature branch
Update Makefile + scripts
Update tests
Run full release dry‑run
Submit PR with:
evaluation summary
golden diff
wheel validation

🤝 Thank You¶

Infra maintainers are the backbone of Zomi‑Syl.
Your work ensures the project remains:

reproducible
stable
linguistically correct
future‑proof

This is the foundation for the entire Zomi NLP ecosystem.

Keys	Action
`?`	Open this help
`n`	Next page
`p`	Previous page
`s`	Search