⭐ 1. Do all backends have a tagset?
No. Every backend exposes metadata, but only backends that predict tags have a tagset; the others describe themselves with different fields.
✔ CRF backend
Yes. The CRF must have a tagset because it predicts BIO-style tags. Only two of them are used here (there is no "O" tag):
["B", "I"]
This is part of the model definition.
✔ Rule backend
No. The rule backend does not use tags.
It does, however, have a “feature set” (onsets, nuclei, codas, rules).
✔ FST backend
No tagset, but it has states, transitions, and so on.
✔ Transformer backend
No tagset, but it has layers, heads, and a hidden size.
✔ BiLSTM backend
If it predicts BIO tags, then yes, it will have a tagset.
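To make the role of the tagset concrete, here is a minimal sketch of how a B/I tag sequence could be turned back into syllables. The function name `tags_to_syllables` and the segmentation of "itna" as "it" + "na" are assumptions made for illustration only; the real backend may tag different units or decode differently.

```python
from typing import List, Sequence


def tags_to_syllables(chars: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Group characters into syllables: "B" opens a syllable, "I" extends it.

    Hypothetical helper, not part of the backend interface.
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    syllables: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag not in ("B", "I"):
            raise ValueError(f"unknown tag: {tag!r}")
        # A leading "I" is tolerated and simply opens the first syllable.
        if tag == "B" or not syllables:
            syllables.append(ch)
        else:
            syllables[-1] += ch
    return syllables


# Assumed tagging for illustration: i=B, t=I, n=B, a=I
print(tags_to_syllables("itna", ["B", "I", "B", "I"]))  # ['it', 'na']
```

Because the decoder depends on exactly which tags exist, the tagset belongs in the metadata of any backend that works this way.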
⭐ 2. So how do we unify metadata across all backends?
We do NOT force every backend to have a tagset.
Instead:
- CRF backend → features.tagset
- Rule backend → features.ruleset_summary
- FST backend → features.fst_states
- Transformer backend → features.num_layers, etc.
This is exactly why UMS has:
"features": { ... backend-specific ... }
Each backend fills in only what applies.
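One way to picture "each backend fills in only what applies" is a shared base class with an overridable hook. The sketch below is an assumption about how such a base could look, not the project's actual code: the class names `SketchBase`, `SketchCRF`, and `SketchRule` and the contents of `ruleset_summary` are hypothetical, and the `model` and `runtime` blocks of UMS are left out for brevity.

```python
from typing import Any, Dict


class SketchBase:
    """Hypothetical stand-in for the shared backend base class."""

    backend_name = "base"
    backend_type = "unknown"
    backend_version = "0.0.0"

    def _feature_metadata(self) -> Dict[str, Any]:
        # Hook for subclasses; by default a backend reports no features.
        return {}

    def _base_ums(self) -> Dict[str, Any]:
        # Shared wrapper: identity is uniform, features are backend-specific.
        return {
            "backend": {
                "name": self.backend_name,
                "type": self.backend_type,
                "version": self.backend_version,
            },
            "features": self._feature_metadata(),
        }


class SketchCRF(SketchBase):
    backend_name = "crf"
    backend_type = "statistical"
    backend_version = "1.0.0"

    def _feature_metadata(self) -> Dict[str, Any]:
        return {"tagset": ["B", "I"]}


class SketchRule(SketchBase):
    backend_name = "rule"
    backend_type = "rule_based"  # assumed label
    backend_version = "1.0.0"

    def _feature_metadata(self) -> Dict[str, Any]:
        # Hypothetical summary fields; no tagset is reported.
        return {"ruleset_summary": {"num_rules": 3}}


for backend in (SketchCRF(), SketchRule()):
    print(backend.backend_name, sorted(backend._base_ums()["features"]))
```

With this shape, the outer structure stays identical for every backend while the keys inside `features` vary.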
⭐ 3. CRF backend implementation (UMS‑compliant)
Here is the UMS implementation for the CRF backend, using the existing interface.
✔ Add backend identity
✔ Add feature metadata
✔ Add UMS wrapper
✔ Keep existing capabilities block
from typing import Any, Dict


class CRFBackend(BaseSyllabifier):
    # Backend identity
    backend_name = "crf"
    backend_type = "statistical"
    backend_version = "1.0.0"

    def _feature_metadata(self) -> Dict[str, Any]:
        # Backend-specific entries for the UMS "features" block
        return {
            "tagset": ["B", "I"],
            "num_features": len(self.feature_templates),
            "feature_templates": self.feature_templates,
            "model_path": str(self.model_path),
        }

    def get_metadata(self) -> Dict[str, Any]:
        return {
            "backend_type": "crf",
            "version": self.backend_version,
            # Existing capabilities block, unchanged
            "capabilities": {
                "supports_confidence": True,
                "supports_batch": True,
                "supports_gpu": False,
                "supports_cpu": True,
            },
            # UMS wrapper
            "ums": self._base_ums(),
        }
This matches the existing backend interface. The features block of the UMS output is filled from _feature_metadata(), presumably through _base_ums().
⭐ 4. Why this is the correct design
✔ CRF backend has a tagset
Because it predicts BIO tags.
✔ Rule backend does NOT have a tagset
Because it does not predict tags.
✔ UMS allows backend‑specific features
Without forcing irrelevant fields.
✔ return_metadata=True now matters
Because metadata is only attached when requested.
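The "only attached when requested" behaviour can be sketched as a thin wrapper around a backend. Everything here is hypothetical: the function name, the `split` and `get_metadata` callables, and the parameter name `with_metadata`, which stands in for whatever the real flag is called.

```python
from typing import Any, Callable, Dict, List


def syllabify_with_optional_metadata(
    word: str,
    split: Callable[[str], List[str]],
    get_metadata: Callable[[], Dict[str, Any]],
    with_metadata: bool = False,
) -> Dict[str, Any]:
    """Return the syllables, adding a "metadata" key only on request."""
    result: Dict[str, Any] = {"syllables": split(word)}
    if with_metadata:
        # Metadata is built lazily, so callers who skip it pay nothing.
        result["metadata"] = get_metadata()
    return result


# Toy stand-ins for a real backend (assumptions for illustration).
def toy_split(word: str) -> List[str]:
    return [word]


def toy_metadata() -> Dict[str, Any]:
    return {"backend_type": "toy"}


plain = syllabify_with_optional_metadata("itna", toy_split, toy_metadata)
rich = syllabify_with_optional_metadata(
    "itna", toy_split, toy_metadata, with_metadata=True
)
print(sorted(plain))  # ['syllables']
print(sorted(rich))   # ['metadata', 'syllables']
```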
⭐ 5. What the CRF metadata will look like
When a user calls:
zs.syllabify("itna", model="crf", return_metadata=True)
They will see:
raw["metadata"] = {
    "backend_type": "crf",
    "version": "1.0.0",
    "capabilities": {...},
    "ums": {
        "backend": {...},
        "model": {...},
        "features": {
            "tagset": ["B", "I"],
            "num_features": 128,
            "feature_templates": [...],
            "model_path": ".../crf_syllabifier.joblib"
        },
        "runtime": {...}
    }
}
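Since `features` differs per backend, code that consumes this metadata should probably not assume a tagset is present. A minimal sketch of a defensive reader follows; the helper name `get_tagset` and the trimmed sample dictionaries are assumptions for illustration.

```python
from typing import Any, Dict, List, Optional


def get_tagset(metadata: Dict[str, Any]) -> Optional[List[str]]:
    """Return the tagset if the backend reports one, else None."""
    features = metadata.get("ums", {}).get("features", {})
    return features.get("tagset")


# Trimmed, hypothetical metadata for two backends.
crf_meta = {"backend_type": "crf", "ums": {"features": {"tagset": ["B", "I"]}}}
rule_meta = {"backend_type": "rule", "ums": {"features": {"ruleset_summary": {}}}}

print(get_tagset(crf_meta))   # ['B', 'I']
print(get_tagset(rule_meta))  # None
```

Returning `None` rather than raising keeps callers working across CRF, rule, FST, and neural backends alike.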