ruBERT-ruLaw-NER-judicial
Named Entity Recognition model for Russian judicial decisions. Fine-tuned on a manually annotated corpus of 1335 sentence fragments drawn from administrative, civil, and criminal cases, covering six entity classes.
Base model: TryDotAtwo/ruBERT-ruLaw, a ruBERT model with domain-adaptive pre-training (DAPT) on the RusLawOD corpus of Russian normative legal acts.
This model was developed as part of a bachelor's thesis. Full code, dataset, and experiments: github.com/Bishop-Y/Bachelor_Thesis
Performance
Metrics below are for this single best checkpoint on the held-out test set (201 sentence fragments).
The primary metric is strict span-level F1: a prediction counts as a true positive
only on an exact (start, end, label) match - any boundary shift or label mismatch
is an error.
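The strict matching rule can be illustrated with a minimal sketch. It assumes each fragment's entities are given as a set of (start, end, label) tuples; the function name `strict_span_f1` and the toy spans are illustrative and are not taken from the thesis code.

```python
def strict_span_f1(gold, pred):
    """Micro-averaged strict span scores.

    gold, pred: lists (one item per fragment) of iterables of
    (start, end, label) tuples. A predicted span is a true positive
    only if the identical tuple appears in the gold set.
    """
    tp = fp = fn = 0
    for gold_spans, pred_spans in zip(gold, pred):
        g, p = set(gold_spans), set(pred_spans)
        tp += len(g & p)   # exact (start, end, label) matches
        fp += len(p - g)   # predicted spans with no exact gold match
        fn += len(g - p)   # gold spans the model missed or shifted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: the DATE span is predicted one character short,
# so it counts as both a false positive and a false negative.
gold = [{(0, 10, "PERSON"), (15, 25, "DATE")}]
pred = [{(0, 10, "PERSON"), (15, 24, "DATE")}]
precision, recall, f1 = strict_span_f1(gold, pred)
print(precision, recall, f1)
```

Under this rule a one-character boundary shift costs as much as a completely wrong prediction, which is why strict F1 is a conservative metric. Per-class scores follow from the same function after filtering both sets to a single label.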
| Class | F1 |
|---|---|
| DATE | 0.917 |
| LAW | 0.966 |
| ORG | 0.901 |
| PENALTY | 0.807 |
| PERSON | 0.977 |
| PROVISION | 0.950 |
| Macro F1 | 0.920 |
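As a quick consistency check, and assuming Macro F1 here follows the usual definition (the unweighted mean of the per-class F1 scores), the reported value can be reproduced from the table. The per-class figures are already rounded to three decimals, so the result is only expected to agree to rounding.

```python
# Per-class strict F1 scores copied from the table above.
per_class_f1 = {
    "DATE": 0.917,
    "LAW": 0.966,
    "ORG": 0.901,
    "PENALTY": 0.807,
    "PERSON": 0.977,
    "PROVISION": 0.950,
}

# Macro F1: every class contributes equally, regardless of its frequency.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.3f}")
```

Because each class carries equal weight, the weaker PENALTY score lowers the macro average more than it would lower a frequency-weighted one if PENALTY spans are comparatively rare.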
Training data
The corpus was built on top of court-decision texts from the RuLegalNER dataset [Shaheen et al., 2023].
Splits (document-level, stratified by case type and rarest penalty subtype): 1000 / 134 / 201 fragments (train / val / test).
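A document-level stratified split can be sketched as follows. This is not the procedure used in the thesis, only an illustration of the idea: whole documents are assigned to one split, so fragments from the same decision never straddle train and test. The keys `doc_id` and `stratum`, the function name, and the ratios (chosen to roughly match 1000 / 134 / 201) are assumptions.

```python
import random
from collections import defaultdict


def document_level_split(docs, ratios=(0.75, 0.10, 0.15), seed=0):
    """Assign whole documents to train/val/test, stratum by stratum.

    docs: list of dicts with hypothetical keys 'doc_id' and 'stratum'
    (for example, a case type combined with a penalty subtype).
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for doc in docs:
        by_stratum[doc["stratum"]].append(doc["doc_id"])

    split = {"train": [], "val": [], "test": []}
    for ids in by_stratum.values():
        ids = sorted(ids)          # fixed order before shuffling
        rng.shuffle(ids)
        n_train = round(len(ids) * ratios[0])
        n_val = round(len(ids) * ratios[1])
        split["train"] += ids[:n_train]
        split["val"] += ids[n_train:n_train + n_val]
        split["test"] += ids[n_train + n_val:]   # remainder
    return split


# Toy corpus: 20 documents for each of three case types.
docs = [
    {"doc_id": f"{case_type}-{i}", "stratum": case_type}
    for case_type in ("administrative", "civil", "criminal")
    for i in range(20)
]
split = document_level_split(docs)
print({name: len(ids) for name, ids in split.items()})
```

Fragments are then routed to whichever split their parent document landed in, so the fragment counts per split only approximate the target ratios.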
Limitations
- Annotated by a single annotator.
- Built on a single corpus of court decisions - generalization to other legal document types (contracts, claims, etc.) is not guaranteed.
- Cased model: performance degrades under case perturbation of the input.
- PENALTY is the weakest and most data-hungry class, especially on rare subtypes.
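The case-sensitivity limitation can be probed with a small helper that produces case-perturbed copies of an input. This is a minimal sketch: the function name `case_variants` and the sample sentence are illustrative, and the thesis may use a different perturbation scheme.

```python
def case_variants(text: str) -> dict[str, str]:
    """Return the input alongside lower-cased and upper-cased copies."""
    return {
        "original": text,
        "lower": text.lower(),
        "upper": text.upper(),
    }


# Hypothetical input sentence, not taken from the corpus.
sample = "Суд назначил Иванову И.И. штраф по ст. 12.8 КоАП РФ."
variants = case_variants(sample)
for name, variant in variants.items():
    print(f"{name}: {variant}")
```

For this sample the character count is unchanged by the perturbation, so gold character offsets remain valid and each variant can be scored against the same annotations to measure the drop.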
Citation
The dataset is based on:
@article{shaheen2023rulegalner,
author = {Zein Shaheen and Dmitry I. Mouromtsev and Ignat Postny},
title = {RuLegalNER: A New Dataset for Russian Legal Named Entities Recognition},
journal = {Scientific and Technical Journal of Information Technologies, Mechanics and Optics},
volume = {23},
number = {4},
pages = {854--857},
year = {2023}
}