In this notebook, we will produce a dataset for training, validating, and testing a model with masked language modeling (MLM) and named entity recognition (NER) objectives. We will do this by finding occurrences of all entities from regests in all OCR documents.
The search takes a while. In parallel to the search, we will produce a confirmatory spreadsheet with pre-filtered canonical entities and pass it to expert annotators.
The annotators will filter out non-entities and provide the type (person or place) and the canonical form of each remaining entity, fixing the most jarring typos.
Finally, having the confirmed canonical entities and the search results, we will create the datasets for training, validating, and testing a model with masked language modeling (MLM) and named entity recognition (NER) objectives.
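
As a rough illustration of the search step, the sketch below finds exact, case-insensitive whole-word occurrences of entity strings in a single OCR text and records their character offsets. It is only a minimal sketch under stated assumptions: the names `Occurrence` and `find_occurrences` and the sample entities and text are hypothetical, and the actual search in this notebook may well use a different (for example, fuzzy) matching strategy to cope with OCR errors.

```python
import re
from typing import Iterator, NamedTuple


class Occurrence(NamedTuple):
    """A single match of an entity in an OCR text (hypothetical helper type)."""
    entity: str  # entity string as taken from a regest
    start: int   # character offset where the match starts
    end: int     # character offset just past the end of the match


def find_occurrences(entities: list[str], ocr_text: str) -> Iterator[Occurrence]:
    """Yield every exact, case-insensitive whole-word match of each entity."""
    for entity in entities:
        # The lookarounds prevent matches inside longer words.
        pattern = re.compile(
            r'(?<!\w)' + re.escape(entity) + r'(?!\w)',
            re.IGNORECASE,
        )
        for match in pattern.finditer(ocr_text):
            yield Occurrence(entity, match.start(), match.end())


# Made-up sample data, for illustration only.
entities = ['Wenceslaus', 'Brno']
ocr_text = 'King Wenceslaus confirms the privileges of the town of Brno.'
occurrences = list(find_occurrences(entities, ocr_text))
```

The recorded offsets can later be turned into token-level NER labels once the annotators have confirmed which entities are genuine and what their types are.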