# Train NER models
In this notebook, we will train a number of named entity recognition (NER) models using different training schedules and training/validation datasets. Then, we will select the best model using our test dataset.
## Preliminaries
We will begin with a bit of boilerplate, logging information and setting up the computational environment.
``` python
! hostname
```

``` python
! python -V
```
%% Output
Python 3.8.10
Install the current version of the package and its dependencies.
``` python
! pip install .
```
Make sure NumPy does not parallelize. The thread limits are set before NumPy is imported, because the numerical backends read these variables when they are loaded.
``` python
import os
```

``` python
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
```
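The same thread limits can be written as a loop, which makes it easy to confirm afterwards that every variable took effect. This is a minimal sketch: `THREAD_LIMIT_VARIABLES` and `thread_limits` are names introduced here for illustration and are not part of the package.

``` python
import os

# Hypothetical name; the variables themselves are the ones set in the cell above.
THREAD_LIMIT_VARIABLES = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)

# Restrict every numerical backend to a single thread.
for variable in THREAD_LIMIT_VARIABLES:
    os.environ[variable] = "1"

# Collect the effective settings so that they can be logged or inspected.
thread_limits = {variable: os.environ[variable] for variable in THREAD_LIMIT_VARIABLES}
print(thread_limits)
```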
Pick the GPU that we will use.
``` python
! nvidia-smi -L
```
%% Output
GPU 0: NVIDIA A40 (UUID: GPU-177e5a84-366f-6464-1bbb-908f2dd979cc)
GPU 1: Tesla T4 (UUID: GPU-cf4e7061-619f-5b3b-a217-410f6d506d62)
GPU 2: Tesla T4 (UUID: GPU-00386b4a-741a-aac4-b833-b678a811936f)
GPU 3: Tesla T4 (UUID: GPU-10531c8c-13c3-8e82-302b-91a5615701d6)
GPU 4: Tesla T4 (UUID: GPU-82eac985-cf18-1379-cbcc-e8d71246e28c)
GPU 5: Tesla T4 (UUID: GPU-552f5db8-cec9-3733-3394-17c1ecbc8b85)
GPU 6: Tesla T4 (UUID: GPU-7d2ad51d-6c12-c878-1a30-a21a7fe9c7bd)
GPU 7: Tesla T4 (UUID: GPU-81bd2022-c6f6-4a67-d3f3-f461591e20ab)
GPU 8: Tesla T4 (UUID: GPU-4f6616fb-96e0-adbd-6ee5-7b6146de8ece)
GPU 9: Tesla T4 (UUID: GPU-197d3f17-6807-d6d8-a31c-f54ef78bcb2d)
GPU 10: Tesla T4 (UUID: GPU-e36ec7af-fa51-2498-6bb9-1f2e57bed4c5)
GPU 11: NVIDIA A100 80GB PCIe (UUID: GPU-2d25d82d-c487-73b0-9341-82e74253106e)
GPU 12: Tesla T4 (UUID: GPU-4195d034-0e80-bd51-3c68-3069d48177db)
GPU 13: Tesla T4 (UUID: GPU-030e587b-ae70-3854-4a86-b888f04de428)
GPU 14: Tesla T4 (UUID: GPU-c450823e-5524-7032-228b-140b3187d733)
GPU 15: Tesla T4 (UUID: GPU-8b6ef8ec-186a-2e88-d308-569892e57eeb)
GPU 16: Tesla T4 (UUID: GPU-7edb1e91-a5cb-40a4-b470-e1548a76e6d9)
``` python
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "14"
```
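With `CUDA_DEVICE_ORDER` set to `PCI_BUS_ID`, CUDA numbers devices by PCI bus ID, which is the ordering that `nvidia-smi` uses, so index 14 in `CUDA_VISIBLE_DEVICES` refers to GPU 14 in the listing above. The sketch below shows one way to look a device up by its index in such a listing. The helper `parse_gpu_listing` and the two-line sample (copied from the output above) are illustrative assumptions, not part of the package.

``` python
import re

# One line of `nvidia-smi -L` output: index, device name, and UUID.
GPU_LINE = re.compile(r'^GPU (\d+): (.+) \(UUID: (GPU-[0-9a-f-]+)\)$')


def parse_gpu_listing(listing):
    """Map each GPU index to its (name, UUID) pair. Hypothetical helper."""
    gpus = {}
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            index, name, uuid = match.groups()
            gpus[int(index)] = (name, uuid)
    return gpus


# Two lines taken from the listing above, used here as sample input.
sample_listing = """\
GPU 0: NVIDIA A40 (UUID: GPU-177e5a84-366f-6464-1bbb-908f2dd979cc)
GPU 14: Tesla T4 (UUID: GPU-c450823e-5524-7032-228b-140b3187d733)
"""

gpus = parse_gpu_listing(sample_listing)
name, uuid = gpus[14]
print(f'GPU 14 is a {name}')
```

Once `CUDA_VISIBLE_DEVICES` contains a single index, the selected GPU appears to CUDA as device 0 inside the process.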
Set up logging to display informational messages.
``` python
import logging
import sys
```

``` python
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format='%(message)s')
```
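To illustrate what this configuration does, the sketch below applies the same level and format to a separate logger that writes into an in-memory buffer, so the notebook's root logger is left untouched. The logger name `example` and the messages are made up for the demonstration.

``` python
import io
import logging

# Capture the output in memory instead of writing to sys.stdout.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter('%(message)s'))

# A dedicated logger, so that the root logger configuration is not changed.
example_logger = logging.getLogger('example')
example_logger.addHandler(handler)
example_logger.setLevel(logging.INFO)
example_logger.propagate = False

example_logger.info('Training started')   # Emitted: INFO passes the level filter.
example_logger.debug('Batch size is 16')  # Dropped: DEBUG is below INFO.

captured = buffer.getvalue()
print(captured, end='')
```

With the `'%(message)s'` format, only the message text is written, without a timestamp, level, or logger name.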
## Train models
To train our models, we will use two different schedules and four different types of datasets from two different methods for finding named entities. In total, we will train 16 different NER models.
We will fine-tune [a pretrained `xlm-roberta-base` model][1] with the following two schedules for our masked language modeling (MLM) and named entity recognition (NER) objectives:
- First with MLM for at most 5 epochs and then with NER for at most 5 epochs.
- Using both MLM and NER in parallel for at most 10 epochs.
``` python
schedule_names = ['fine-tuning', 'parallel']
```
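The two schedules above can be summarized as data. The sketch below is only a description of the schedules as stated in the text: the `SCHEDULES` dictionary and `max_total_epochs` are hypothetical names, and the actual training logic lives in the package and may be organized differently.

``` python
# Each schedule is a list of phases: (objectives trained together, maximum epochs).
# Hypothetical representation, based only on the description above.
SCHEDULES = {
    'fine-tuning': [(('MLM',), 5), (('NER',), 5)],
    'parallel': [(('MLM', 'NER'), 10)],
}


def max_total_epochs(schedule_name):
    """Return the upper bound on the number of epochs for a schedule."""
    return sum(max_epochs for _, max_epochs in SCHEDULES[schedule_name])


for schedule_name in SCHEDULES:
    print(schedule_name, max_total_epochs(schedule_name))
```

Both schedules are bounded by the same total of 10 epochs; they differ in whether the objectives are trained one after the other or together.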
We will use datasets produced from the results of two search methods:
- Fuzzy regexes
- Manatee
``` python
search_methods = ['manatee', 'fuzzy-regex']
```
For both Manatee and fuzzy regexes, we will use four different datasets of different sizes and different quality of annotations:
- Using all results from all documents.
- Using all results from documents that have been marked as relevant by expert annotators.
- Using results from sentences that don't cross document boundaries from all documents.
- Using results from sentences that don't cross document boundaries from documents that have been marked as relevant by expert annotators.
``` python
cross_page_boundaries_values = ['non-crossing', 'all']
only_relevant_values = ['only-relevant', 'all']
```
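The four lists of values combine into 2 × 2 × 2 × 2 = 16 configurations. As a quick check, the sketch below enumerates them with `itertools.product`, using the same iteration order and the same model naming pattern as the training loop in this notebook. The list `model_basenames` is a name introduced here for illustration.

``` python
from itertools import product

schedule_names = ['fine-tuning', 'parallel']
search_methods = ['manatee', 'fuzzy-regex']
cross_page_boundaries_values = ['non-crossing', 'all']
only_relevant_values = ['only-relevant', 'all']

# One basename per combination of schedule, relevance filter, search method,
# and page-boundary filter.
model_basenames = [
    f'model_ner_{search_method}_{cross_page_boundaries}_{only_relevant}_{schedule_name}'
    for schedule_name, only_relevant, search_method, cross_page_boundaries in product(
        schedule_names, only_relevant_values, search_methods, cross_page_boundaries_values)
]

print(len(model_basenames))
print(model_basenames[0])
```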
We will train all our models in turn:
``` python
from itertools import product
```

``` python
from ahisto_named_entity_search.recognition import NerModel
```
``` python
models = []
for schedule_name, only_relevant, search_method, cross_page_boundaries in product(
        schedule_names, only_relevant_values, search_methods, cross_page_boundaries_values):
    model_basename = f'model_ner_{search_method}_{cross_page_boundaries}_{only_relevant}_{schedule_name}'
    model_checkpoint_basename = f'{model_basename}_checkpoints'
    sentence_basename = f'dataset_mlm_{cross_page_boundaries}_{only_relevant}'
    training_sentence_basename = f'{sentence_basename}_training'
    validation_sentence_basename = f'{sentence_basename}_validation'
    tagged_sentence_basename = f'dataset_ner_{search_method}_{cross_page_boundaries}_{only_relevant}'
    training_tagged_sentence_basename = f'{tagged_sentence_basename}_training'
    validation_tagged_sentence_basename = f'{tagged_sentence_basename}_validation'
    try:
        # Reuse a previously trained model if one has been saved.
        model = NerModel.load(model_basename)
        model.model  # Try actually loading the NER model
    except EnvironmentError:
        # No saved model could be loaded, so train one and save it.
        project_name = f'AHISTO NER: {search_method}, {cross_page_boundaries}, {only_relevant}'
        os.environ['COMET_PROJECT_NAME'] = project_name
        model = NerModel.train_and_save(model_checkpoint_basename, model_basename,
                                        training_sentence_basename, validation_sentence_basename,
                                        validation_tagged_sentence_basename, schedule_name)
    models.append(model)
```
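The loop follows a load-or-train pattern: it first tries to load a saved model and trains a new one only when loading fails with `EnvironmentError`, which is an alias of `OSError` in Python 3. The sketch below isolates that pattern with stand-in functions. `load_or_train`, `fake_load`, `fake_train`, and the `saved_models` dictionary are all hypothetical and only mimic a model store.

``` python
def load_or_train(basename, load, train):
    """Return a saved model if it can be loaded, and train a new one otherwise."""
    try:
        return load(basename)
    except EnvironmentError:  # Alias of OSError; covers FileNotFoundError.
        return train(basename)


# Stand-ins for a model store; a real store would read from and write to disk.
saved_models = {'model_a': 'cached model_a'}


def fake_load(basename):
    if basename not in saved_models:
        raise FileNotFoundError(basename)
    return saved_models[basename]


def fake_train(basename):
    saved_models[basename] = f'trained {basename}'
    return saved_models[basename]


first = load_or_train('model_a', fake_load, fake_train)   # Loaded from the store.
second = load_or_train('model_b', fake_load, fake_train)  # Missing, so trained.
third = load_or_train('model_b', fake_load, fake_train)   # Loaded on the second call.
print(first, second, third, sep='\n')
```

Because of this pattern, re-running the notebook after an interruption skips the configurations that have already been trained.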
## Evaluate NER models
``` python
testing_tagged_sentence_basename = 'dataset_ner_manatee_non-crossing_only-relevant_training'
```

``` python
models[0]
```
%% Output
NerModel: /nlp/projekty/ahisto/public_html/named-entity-search/results/model_ner_manatee_non-crossing_only-relevant_fine-tuning/TokenClassification
``` python
f_score = models[0].test(testing_tagged_sentence_basename)
print(f'Mean F-score: {f_score * 100.0:.2f}%')
```
%% Output
Mean F-score: 34.26%
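The cell above evaluates only `models[0]`. To compare all trained models, the same `test` call could be repeated for each model and the one with the highest mean F-score kept. This is a minimal sketch, assuming that `test` returns a single float as it does above. `StubModel`, `select_best_model`, and the scores are placeholders for illustration, not measured results.

``` python
class StubModel:
    """Stand-in with a test() method; the stored score is a placeholder."""

    def __init__(self, name, f_score):
        self.name = name
        self._f_score = f_score

    def test(self, tagged_sentence_basename):
        return self._f_score


def select_best_model(models, tagged_sentence_basename):
    """Return the model with the highest mean F-score together with that score."""
    scores = [(model.test(tagged_sentence_basename), model) for model in models]
    best_f_score, best_model = max(scores, key=lambda pair: pair[0])
    return best_model, best_f_score


# Placeholder models with made-up scores, used only to exercise the helper.
stub_models = [StubModel('first', 0.25), StubModel('second', 0.5)]
best_model, best_f_score = select_best_model(stub_models, 'placeholder_dataset')
print(f'Best model: {best_model.name} ({best_f_score * 100.0:.2f}%)')
```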
