Commit 7770581d authored by Vít Novotný

Add `03_train_ner_models.ipynb`

parent abb3e082
Pipeline #146957 failed with stage
in 9 minutes and 58 seconds
%% Cell type:markdown id:089e0573-56f2-4827-a2d5-4b88c8e24e43 tags:
# Train NER models
In this notebook, we will train a number of named entity recognition (NER) models using different training schedules and training/validation datasets. Then, we will select the best model using our test dataset.
%% Cell type:markdown id:9d9fc44f-6c13-47c5-969c-8d26448d2c2d tags:
## Preliminaries
We will begin with a bit of boilerplate, logging information and setting up the computational environment.
%% Cell type:code id:e9047d58-9d3d-4123-a1ed-60a0724295dc tags:
``` python
! hostname
```
%% Output
%% Cell type:code id:e30f3d27-4c1c-4edf-a0be-f0febde2139b tags:
``` python
! python -V
```
%% Output
Python 3.8.10
%% Cell type:markdown id:e1f13f57-c900-45ed-8698-3668771d7098 tags:
Install the current version of the package and its dependencies.
%% Cell type:code id:f990803b-d7b3-4240-9b7c-16b9865a2c5d tags:
``` python
! pip install .
```
%% Cell type:markdown id:0587a39e-c5dd-4e52-807a-13aecbdeb5bd tags:
Make sure NumPy does not parallelize by limiting each of the underlying numerical libraries to a single thread.
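These variables are typically read once, when the numerical backend (OpenMP, OpenBLAS, MKL, and so on) is first loaded, so they need to be set before NumPy is imported. A minimal sketch of a helper that makes this ordering explicit; the name `limit_numeric_threads` is hypothetical and not part of this project:

```python
import os
import sys
import warnings

# Environment variables consulted by common numerical backends.
THREAD_VARIABLES = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_numeric_threads(num_threads=1, environ=None):
    """Set the thread-count variables (hypothetical helper, for illustration)."""
    if environ is None:
        environ = os.environ
    if "numpy" in sys.modules:
        # The backends usually read these variables only at load time.
        warnings.warn("numpy is already imported; the limits may be ignored")
    for name in THREAD_VARIABLES:
        environ[name] = str(num_threads)
    return environ


# Demonstrate on a plain dict so that the real environment stays untouched.
settings = limit_numeric_threads(1, environ={})
print(settings["OMP_NUM_THREADS"])
```

Calling the helper without the `environ` argument would modify `os.environ` itself, which is what the notebook does directly.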
%% Cell type:code id:0444735d-2dd0-40b0-b9a3-bbb03416f65c tags:
``` python
import os
```
%% Cell type:code id:3c06551f-1a14-4ec6-82c8-31c8314639bd tags:
``` python
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
```
%% Cell type:markdown id:87b8905f-6989-415f-b271-9c9feb56e560 tags:
Pick the GPU that we will use.
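Setting `CUDA_DEVICE_ORDER` to `PCI_BUS_ID` makes CUDA enumerate devices by PCI bus ID, which is also the order that `nvidia-smi` uses, so an index read from the listing can be used directly in `CUDA_VISIBLE_DEVICES`. As a sketch, an index could also be picked by device name from `nvidia-smi -L`-style text; the helper `find_gpu_indices` and the abbreviated sample listing (with made-up UUIDs) are illustrative assumptions:

```python
import re

# Abbreviated sample in the format printed by `nvidia-smi -L` (UUIDs are made up).
SAMPLE_LISTING = """\
GPU 0: NVIDIA A40 (UUID: GPU-00000000-0000-0000-0000-000000000000)
GPU 1: Tesla T4 (UUID: GPU-11111111-1111-1111-1111-111111111111)
GPU 2: Tesla T4 (UUID: GPU-22222222-2222-2222-2222-222222222222)
"""

# Captures the device index and the device name of one listing line.
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: [^)]+\)$")


def find_gpu_indices(listing, name):
    """Return the indices of all listed devices with the given name."""
    indices = []
    for line in listing.splitlines():
        match = GPU_LINE.match(line.strip())
        if match and match.group(2) == name:
            indices.append(int(match.group(1)))
    return indices


t4_indices = find_gpu_indices(SAMPLE_LISTING, "Tesla T4")
print(t4_indices)  # [1, 2]
```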
%% Cell type:code id:222cafea-1124-40f3-8156-ac6ee364e83b tags:
``` python
! nvidia-smi -L
```
%% Output
GPU 0: NVIDIA A40 (UUID: GPU-177e5a84-366f-6464-1bbb-908f2dd979cc)
GPU 1: Tesla T4 (UUID: GPU-cf4e7061-619f-5b3b-a217-410f6d506d62)
GPU 2: Tesla T4 (UUID: GPU-00386b4a-741a-aac4-b833-b678a811936f)
GPU 3: Tesla T4 (UUID: GPU-10531c8c-13c3-8e82-302b-91a5615701d6)
GPU 4: Tesla T4 (UUID: GPU-82eac985-cf18-1379-cbcc-e8d71246e28c)
GPU 5: Tesla T4 (UUID: GPU-552f5db8-cec9-3733-3394-17c1ecbc8b85)
GPU 6: Tesla T4 (UUID: GPU-7d2ad51d-6c12-c878-1a30-a21a7fe9c7bd)
GPU 7: Tesla T4 (UUID: GPU-81bd2022-c6f6-4a67-d3f3-f461591e20ab)
GPU 8: Tesla T4 (UUID: GPU-4f6616fb-96e0-adbd-6ee5-7b6146de8ece)
GPU 9: Tesla T4 (UUID: GPU-197d3f17-6807-d6d8-a31c-f54ef78bcb2d)
GPU 10: Tesla T4 (UUID: GPU-e36ec7af-fa51-2498-6bb9-1f2e57bed4c5)
GPU 11: NVIDIA A100 80GB PCIe (UUID: GPU-2d25d82d-c487-73b0-9341-82e74253106e)
GPU 12: Tesla T4 (UUID: GPU-4195d034-0e80-bd51-3c68-3069d48177db)
GPU 13: Tesla T4 (UUID: GPU-030e587b-ae70-3854-4a86-b888f04de428)
GPU 14: Tesla T4 (UUID: GPU-c450823e-5524-7032-228b-140b3187d733)
GPU 15: Tesla T4 (UUID: GPU-8b6ef8ec-186a-2e88-d308-569892e57eeb)
GPU 16: Tesla T4 (UUID: GPU-7edb1e91-a5cb-40a4-b470-e1548a76e6d9)
%% Cell type:code id:4b62818d-125a-46f8-80da-ecdc1bead095 tags:
``` python
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "14"
```
%% Cell type:markdown id:425fbbe4-b88a-420e-9897-ec861cdf111c tags:
Set up logging to display informational messages.
%% Cell type:code id:e271a6d3-4757-4020-a928-20c6406bd26d tags:
``` python
import logging
import sys
```
%% Cell type:code id:cf472e29-276b-4202-a975-d63f1b9c28aa tags:
``` python
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format='%(message)s')
```
%% Cell type:markdown id:6c8a96f9-c6f1-453a-a93c-d133c0d437f6 tags:
## Train models
To train our models, we will use two different schedules and four different types of datasets for each of two different methods for finding named entities. In total, we will train 2 × 4 × 2 = 16 different NER models.
%% Cell type:markdown id:ec4926c7-5477-46c5-9d27-01b64179bcb4 tags:
We will fine-tune [a pretrained `xlm-roberta-base` model][1] with the following two schedules for our masked language modeling (MLM) and named entity recognition (NER) objectives:
- First with MLM for at most 5 epochs and then with NER for at most 5 epochs.
- Using both MLM and NER in parallel for at most 10 epochs.
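One way to picture the two schedules is as a list of stages, each pairing a set of objectives with an epoch limit. This is only an illustrative data structure; the names `SCHEDULES` and `max_total_epochs` are hypothetical, and how `NerModel` implements the schedules internally is not shown in this notebook:

```python
# Each stage is (objectives trained together, maximum number of epochs).
SCHEDULES = {
    "fine-tuning": [(("MLM",), 5), (("NER",), 5)],
    "parallel": [(("MLM", "NER"), 10)],
}


def max_total_epochs(schedule_name):
    """Sum the epoch limits over all stages of a schedule."""
    return sum(max_epochs for _, max_epochs in SCHEDULES[schedule_name])


for name in SCHEDULES:
    print(name, max_total_epochs(name))
```

Both schedules share the same upper bound of 10 epochs in total, so they differ in how the objectives are ordered rather than in the training budget.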
%% Cell type:code id:f23106ca-bf03-4c27-9548-927aa01d89a4 tags:
``` python
schedule_names = ['fine-tuning', 'parallel']
```
%% Cell type:markdown id:dd602ab5-83a9-4924-9a37-cf3b521cdd92 tags:
We will use datasets produced from the results of two search methods:
- Fuzzy regexes
- Manatee
%% Cell type:code id:36a4c3a2-56e3-439c-bd6c-34f68fb9fb4c tags:
``` python
search_methods = ['manatee', 'fuzzy-regex']
```
%% Cell type:markdown id:7e638751-ea86-481a-bb41-7306b0c56445 tags:
For both Manatee and fuzzy regexes, we will use four different datasets of different sizes and different quality of annotations:
- Using all results from all documents.
- Using all results from documents that have been marked as relevant by expert annotators.
- Using results from sentences that don't cross page boundaries from all documents.
- Using results from sentences that don't cross page boundaries from documents that have been marked as relevant by expert annotators.
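The count of 16 models follows from the Cartesian product of the options: 2 schedules × 2 search methods × 2 boundary settings × 2 relevance settings. A quick sketch that enumerates the model names with the same naming pattern as the training loop in this notebook; the variable `model_basenames` is introduced here only for illustration:

```python
from itertools import product

schedule_names = ['fine-tuning', 'parallel']
search_methods = ['manatee', 'fuzzy-regex']
cross_page_boundaries_values = ['non-crossing', 'all']
only_relevant_values = ['only-relevant', 'all']

# One basename per combination, in the order in which the models are trained.
model_basenames = [
    f'model_ner_{search_method}_{cross_page_boundaries}_{only_relevant}_{schedule_name}'
    for schedule_name, only_relevant, search_method, cross_page_boundaries in product(
        schedule_names, only_relevant_values, search_methods, cross_page_boundaries_values)
]

print(len(model_basenames))
print(model_basenames[0])
```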
%% Cell type:code id:15232a72-22de-488d-8446-77d655d80a66 tags:
``` python
cross_page_boundaries_values = ['non-crossing', 'all']
only_relevant_values = ['only-relevant', 'all']
```
%% Cell type:markdown id:e1aca5ae-a2a7-481f-9b92-3f0e975bbac3 tags:
We will train all our models in turn:
%% Cell type:code id:dd1ab471-88d2-4025-aa5f-9fc94a29217f tags:
``` python
from itertools import product
```
%% Cell type:code id:00e13fff-2a8b-4d5a-a014-65dbc0055b4d tags:
``` python
from ahisto_named_entity_search.recognition import NerModel
```
%% Cell type:code id:0b8cb413-c70b-48f8-b6c7-d43a944ee60c tags:
``` python
models = []
for schedule_name, only_relevant, search_method, cross_page_boundaries in product(
        schedule_names, only_relevant_values, search_methods, cross_page_boundaries_values):
    # Derive the basenames of the model and its datasets from the configuration.
    model_basename = f'model_ner_{search_method}_{cross_page_boundaries}_{only_relevant}_{schedule_name}'
    model_checkpoint_basename = f'{model_basename}_checkpoints'
    sentence_basename = f'dataset_mlm_{cross_page_boundaries}_{only_relevant}'
    training_sentence_basename = f'{sentence_basename}_training'
    validation_sentence_basename = f'{sentence_basename}_validation'
    tagged_sentence_basename = f'dataset_ner_{search_method}_{cross_page_boundaries}_{only_relevant}'
    training_tagged_sentence_basename = f'{tagged_sentence_basename}_training'
    validation_tagged_sentence_basename = f'{tagged_sentence_basename}_validation'
    try:
        # Reuse a previously trained model if one has been saved.
        model = NerModel.load(model_basename)
        model.model  # Try actually loading the NER model
    except EnvironmentError:
        # Otherwise, train the model and log the run under a per-dataset project name.
        project_name = f'AHISTO NER: {search_method}, {cross_page_boundaries}, {only_relevant}'
        os.environ['COMET_PROJECT_NAME'] = project_name
        model = NerModel.train_and_save(model_checkpoint_basename, model_basename,
                                        training_sentence_basename, validation_sentence_basename,
                                        validation_tagged_sentence_basename, schedule_name)
    models.append(model)  # Keep the model for the evaluation
```
%% Cell type:markdown id:c9a695c8-4585-4af5-9a99-544a3e340cd3 tags:
## Evaluate NER models
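The evaluation in this section reports a mean F-score. How `NerModel.test` computes and averages it is not shown in this notebook, so the following is only a reminder of the metric itself, assuming the balanced F1 variant: the harmonic mean of precision and recall. The function `f1_score` and the example counts are hypothetical:

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Compute the balanced F-score (F1) from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or actual == 0:
        return 0.0  # No correct predictions, so the score is zero by convention
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Example counts: 3 entities found correctly, 1 spurious, 3 missed.
example = f1_score(true_positives=3, false_positives=1, false_negatives=3)
print(f'F-score: {example * 100.0:.2f}%')  # F-score: 60.00%
```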
%% Cell type:code id:984f0f20-72f4-4293-b5ad-2065c4eb6806 tags:
``` python
testing_tagged_sentence_basename = 'dataset_ner_manatee_non-crossing_only-relevant_training'
```
%% Cell type:code id:565d4512-c7b6-47be-a6e2-c487b1043ef7 tags:
``` python
models[0]
```
%% Output
NerModel: /nlp/projekty/ahisto/public_html/named-entity-search/results/model_ner_manatee_non-crossing_only-relevant_fine-tuning/TokenClassification
%% Cell type:code id:eba26620-5a9f-4a41-aef1-1b80e280cd2a tags:
``` python
f_score = models[0].test(testing_tagged_sentence_basename)
print(f'Mean F-score: {f_score * 100.0:.2f}%')
```
%% Output
Mean F-score: 34.26%