Commit 224e461d authored by Vít Novotný

Evaluate `model_ner_fuzzy-regex_non-crossing_all_fine-tuning`

parent 3063e29d
Pipeline #147394 passed with stage
in 9 minutes and 20 seconds
%% Cell type:markdown id:089e0573-56f2-4827-a2d5-4b88c8e24e43 tags:
# Train NER models
In this notebook, we will train a number of named entity recognition (NER) models using different training schedules and training/validation datasets. Then, we will select the best model using our test dataset.
%% Cell type:markdown id:9d9fc44f-6c13-47c5-969c-8d26448d2c2d tags:
## Preliminaries
We will begin with a bit of boilerplate, logging information and setting up the computational environment.
%% Cell type:code id:e9047d58-9d3d-4123-a1ed-60a0724295dc tags:
``` python
! hostname
```
%% Output
apollo.fi.muni.cz
%% Cell type:code id:e30f3d27-4c1c-4edf-a0be-f0febde2139b tags:
``` python
! python -V
```
%% Output
Python 3.8.10
%% Cell type:markdown id:e1f13f57-c900-45ed-8698-3668771d7098 tags:
Install the current version of the package and its dependencies.
%% Cell type:code id:f990803b-d7b3-4240-9b7c-16b9865a2c5d tags:
``` python
%%capture
! pip install .
```
%% Cell type:markdown id:0587a39e-c5dd-4e52-807a-13aecbdeb5bd tags:
Make sure NumPy and its numerical backends do not parallelize, by limiting each of them to a single thread.
%% Cell type:code id:0444735d-2dd0-40b0-b9a3-bbb03416f65c tags:
``` python
import os
```
%% Cell type:code id:3c06551f-1a14-4ec6-82c8-31c8314639bd tags:
``` python
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
```
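The thread limits above generally only take effect if they are set before NumPy (or another library that loads the numerical backends) is imported for the first time. A minimal sketch of the same setup wrapped in a helper; the function name `limit_threads` is hypothetical and not part of the project:

```python
import os

# Environment variables read by the common numerical backends (same names as above).
THREAD_VARIABLES = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def limit_threads(num_threads: int = 1) -> dict:
    """Set every thread-count variable and return the values that were set."""
    for name in THREAD_VARIABLES:
        os.environ[name] = str(num_threads)
    return {name: os.environ[name] for name in THREAD_VARIABLES}


settings = limit_threads(1)  # call this before the first `import numpy`
```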
%% Cell type:markdown id:87b8905f-6989-415f-b271-9c9feb56e560 tags:
Pick the GPU that we will use.
%% Cell type:code id:222cafea-1124-40f3-8156-ac6ee364e83b tags:
``` python
! nvidia-smi -L
```
%% Output
GPU 0: NVIDIA A40 (UUID: GPU-177e5a84-366f-6464-1bbb-908f2dd979cc)
GPU 1: Tesla T4 (UUID: GPU-cf4e7061-619f-5b3b-a217-410f6d506d62)
GPU 2: Tesla T4 (UUID: GPU-00386b4a-741a-aac4-b833-b678a811936f)
GPU 3: Tesla T4 (UUID: GPU-10531c8c-13c3-8e82-302b-91a5615701d6)
GPU 4: Tesla T4 (UUID: GPU-82eac985-cf18-1379-cbcc-e8d71246e28c)
GPU 5: Tesla T4 (UUID: GPU-552f5db8-cec9-3733-3394-17c1ecbc8b85)
GPU 6: Tesla T4 (UUID: GPU-7d2ad51d-6c12-c878-1a30-a21a7fe9c7bd)
GPU 7: Tesla T4 (UUID: GPU-81bd2022-c6f6-4a67-d3f3-f461591e20ab)
GPU 8: Tesla T4 (UUID: GPU-4f6616fb-96e0-adbd-6ee5-7b6146de8ece)
GPU 9: Tesla T4 (UUID: GPU-197d3f17-6807-d6d8-a31c-f54ef78bcb2d)
GPU 10: Tesla T4 (UUID: GPU-e36ec7af-fa51-2498-6bb9-1f2e57bed4c5)
GPU 11: NVIDIA A100 80GB PCIe (UUID: GPU-2d25d82d-c487-73b0-9341-82e74253106e)
GPU 12: Tesla T4 (UUID: GPU-4195d034-0e80-bd51-3c68-3069d48177db)
GPU 13: Tesla T4 (UUID: GPU-030e587b-ae70-3854-4a86-b888f04de428)
GPU 14: Tesla T4 (UUID: GPU-c450823e-5524-7032-228b-140b3187d733)
GPU 15: Tesla T4 (UUID: GPU-8b6ef8ec-186a-2e88-d308-569892e57eeb)
GPU 16: Tesla T4 (UUID: GPU-7edb1e91-a5cb-40a4-b470-e1548a76e6d9)
%% Cell type:code id:4b62818d-125a-46f8-80da-ecdc1bead095 tags:
``` python
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "14"
```
%% Cell type:markdown id:425fbbe4-b88a-420e-9897-ec861cdf111c tags:
Set up logging to display informational messages.
%% Cell type:code id:e271a6d3-4757-4020-a928-20c6406bd26d tags:
``` python
import logging
import sys
```
%% Cell type:code id:cf472e29-276b-4202-a975-d63f1b9c28aa tags:
``` python
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format='%(message)s')
```
%% Cell type:markdown id:205c4a1b-152b-4ccf-878b-1dc73fb1151c tags:
## Load documents and entities
First, we will load all documents and entities.
%% Cell type:code id:e6c69def-7da0-44de-b51b-7533832cfb94 tags:
``` python
from ahisto_named_entity_search.document import Document, load_documents
```
%% Cell type:code id:92a33ca9-9ae7-41da-b1c1-a5ea5e54a032 tags:
``` python
documents = load_documents()
```
%% Output
Loading documents: 100%|████████████████████████████████████████████████████████████████| 268669/268669 [00:06<00:00, 43759.69it/s]
Loading documents: 100%|████████████████████████████████████████████████████████████████| 268669/268669 [00:06<00:00, 41933.74it/s]
%% Cell type:markdown id:6c8a96f9-c6f1-453a-a93c-d133c0d437f6 tags:
## Train models
To train our models, we will use two different schedules and four different types of datasets from two different methods for finding named entities. In total, we will train 16 different NER models.
%% Cell type:markdown id:ec4926c7-5477-46c5-9d27-01b64179bcb4 tags:
We will fine-tune [a pretrained `xlm-roberta-base` model][1] with the following two schedules for our masked language modeling (MLM) and named entity recognition (NER) objectives:
- First with MLM for at most 5 epochs and then with NER for at most 5 epochs.
- Using both MLM and NER in parallel for at most 10 epochs.
[1]: https://huggingface.co/xlm-roberta-base
%% Cell type:code id:f23106ca-bf03-4c27-9548-927aa01d89a4 tags:
``` python
schedule_names = ['fine-tuning', 'parallel']
```
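The two schedules can be written down as data, which makes their epoch budgets easy to compare. This is only an illustrative sketch: the `SCHEDULES` structure and `max_total_epochs` are hypothetical and do not reflect how `NerModel` represents schedules internally.

```python
# Hypothetical description of the two schedules: each phase lists the
# objectives that are trained together and the maximum number of epochs.
SCHEDULES = {
    'fine-tuning': [(('MLM',), 5), (('NER',), 5)],  # MLM first, then NER
    'parallel': [(('MLM', 'NER'), 10)],             # both objectives at once
}


def max_total_epochs(schedule_name: str) -> int:
    """Sum the epoch limits over all phases of a schedule."""
    return sum(max_epochs for _, max_epochs in SCHEDULES[schedule_name])


for name in SCHEDULES:
    print(name, max_total_epochs(name))
```

Under this reading, both schedules have the same upper bound of 10 epochs in total.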
%% Cell type:markdown id:dd602ab5-83a9-4924-9a37-cf3b521cdd92 tags:
We will use datasets produced from the results of two search methods:
- Fuzzy regexes
- Manatee
%% Cell type:code id:36a4c3a2-56e3-439c-bd6c-34f68fb9fb4c tags:
``` python
search_methods = ['manatee', 'fuzzy-regex']
```
%% Cell type:markdown id:7e638751-ea86-481a-bb41-7306b0c56445 tags:
For both Manatee and fuzzy regexes, we will use four different datasets of different sizes and different quality of annotations:
- Using all results from all documents.
- Using all results from documents that have been marked as relevant by expert annotators.
- Using results from sentences that don't cross document boundaries from all documents.
- Using results from sentences that don't cross document boundaries from documents that have been marked as relevant by expert annotators.
%% Cell type:code id:15232a72-22de-488d-8446-77d655d80a66 tags:
``` python
cross_page_boundaries_values = ['non-crossing', 'all']
only_relevant_values = ['only-relevant', 'all']
```
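As a quick sanity check, the four lists can be combined with `itertools.product` to enumerate all configurations. The sketch below repeats the lists so that it runs on its own and builds names with the `model_ner_…` pattern used by the training loop in this notebook:

```python
from itertools import product

schedule_names = ['fine-tuning', 'parallel']
search_methods = ['manatee', 'fuzzy-regex']
cross_page_boundaries_values = ['non-crossing', 'all']
only_relevant_values = ['only-relevant', 'all']

# One basename per combination of schedule, relevance filter, search method,
# and boundary handling.
model_basenames = [
    f'model_ner_{search_method}_{cross_page_boundaries}_{only_relevant}_{schedule_name}'
    for schedule_name, only_relevant, search_method, cross_page_boundaries in product(
        schedule_names, only_relevant_values, search_methods, cross_page_boundaries_values)
]

print(len(model_basenames))
print(model_basenames[0])
```

The count comes out as 2 × 2 × 2 × 2 = 16, matching the number of models stated above.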
%% Cell type:markdown id:e1aca5ae-a2a7-481f-9b92-3f0e975bbac3 tags:
We will train all our models in turn:
%% Cell type:code id:dd1ab471-88d2-4025-aa5f-9fc94a29217f tags:
``` python
from itertools import product
```
%% Cell type:code id:00e13fff-2a8b-4d5a-a014-65dbc0055b4d tags:
``` python
from ahisto_named_entity_search.recognition import NerModel
```
%% Cell type:code id:0b8cb413-c70b-48f8-b6c7-d43a944ee60c tags:
``` python
models = []
for schedule_name, only_relevant, search_method, cross_page_boundaries in product(
        schedule_names, only_relevant_values, search_methods, cross_page_boundaries_values):
    if schedule_name == 'fine-tuning' and search_method == 'fuzzy-regex' and cross_page_boundaries == 'non-crossing' and only_relevant == 'all':
        continue  # TODO: remove me after the fuzzy-regex-non-crossing-all has finished training
    model_basename = f'model_ner_{search_method}_{cross_page_boundaries}_{only_relevant}_{schedule_name}'
    model_checkpoint_basename = f'{model_basename}_checkpoints'
    sentence_basename = f'dataset_mlm_{cross_page_boundaries}_{only_relevant}'
    training_sentence_basename = f'{sentence_basename}_training'
    validation_sentence_basename = f'{sentence_basename}_validation'
    tagged_sentence_basename = f'dataset_ner_{search_method}_{cross_page_boundaries}_{only_relevant}'
    training_tagged_sentence_basename = f'{tagged_sentence_basename}_training'
    validation_tagged_sentence_basename = f'{tagged_sentence_basename}_validation'
    try:
        model = NerModel.load(model_basename)
        # model.model  # Try actually loading the NER model  # TODO: uncomment
    except EnvironmentError:
        project_name = f'AHISTO NER: {search_method}, {cross_page_boundaries}, {only_relevant}'
        os.environ['COMET_PROJECT_NAME'] = project_name
        model = NerModel.train_and_save(model_checkpoint_basename, model_basename,
                                        training_sentence_basename, validation_sentence_basename,
                                        training_tagged_sentence_basename,
                                        validation_tagged_sentence_basename, schedule_name)
    models.append(model)
```
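The loop above follows a load-or-train pattern: it first tries to load a previously saved model and only trains when loading fails with an `EnvironmentError` (an alias of `OSError` in Python 3). A minimal, self-contained sketch of the same pattern, using a JSON file in place of a real model; `load_or_train` is a hypothetical helper, not part of `NerModel`:

```python
import json
import tempfile
from pathlib import Path


def load_or_train(path: Path, train) -> tuple:
    """Return (result, was_cached); `train` is only called on a cache miss."""
    try:
        with path.open('rt') as f:
            return json.load(f), True
    except OSError:  # raised when the file does not exist yet
        result = train()
        with path.open('wt') as f:
            json.dump(result, f)
        return result, False


with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / 'model.json'
    # The first call trains and saves; the second call loads the saved result
    # and never invokes its `train` callback.
    first = load_or_train(path, lambda: {'weights': [0.1, 0.2]})
    second = load_or_train(path, lambda: {'weights': [9.9]})
```

This makes the notebook safe to re-run: models that have already been trained are reused instead of being trained again.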
%% Cell type:markdown id:c9a695c8-4585-4af5-9a99-544a3e340cd3 tags:
## Evaluate NER models
### Quantitative evaluation
To evaluate our models, we will use our smallest (and therefore highest-quality) test dataset.
%% Cell type:code id:984f0f20-72f4-4293-b5ad-2065c4eb6806 tags:
``` python
testing_tagged_sentence_basename = 'dataset_ner_manatee_non-crossing_only-relevant_testing'
```
%% Cell type:markdown id:310a1802-2a50-4dfe-b9ec-60e2dcd5e0ff tags:
For each model, we will compute the $F_\beta$-score ($\beta = 0.25$) on our test dataset.
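For reference, the standard definition is $F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}$ for precision $P$ and recall $R$, so $\beta = 0.25$ weights precision more heavily than recall. The sketch below implements this textbook formula; it is an assumption that `NerModel.test` computes its scores the same way, and the function name `f_beta_score` is hypothetical.

```python
def f_beta_score(precision: float, recall: float, beta: float = 0.25) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    beta_squared = beta ** 2
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)


# Swapping precision and recall shows which of the two the score rewards.
high_precision = f_beta_score(precision=0.9, recall=0.5)
high_recall = f_beta_score(precision=0.5, recall=0.9)
print(f'{high_precision:.3f} {high_recall:.3f}')
```

With $\beta < 1$, the high-precision system scores well above the high-recall one even though both have the same pair of values.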
%% Cell type:code id:e4ff7558-61ed-4208-95b3-2e42a7b6533f tags:
``` python
evaluated_models = list()
f_scores_dict = dict()
evaluators = None
def evaluate_model(model: NerModel):
    global evaluators
    test_result = model.test(testing_tagged_sentence_basename)
    if evaluators is None:
        evaluators = list(test_result.keys())
    evaluated_models.append(model)
    f_scores_dict[model] = [test_result[evaluator] for evaluator in evaluators]
%% Cell type:code id:68b59876-9899-442b-9f0a-aa92e2660269 tags:
``` python
for model in models:
    evaluate_model(model)
```
%% Cell type:markdown id:aa30c995-d336-4196-98fb-29c69dd9ad20 tags:
In the evaluation, we will also include the `Babelscape/wikineural-multilingual-ner` baseline model.
%% Cell type:code id:694daad3-2b04-4e3f-8bfb-bb3fe0c87dd3 tags:
``` python
baseline_model = NerModel('Babelscape/wikineural-multilingual-ner',
labels=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'])
```
%% Cell type:code id:fed4d0a4-5bc4-4af2-8e1b-c5a8a6b61c52 tags:
``` python
evaluate_model(baseline_model)
```
%% Cell type:markdown id:b9b1bfe4-9a0b-43f8-8cbf-9c101f03202e tags:
Finally, we will display the evaluation results in a table.
%% Cell type:code id:38efa732-8afd-4798-809a-ca828a8b960c tags:
``` python
from pathlib import Path
```
%% Cell type:code id:1c9d3bab-53de-4c39-8e4d-cd2978f00925 tags:
``` python
from IPython.display import display
import pandas as pd
from pandas import DataFrame
```
%% Cell type:code id:f47f62ca-1164-45b5-94b9-ff082787e8a9 tags:
``` python
rows = [f'{model} baseline' if model == baseline_model else Path(str(model)).parent.name for model in evaluated_models]
columns = [str(evaluator) for evaluator in evaluators]
data = [f_scores_dict[model] for model in evaluated_models]
f_scores_df = DataFrame(data, columns=columns, index=rows)
```
%% Cell type:code id:053ff5dd-775c-431d-a988-18bf3c4f4f6d tags:
``` python
with pd.option_context('display.float_format', lambda mean_f_score: f'{100.0 * mean_f_score:.5f}%'):
    display(f_scores_df.sort_values(by=['PER+LOC'], ascending=False))
```
%% Output
%% Cell type:markdown id:488c223d-4269-4261-baf3-1d5980ef6ea8 tags:
Based on the evaluation results, we will select our best model.
%% Cell type:code id:f7975e45-ba27-45b4-9b61-1a9c119d434d tags:
``` python
all_evaluator_index, = [index for index, evaluator in enumerate(evaluators) if str(evaluator) == 'PER+LOC']
best_model, _ = max(f_scores_dict.items(), key=lambda x: (x[1][all_evaluator_index], x[0]))
print(best_model)
```
%% Output
/nlp/projekty/ahisto/public_html/named-entity-search/results/model_ner_manatee_all_only-relevant_fine-tuning/TokenClassification
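The selection above relies on two small idioms: unpacking a one-element list with `index, = [...]`, which fails loudly unless exactly one evaluator matches, and a tuple key for `max`, which breaks ties on the score by comparing the models themselves. A self-contained sketch with plain strings in place of `NerModel` objects; the model names, the scores, and every evaluator name other than `PER+LOC` are made up for illustration:

```python
# Illustrative data only: none of these scores come from the real evaluation.
evaluators = ['PER', 'LOC', 'PER+LOC']
scores = {
    'model_a': [0.80, 0.70, 0.75],
    'model_b': [0.70, 0.85, 0.78],
    'model_c': [0.90, 0.60, 0.78],
}

# Raises ValueError unless exactly one evaluator is called 'PER+LOC'.
index, = [i for i, evaluator in enumerate(evaluators) if evaluator == 'PER+LOC']

# Highest PER+LOC score wins; equal scores are ordered by the key itself.
best_name, best_scores = max(scores.items(), key=lambda item: (item[1][index], item[0]))
print(best_name)
```

Here `model_b` and `model_c` tie on `PER+LOC`, and the tie is resolved in favor of the name that sorts last.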
%% Cell type:markdown id:8cd67749-9c40-4f9d-a889-5f12dc0c380d tags:
### Qualitative evaluation
We will use our best model to recognize entities in an example sentence.
%% Cell type:code id:fffd0beb-ac50-4c81-9eb3-cc225214ff63 tags:
``` python
example_document = documents['386/14']
example_snippet = (
    'Ještě příznivěji by se nám objevila tato ukázka literární tvorby slovenské, kdybychom měli na zřeteli literární památky sourodé.',
    'Nejstarší listina vůbec naším národním jazykem psaná je smlouva mezi Petrem Neumburgerem a panem Bočkem z Kunštátu, sepsaná '
    'v Poděbradech 17. prosince 1370.',
    'Nejstarší listina moravská je zápis markrabí Jošta moravského jeho bratru Prokopovi ze dne 17. března 1389.',
)
```
%% Cell type:code id:b83d2a1a-4d8f-400a-a8e8-cba43fe41a83 tags:
``` python
example_sentence_start, example_sentence_end = example_document.find_snippet(example_snippet)
example_sentence = example_document[example_sentence_start:example_sentence_end]
print(example_sentence)
```
%% Output
Nejstarší listina vůbec naším národním jazykem psaná je smlouva mezi Petrem Neumburgerem a panem Bočkem z Kunštátu, sepsaná v Poděbradech 17. prosince 1370
%% Cell type:markdown id:75701abf-d1ed-4547-b701-918d004ebd4a tags:
Here are the labels predicted for our example sentence by our best model:
%% Cell type:code id:de04e6a9-33e5-4e85-9cfc-f0f5bc344677 tags:
``` python
from transformers import pipeline
```
%% Cell type:code id:d11a74e1-f3c8-416c-897e-e921a69dc661 tags:
``` python
def tag_sentence(model: NerModel, sentence: str) -> None:
    named_entity_recognizer = pipeline('ner', model=model.model, tokenizer=model.tokenizer)
    labels = named_entity_recognizer(sentence)
    for label in labels:
        print(f'- {label["entity"]}: {label["word"]}')
```
%% Cell type:code id:675dd306-50b0-4245-9904-effff2432921 tags:
``` python
tag_sentence(best_model, example_sentence)
```
%% Output
- B-PER: ▁Bo
- B-PER: čke
- B-PER: m
- I-PER: ▁z
- I-PER: ▁Kun
- I-PER: št
- I-PER: átu
- I-PER: ,
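The output above lists subword pieces, where `▁` marks the start of a new word. To read it as entities, the pieces can be merged back into words and the words grouped by their BIO labels. The sketch below is a hypothetical helper (`merge_pieces` is not part of the project) and assumes that the labeled pieces are contiguous in the sentence and that a word takes the label of its first piece:

```python
from typing import List, Tuple

WORD_START = '\u2581'  # the "▁" marker that begins a new word


def merge_pieces(pieces: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge (label, piece) pairs into (entity type, text) spans."""
    words: List[Tuple[str, str]] = []
    for label, piece in pieces:
        if piece.startswith(WORD_START) or not words:
            words.append((label, piece.lstrip(WORD_START)))
        else:
            first_label, text = words[-1]
            words[-1] = (first_label, text + piece)  # keep the first piece's label

    entities: List[Tuple[str, str]] = []
    for label, word in words:
        prefix, _, entity_type = label.partition('-')
        if prefix == 'I' and entities and entities[-1][0] == entity_type:
            # An I- label continues the previous entity of the same type.
            entities[-1] = (entity_type, f'{entities[-1][1]} {word}')
        else:
            entities.append((entity_type, word))
    return entities


pieces = [('B-PER', '▁Bo'), ('B-PER', 'čke'), ('B-PER', 'm'), ('I-PER', '▁z'),
          ('I-PER', '▁Kun'), ('I-PER', 'št'), ('I-PER', 'átu'), ('I-PER', ',')]
merged = merge_pieces(pieces)
print(merged)
```

Read this way, the best model finds a single person entity covering "Bočkem z Kunštátu" (with the trailing comma attached). The `transformers` token-classification pipeline also accepts an `aggregation_strategy` argument that performs a similar grouping.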
%% Cell type:markdown id:5177b547-c2de-4a40-999b-cd612dec9010 tags:
For comparison, here are the labels predicted for our example sentence by our baseline model:
%% Cell type:code id:2b1c528d-b4ec-4e96-8ca7-ff05fd244d80 tags:
``` python
tag_sentence(baseline_model, example_sentence)
```
%% Output
- B-PER: Petr
- I-PER: ##em
- I-PER: Neu
- I-PER: ##mburg
- I-PER: ##ere
- I-PER: ##m
- B-PER: Bo
- I-PER: ##čke
- I-PER: ##m
- I-PER: z
- I-PER: Kun
- I-PER: ##š
- I-PER: ##tát
- I-PER: ##u
- B-LOC: Pod
- I-LOC: ##ě
- I-LOC: ##bra
- I-LOC: ##de
- I-LOC: ##ch
TokenClassification: 100%|██████████████████████████████████| 2322/2322 [01:01<00:00, 55.97batches/s, epoch=0, loss=-1, split=eval]