"In this notebook, we will train a number of named entity recognition (NER) models using different training schedules and training/validation datasets. Then, we will select the best model using our test dataset."
]
},
{
"cell_type": "markdown",
"id": "9d9fc44f-6c13-47c5-969c-8d26448d2c2d",
"metadata": {},
"source": [
"## Preliminaries\n",
"\n",
"We will begin with a bit of boilerplate, logging information and setting up the computational environment."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e9047d58-9d3d-4123-a1ed-60a0724295dc",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"apollo.fi.muni.cz\n"
]
}
],
"source": [
"! hostname"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "e30f3d27-4c1c-4edf-a0be-f0febde2139b",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Python 3.8.10\n"
]
}
],
"source": [
"! python -V"
]
},
{
"cell_type": "markdown",
"id": "e1f13f57-c900-45ed-8698-3668771d7098",
"metadata": {},
"source": [
"Install the current version of the package and its dependencies."
"To train our models, we will use two different schedules and four different types of datasets from two different methods for finding named entities. In total, we will train 16 different NER models."
]
},
{
"cell_type": "markdown",
"id": "ec4926c7-5477-46c5-9d27-01b64179bcb4",
"metadata": {},
"source": [
"We will fine-tune [a pretrained `xlm-roberta-base` model][1] with the following two schedules for our masked language modeling (MLM) and named entity recognition (NER) objectives:\n",
"\n",
"- First with MLM for at most 5 epochs and then with NER for at most 5 epochs.\n",
"- Using both MLM and NER in parallel for at most 10 epochs.\n",
"\n",
" [1]: https://huggingface.co/xlm-roberta-base"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f23106ca-bf03-4c27-9548-927aa01d89a4",
"metadata": {},
"outputs": [],
"source": [
"schedule_names = ['fine-tuning', 'parallel']"
]
},
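{
"cell_type": "markdown",
"id": "4f8a2c1e-7b3d-4e5a-9c06-1a2b3c4d5e6f",
"metadata": {},
"source": [
"The two schedules can be summarized as sequences of (objective, maximum epochs) stages. The `schedules` dictionary below is only an illustrative sketch of this plan, not the training package's actual configuration format; it shows that both schedules share the same total budget of 10 epochs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d1e5f2a-6c4b-4a7e-b903-2f3e4d5c6b7a",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch only: each schedule as a list of (objective, max epochs)\n",
"# stages; the actual schedule implementation is provided by the package.\n",
"schedules = {\n",
"    'fine-tuning': [('mlm', 5), ('ner', 5)],  # sequential: MLM first, then NER\n",
"    'parallel': [('mlm+ner', 10)],            # joint: both objectives at once\n",
"}\n",
"\n",
"# Both schedules use the same total epoch budget:\n",
"{name: sum(epochs for _, epochs in stages) for name, stages in schedules.items()}"
]
},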
{
"cell_type": "markdown",
"id": "dd602ab5-83a9-4924-9a37-cf3b521cdd92",
"metadata": {},
"source": [
"We will use datasets produced from the results of two search methods:\n",
"\n",
"- Fuzzy regexes\n",
"- Manatee"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "36a4c3a2-56e3-439c-bd6c-34f68fb9fb4c",
"metadata": {},
"outputs": [],
"source": [
"search_methods = ['manatee', 'fuzzy-regex']"
]
},
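{
"cell_type": "markdown",
"id": "2b7c9e4d-1a5f-4c8b-a012-3d4e5f6a7b8c",
"metadata": {},
"source": [
"Each trained model corresponds to one combination of schedule, search method, and dataset variant, so we can sanity-check the count of 16 models by enumerating the cross product. The `dataset_types` names below are illustrative placeholders (the actual dataset identifiers are defined later), and the two lists are repeated from the cells above to keep this cell self-contained."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c3d7f5e-8b2a-4d6c-bf14-5e6f7a8b9c0d",
"metadata": {},
"outputs": [],
"source": [
"from itertools import product\n",
"\n",
"# Repeated from the cells above; dataset variant names are placeholders.\n",
"schedule_names = ['fine-tuning', 'parallel']\n",
"search_methods = ['manatee', 'fuzzy-regex']\n",
"dataset_types = ['all', 'relevant', 'sentences', 'relevant-sentences']\n",
"\n",
"# One NER model per (schedule, search method, dataset variant) combination:\n",
"configurations = list(product(schedule_names, search_methods, dataset_types))\n",
"len(configurations)"
]
},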
{
"cell_type": "markdown",
"id": "7e638751-ea86-481a-bb41-7306b0c56445",
"metadata": {},
"source": [
"For both Manatee and fuzzy regexes, we will use four different datasets of different sizes and different quality of annotations:\n",
"\n",
"- Using all results from all documents.\n",
"- Using all results from documents that have been marked as relevant by expert annotators.\n",
"- Using results from sentences that don't cross document boundaries from all documents.\n",
"- Using results from sentences that don't cross document boundaries from documents that have been marked as relevant by expert annotators."