Commit 47e65c62 authored by Vít Novotný's avatar Vít Novotný

Update `*.ipynb`

parent 27e52d2c
%% Cell type:markdown id:3686faff-691a-4263-b567-d9dc639ea934 tags:
# Evaluate different methods
In this notebook, we will evaluate how well different methods find named entities in OCR document texts. At the end of the notebook, we will select one of the methods that is sufficiently fast and accurate.
%% Cell type:markdown id:cddae558-fe33-48b9-8e1e-593ef72ea615 tags:
## Preliminaries
We will begin with a bit of boilerplate: logging information about the computational environment and setting it up.
%% Cell type:code id:1d851992-0506-4624-be6d-2dfaf5c60d3e tags:
``` python
! hostname
```
%% Output
apollo.fi.muni.cz
%% Cell type:code id:2b499462-149c-4dcb-9774-fc7172785f57 tags:
``` python
! python -V
```
%% Output
Python 3.8.5
%% Cell type:markdown id:ed161fcc-f8df-4fb4-bfb1-c52b0e721a37 tags:
Install the current version of the package and its dependencies.
%% Cell type:code id:60510f64-dcd9-4183-859c-e8a5cb3d1ea4 tags:
``` python
%%capture
! pip install .
```
%% Cell type:markdown id:9dcc3741-94a7-4398-aeba-0b2ad5f87976 tags:
Make sure numpy does not parallelize.
%% Cell type:code id:97eb08ee-533d-45f9-ab7a-1aa1153a6cd0 tags:
``` python
import os
```
%% Cell type:code id:518b76d8-d5b6-48bd-984f-31b9726ea0db tags:
``` python
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["VECLIB_MAXIMUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
```
%% Cell type:markdown id:abd4e260-a8c9-4409-a61d-7ea0b5c7f1a1 tags:
Pick the GPU that we will use.
%% Cell type:code id:fb7a4631-dcde-4d9d-a689-6cfd6134a8e6 tags:
``` python
! nvidia-smi -L
```
%% Output
GPU 0: NVIDIA A40 (UUID: GPU-177e5a84-366f-6464-1bbb-908f2dd979cc)
GPU 1: Tesla T4 (UUID: GPU-cf4e7061-619f-5b3b-a217-410f6d506d62)
GPU 2: Tesla T4 (UUID: GPU-00386b4a-741a-aac4-b833-b678a811936f)
GPU 3: Tesla T4 (UUID: GPU-10531c8c-13c3-8e82-302b-91a5615701d6)
GPU 4: Tesla T4 (UUID: GPU-82eac985-cf18-1379-cbcc-e8d71246e28c)
GPU 5: Tesla T4 (UUID: GPU-552f5db8-cec9-3733-3394-17c1ecbc8b85)
GPU 6: Tesla T4 (UUID: GPU-7d2ad51d-6c12-c878-1a30-a21a7fe9c7bd)
GPU 7: Tesla T4 (UUID: GPU-81bd2022-c6f6-4a67-d3f3-f461591e20ab)
GPU 8: Tesla T4 (UUID: GPU-4f6616fb-96e0-adbd-6ee5-7b6146de8ece)
GPU 9: Tesla T4 (UUID: GPU-197d3f17-6807-d6d8-a31c-f54ef78bcb2d)
GPU 10: Tesla T4 (UUID: GPU-e36ec7af-fa51-2498-6bb9-1f2e57bed4c5)
GPU 11: NVIDIA A100 80GB PCIe (UUID: GPU-2d25d82d-c487-73b0-9341-82e74253106e)
GPU 12: Tesla T4 (UUID: GPU-4195d034-0e80-bd51-3c68-3069d48177db)
GPU 13: Tesla T4 (UUID: GPU-030e587b-ae70-3854-4a86-b888f04de428)
GPU 14: Tesla T4 (UUID: GPU-c450823e-5524-7032-228b-140b3187d733)
GPU 15: Tesla T4 (UUID: GPU-8b6ef8ec-186a-2e88-d308-569892e57eeb)
GPU 16: Tesla T4 (UUID: GPU-7edb1e91-a5cb-40a4-b470-e1548a76e6d9)
%% Cell type:code id:211c301b-7c54-4d48-8c6c-2903f25a572b tags:
``` python
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "14"
```
%% Cell type:markdown id:0b4d82c9-2ea5-4eae-a6b6-90f0509a5ab3 tags:
Set up logging to display informational messages.
%% Cell type:code id:ef0c3813-27aa-42d3-b3b2-88d24b672644 tags:
``` python
import logging
import sys
```
%% Cell type:code id:e9574c4f-5d18-4d02-b8f3-fde770c7e1b9 tags:
``` python
logging.basicConfig(stream=sys.stdout, level=logging.INFO, format='%(message)s')
```
%% Cell type:markdown id:fd19862d-3967-4ee8-9e4c-79c2c01abf54 tags:
## Load documents and entities
First, we will load all documents and entities.
%% Cell type:code id:d987f79c-7352-492a-8a22-93a5fe97653c tags:
``` python
from ahisto_named_entity_search.entity import Entity, Person, Place, load_entities
```
%% Cell type:code id:b5b7fce4-74a0-474b-a47a-a970e57e3a83 tags:
``` python
all_entities = load_entities()
```
%% Output
Loading entities: 100%|███████████| 4182/4182 [00:05<00:00, 793.86it/s]
Loaded 20508 entities: 4350 places (21.21%) and 16158 persons (78.79%).
%% Cell type:code id:50d0c8ee-72ae-4e63-9d0a-7a7b1ceb4175 tags:
``` python
from ahisto_named_entity_search.document import Document, load_documents
```
%% Cell type:code id:6d0979d8-10f3-45b1-850c-f1ebdc1e893e tags:
``` python
documents = load_documents()
```
%% Output
Loading documents: 100%|████| 268669/268669 [00:06<00:00, 40200.07it/s]
%% Cell type:markdown id:5b17dd4c-409d-4b44-b416-80c3617b989d tags:
## Take a subset of entities for evaluation
%% Cell type:markdown id:b7f15c38-f1d4-4aa7-b048-f8e5dcd0284f tags:
Next, we will find the shortest and longest entities.
``` python
>>> def entity_length(entity): return len(str(entity))
```
We need to filter out entities that are too short, because they often contain only connectives, punctuation, numerals, etc.:
``` python
>>> sorted(all_entities, key=entity_length)[:58][-8:-3]
[Person: a, Person: (, Place: a, Person: hy, Person: XL]
>>> sorted(all_entities, key=entity_length)[:58][-3:]
[Person: Iva, Place: Háj, Person: Ota]
```
%% Cell type:code id:dca37c39-d0cc-401e-8186-6043fb2b5ea5 tags:
``` python
shortest_entities = [
    Person('Iva'),
    Place('Háj'),
    Person('Ota'),
]
```
%% Cell type:markdown id:59bae0e8-e361-4a22-9037-3ae7e589fb53 tags:
We also need to filter out entities that are too long, because they are often composed of several entities:
```python
>>> sorted(all_entities, key=entity_length)[-1]
Person: Coln, Ach, Meincz, Worms, Straßburg, Basel, Hagnaw und den andern stetten in Ellseßen, Zurich, Luczern, Solottern, Mulhawsen, Northawsen, Frankfurt, Geylnhawsen, Fridberg, Winsheim, Sweinfurt, Ulme und die mit in in einung sein, Costencz und die mit in in einung sein, Freyburg in Uechtland, Freyburg in Preisgew, Preisach, Newemburg, Augspurg, Regenspurg, Eger, Heylprunnen, Wimpfen, Erffurd.
>>> sorted(all_entities, key=entity_length)[-146:][:3]
[Person: Fridrichen marggrafen zu Brandemburg des heiligen Romischen reichs ercamrer und burggrafen zu Nuremberg,
Person: paní Kláře Zalcarové, měštce v Olomúci i její erbom i tomu, kdož by tento list měl s jejich dobrú vuolí,
Person: Markéta, vdova po Buškovi z Rýzmberku, Janem a Bohuslavem, jejími syny z Blažimi sezením na hradě Bubnu]
```
%% Cell type:code id:d8160b15-97a7-40a2-a917-566adb64d198 tags:
``` python
longest_entities = [
    Person('Fridrichen marggrafen zu Brandemburg des heiligen Romischen reichs ercamrer und burggrafen zu Nuremberg'),
    Person('paní Kláře Zalcarové, měštce v Olomúci i její erbom i tomu, kdož by tento list měl s jejich dobrú vuolí'),
    Person('Markéta, vdova po Buškovi z Rýzmberku, Janem a Bohuslavem, jejími syny z Blažimi sezením na hradě Bubnu'),
]
```
%% Cell type:markdown id:fe014004-946d-4a6f-9235-69c8f3c3858f tags:
Next, we will find entities written entirely in a single language, for several different languages:
``` python
>>> ! pip install langdetect
>>> from langdetect import detect_langs, LangDetectException
>>> from collections import defaultdict
>>> import random
>>> entities_languages = defaultdict(lambda: list())
>>> for entity in random.sample(all_entities, k=len(all_entities)):
...     try:
...         best_lang, *other_langs = detect_langs(str(entity))
...         if len(other_langs) == 0:
...             entities_languages[best_lang.lang].append(entity)
...     except LangDetectException:
...         continue
>>> entities_languages['de'][:3]
[Person: Bischoff zu Wirtzpurg,
Place: in Trebicz,
Person: Weissenburg in Bayern] # Notice the wrong designation as a Person instead of a Place.
>>> entities_languages['cs'][:3]
[Place: kromě rybníka v velikého v Slatinie,
Person: Václava Králíka z Buřenic, tehdy správce olomouckého kostela a antiochijského patriarchy,
Person: Břeňka z Drštky]
>>> entities_languages['it'][:3]
[Person: Imperatori Sigismundo,
Person: Wladislaus dei gratia rex Polonie et cetera,
Person: Brandou da Castiglione]
```
%% Cell type:code id:c9aaa298-08ec-47de-9820-8cf21d723495 tags:
``` python
german_entities = [
    Person('Bischoff zu Wirtzpurg'),
    Place('in Trebicz'),
    Person('Weissenburg in Bayern'),
]
```
%% Cell type:code id:edcf2409-a3bc-4c3b-82d1-5763f0c11df4 tags:
``` python
czech_entities = [
    Place('kromě rybníka v velikého v Slatinie'),
    Person('Václava Králíka z Buřenic, tehdy správce olomouckého kostela a antiochijského patriarchy'),
    Person('Břeňka z Drštky'),
]
```
%% Cell type:code id:fe240ec4-3500-4826-a5e9-743ab4a049fc tags:
``` python
latin_entities = [
    Person('Imperatori Sigismundo'),
    Person('Wladislaus dei gratia rex Polonie et cetera'),
    Person('Brandou da Castiglione'),
]
```
%% Cell type:markdown id:e3003d54-f165-436a-8919-76d33bf862d5 tags:
Next, we will find entities of different types:
``` python
>>> import random
>>> places = [entity for entity in sorted(all_entities) if isinstance(entity, Place)]
>>> persons = [entity for entity in sorted(all_entities) if isinstance(entity, Person)]
>>> random.choices(places, k=3)
[
'Kutná Hora',
'pražského kostela',
'Těšeticích',
]
>>> random.choices(persons, k=3)
[
'Aleš z Vrahovic',
'králem Zikmundem',
'husité',
]
```
%% Cell type:code id:55919a36-92ec-42bf-bd24-8cbb3f3221ff tags:
``` python
place_entities = [
    Place('Kutná Hora'),
    Place('pražského kostela'),
    Place('Těšeticích'),
]
```
%% Cell type:code id:128221e6-3f64-4444-8f2f-b20d13c8f931 tags:
``` python
person_entities = [
    Person('Aleš z Vrahovic'),
    Person('králem Zikmundem'),
    Person('husité'),
]
```
%% Cell type:markdown id:c516e8a2-efa2-4e53-95d6-ecfaf3f99a6f tags:
We will combine the sampled entities into a single list:
%% Cell type:code id:e41e7093-d443-4719-9da9-57e5c56473a9 tags:
``` python
entities = shortest_entities + longest_entities + german_entities + czech_entities + latin_entities + place_entities + person_entities
```
%% Cell type:code id:84a570e1-d38b-4d9f-9bf2-ad104260057f tags:
``` python
for entity in entities:
    assert entity in all_entities, entity
```
%% Cell type:code id:042059cc-26ca-4ce4-8760-a38dd99cd79b tags:
``` python
print(f'We sampled {len(entities)} entities.')
```
%% Output
We sampled 21 entities.
%% Cell type:markdown id:b91345dd tags:
We will also combine the sampled entities into a dict structured into categories and subcategories to be used in the evaluation.
%% Cell type:code id:3a905ba6 tags:
``` python
entity_categories = {
    'Length': {
        'Shortest': shortest_entities,
        'Longest': longest_entities,
    },
    'Language': {
        'German': german_entities,
        'Czech': czech_entities,
        'Latin': latin_entities,
    },
    'Type': {
        'Place': place_entities,
        'Person': person_entities,
    },
    'All': {
        None: entities,
    },
}
```
%% Cell type:code id:fc635651 tags:
``` python
unique_entities = set()
for category in entity_categories:
    for subcategory in entity_categories[category]:
        for entity in entity_categories[category][subcategory]:
            assert entity in all_entities, entity
            unique_entities.add(entity)
assert len(unique_entities) == len(entities), len(unique_entities)
```
%% Cell type:markdown id:60923e80-091b-4d0c-9042-f33de23a60bc tags:
We exclude the entity "Iva" from the overall evaluation, because it produces a large number of results that are affected by OCR errors.
%% Cell type:code id:e34044ce-c25e-4daa-a994-6abc62daf68f tags:
``` python
entity_categories['All'][None] = [
    entity
    for entity in entities
    if entity != Person('Iva')
]
```
%% Cell type:markdown id:1836aab7-3a18-4fd5-820e-b055304cf417 tags:
## Compute and save search results
After we have selected a subset of entities, we will search for the entities in the OCR texts. We will try a number of methods, some of which are inexpensive and will serve to select candidates for the expensive methods that produce the final results.
%% Cell type:code id:a5d2ebd8-de47-4212-ba46-8125dc99afaa tags:
``` python
from json import JSONDecodeError
```
%% Cell type:code id:c6180448-8477-48b6-a66f-a39f2d279326 tags:
``` python
from ahisto_named_entity_search.search import Search
from ahisto_named_entity_search.search import SearchResultList
```
%% Cell type:markdown id:4a5a8ca3-1480-4131-9d1f-d34dc758d140 tags:
### Inexpensive Methods
First, we will use a number of inexpensive and less accurate methods to select candidate results for the more expensive and accurate methods.
%% Cell type:markdown id:79052940-5e0a-49f8-9098-ea8096e13470 tags:
#### Jaccard Similarity
First, we will slide a window across the OCR texts and we will exhaustively compute the Jaccard index between each entity and each window. We will consider sets of character N-grams as well as sets of words. Due to the low number of OCR texts and the constant time complexity of the Jaccard index, this is computationally feasible. This method accurately detects the position of an entity in a text, but the precision of the Jaccard index on the semantic text similarity task is low.
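To make the procedure concrete, here is a minimal pure-Python sketch of the sliding-window character N-gram Jaccard search. The helper names `char_ngrams`, `jaccard`, and `best_window` are illustrative, not the actual API of `CharacterJaccardSimilarityIndex`:

``` python
def char_ngrams(text, n=3):
    """The set of character n-grams of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def best_window(entity, document, n=3):
    """Slide an entity-sized window over the document and return the
    (score, window) pair with the highest character n-gram Jaccard index."""
    width = len(entity)
    query = char_ngrams(entity, n)
    best = (0.0, '')
    for start in range(max(len(document) - width + 1, 1)):
        window = document[start:start + width]
        best = max(best, (jaccard(query, char_ngrams(window, n)), window))
    return best
```

A real index would keep the top-scoring windows of every document rather than only the single best one.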
%% Cell type:code id:97cec1d8-e74d-4f43-8a2e-57cac22b507f tags:
``` python
from ahisto_named_entity_search.index import CharacterJaccardSimilarityIndex
```
%% Cell type:code id:f76de940-a50f-4efa-b55e-ab8821d073e5 tags:
``` python
try:
    jaccard_similarity_character_results = SearchResultList.load('character-jaccard-similarity', entities)
except (IOError, JSONDecodeError):
    jaccard_similarity_character_results = Search(CharacterJaccardSimilarityIndex(documents.values())).search(entities)
    jaccard_similarity_character_results.save('character-jaccard-similarity')
print(jaccard_similarity_character_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/character-jaccard-similarity.json
Retrieved 210000 results for 21 entities (10000.00 on average, 10000 at minimum) in 23 hours using CharacterJaccardSimilarityIndex.
%% Cell type:code id:3dd595c2-6384-4faf-ba8b-902fee7660b8 tags:
``` python
from ahisto_named_entity_search.index import WordJaccardSimilarityIndex
```
%% Cell type:code id:02328c7a-fd02-4d34-9ff7-21e8d99432b6 tags:
``` python
try:
    jaccard_similarity_word_results = SearchResultList.load('word-jaccard-similarity', entities)
except (IOError, JSONDecodeError):
    jaccard_similarity_word_results = Search(WordJaccardSimilarityIndex(documents.values())).search(entities)
    jaccard_similarity_word_results.save('word-jaccard-similarity')
print(jaccard_similarity_word_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/word-jaccard-similarity.json
Retrieved 137169 results for 21 entities (6531.86 on average, 7 at minimum) in 23 hours using WordJaccardSimilarityIndex.
%% Cell type:markdown id:97364dae-a483-4152-8582-de58f39be3d6 tags:
#### Okapi BM25
Next, we will again slide a window across the OCR text and we will extract and index passages using an Okapi BM25 vector space model. Due to the low number of OCR texts and the constant time complexity of vector space models, this is computationally feasible. This method accurately detects the position of an entity in a text, but Okapi BM25 requires tokenization into words, which results in poor precision on OCR texts.
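The scoring behind this step can be sketched in a few lines. The whitespace tokenization and the parameters `k1` and `b` below are typical defaults, not necessarily what `OkapiBM25Index` uses, and the IDF is smoothed with `+ 1` to keep it non-negative:

``` python
import math
from collections import Counter

def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score tokenized passages against a tokenized query with Okapi BM25."""
    N = len(passages)
    avgdl = sum(len(p) for p in passages) / N
    df = Counter(term for p in passages for term in set(p))
    scores = []
    for p in passages:
        tf = Counter(p)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += (idf * tf[term] * (k1 + 1)
                      / (tf[term] + k1 * (1 - b + b * len(p) / avgdl)))
        scores.append(score)
    return scores
```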
%% Cell type:code id:a86ca6d5-4b85-4372-8f20-8f3ad3499c92 tags:
``` python
from ahisto_named_entity_search.index import OkapiBM25Index
```
%% Cell type:code id:2bf4a9c7-6b87-4d99-ab17-ae5436f0414b tags:
``` python
try:
    okapi_bm25_results = SearchResultList.load('okapi-bm25', entities)
except (IOError, JSONDecodeError):
    okapi_bm25_results = Search(OkapiBM25Index(documents.values())).search(entities)
    okapi_bm25_results.save('okapi-bm25')
print(okapi_bm25_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/okapi-bm25.json
Retrieved 137169 results for 21 entities (6531.86 on average, 7 at minimum) in 17 hours using OkapiBM25Index.
%% Cell type:markdown id:dd12d5bf-325f-4eac-bedb-39b7014a3d3b tags:
#### Fuzzy Regexes
Next, we will try to find exact and almost-exact matches of the entities in the OCR texts using [fuzzy regexes][1]. Fuzzy regexes offer high precision, but low recall, since only exact and almost-exact character-level matches will be found. Unlike the Jaccard index, fuzzy regexes don't use a sliding window: a properly-sized window with a match is automatically found in a full document. Almost-exact matches can be produced by allowing a small number of errors; allowing a larger number of errors is not computationally feasible, because the time complexity is quadratic in the number of errors. To ensure that almost-exact matches can be found even for long entities, we divide them into shorter spans of text and search for those separately.
[1]: https://pypi.org/project/regex/#approximate-fuzzy-matching-hg-issue-12-hg-issue-41-hg-issue-109
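The notebook uses the `regex` package for this; as a pure-Python illustration of the underlying idea (far slower than `regex`, with hypothetical helper names), an almost-exact match can be found by sliding an entity-sized window and accepting windows within a small edit distance:

``` python
def levenshtein(a, b):
    """Edit distance between two strings via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,               # deletion
                               current[-1] + 1,               # insertion
                               previous[j - 1] + (x != y)))   # substitution
        previous = current
    return previous[-1]

def fuzzy_find(entity, text, max_errors=2):
    """Return (errors, window) for the closest almost-exact match, or None."""
    width = len(entity)
    best = None
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        errors = levenshtein(entity, window)
        if errors <= max_errors and (best is None or errors < best[0]):
            best = (errors, window)
    return best
```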
%% Cell type:code id:59dd3ecf-ad9b-4f01-9f0a-2b8531bb176d tags:
``` python
from ahisto_named_entity_search.index import FuzzyRegexIndex
```
%% Cell type:code id:6d32f33a-7351-424a-997d-951693ba6c4e tags:
``` python
try:
    fuzzy_regex_results = SearchResultList.load('fuzzy-regex', entities)
except (IOError, JSONDecodeError):
    fuzzy_regex_results = Search(FuzzyRegexIndex(documents.values())).search(entities)
    fuzzy_regex_results.save('fuzzy-regex')
print(fuzzy_regex_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/fuzzy-regex.json
Retrieved 34416 results for 21 entities (1638.86 on average, 0 at minimum) in a day using FuzzyRegexIndex.
%% Cell type:markdown id:d32c1040-dbbe-4758-a4ef-6718cb8adc2c tags:
#### Manatee
Finally, we will try to find exact and almost-exact matches of the entities in the OCR texts using the boolean retrieval search engine of [Manatee][1]. Manatee offers high precision, but low recall, since only exact and almost-exact matches will be found. Unlike the Jaccard index and like fuzzy regexes, Manatee does not use a sliding window: a properly-sized window with a match is automatically found in a full document using a positional inverted index. Almost-exact matches are produced by applying Czech lemmatization to both the queries and the documents.
[1]: https://nlp.fi.muni.cz/trac/noske#manatee
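To illustrate why a positional inverted index makes phrase lookups fast, here is a toy sketch of boolean phrase retrieval. This mimics only the basic mechanism; Manatee's actual index, query language, and Czech lemmatization are far richer:

``` python
from collections import defaultdict

def build_positional_index(documents):
    """Map each token to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return index

def phrase_search(index, phrase):
    """Find documents containing the tokens of `phrase` consecutively."""
    tokens = phrase.lower().split()
    if not tokens or tokens[0] not in index:
        return set()
    candidates = set(index[tokens[0]])
    for offset, token in enumerate(tokens[1:], 1):
        # Shift each occurrence back by its offset so consecutive tokens align.
        positions = {(d, p - offset) for d, p in index.get(token, [])}
        candidates &= positions
    return {doc_id for doc_id, _ in candidates}
```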
%% Cell type:code id:84b6f089-c421-45b9-9818-95a8f990cbe9 tags:
``` python
from ahisto_named_entity_search.index import RemoteManateeIndex
```
%% Cell type:code id:40d317f4-2de2-42ba-90a8-5b33de25dd5c tags:
``` python
try:
    manatee_results = SearchResultList.load('manatee', entities)
except (IOError, JSONDecodeError):
    manatee_results = Search(RemoteManateeIndex(documents.values())).search(entities)
    manatee_results.save('manatee')
print(manatee_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/manatee.json
Retrieved 678 results for 21 entities (32.29 on average, 0 at minimum) in 46 seconds using RemoteManateeIndex.
%% Cell type:markdown id:457b3ff1-a15e-4c80-b0d7-4ad5da37c7bf tags:
### Expensive Methods
Next, we will use the results produced by the inexpensive and less accurate methods as candidate results to be reranked by the expensive and accurate methods:
%% Cell type:code id:3e78a7ac-1c34-4bb1-8f15-84d388bd5da2 tags:
``` python
inexpensive_candidates = [
    jaccard_similarity_character_results,
    jaccard_similarity_word_results,
    okapi_bm25_results,
    fuzzy_regex_results,
    manatee_results,
]
```
%% Cell type:markdown id:f2ceb49e-52cb-48d2-9290-d6cd85f0d473 tags:
#### Edit Distance
First, we will rerank the candidate results using the character and word error rate. This is necessary, because the edit distance has quadratic time complexity in the text size, which makes it extremely *expensive*. If at least some of the candidate results are representative, the edit distance offers moderate precision on the semantic text similarity task.
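As a concrete illustration, the reranking can be sketched in a few lines of pure Python. The names `edit_distance` and `rerank` are hypothetical, not the API of `CharacterEditSimilarityIndex` or `WordEditSimilarityIndex`; the same `edit_distance` yields the character error rate on strings and the word error rate on token lists:

``` python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,               # deletion
                               current[-1] + 1,               # insertion
                               previous[j - 1] + (x != y)))   # substitution
        previous = current
    return previous[-1]

def rerank(entity, candidate_windows):
    """Order candidate windows by character error rate, best first."""
    def cer(window):
        return edit_distance(entity, window) / len(entity)
    return sorted(candidate_windows, key=cer)
```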
%% Cell type:code id:fe9f31ae-01e1-4b67-bcd7-4a131a4ccd2e tags:
``` python
from ahisto_named_entity_search.index import CharacterEditSimilarityIndex
```
%% Cell type:code id:f1446ccd-2916-4dfb-8bf0-a04cd05a04d4 tags:
``` python
try:
    edit_similarity_character_results = SearchResultList.load('character-edit-similarity', entities)
except (IOError, JSONDecodeError):
    edit_similarity_character_results = Search(CharacterEditSimilarityIndex(inexpensive_candidates)).search(entities)
    edit_similarity_character_results.save('character-edit-similarity')
print(edit_similarity_character_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/character-edit-similarity.json
Retrieved 209478 results for 21 entities (9975.14 on average, 9514 at minimum) in 5 minutes using CharacterEditSimilarityIndex.
%% Cell type:code id:04dcef80-9df7-497a-86a5-8c0db9242457 tags:
``` python
from ahisto_named_entity_search.index import WordEditSimilarityIndex
```
%% Cell type:code id:b28111da-5f74-4d07-a7ce-cd8d99f85889 tags:
``` python
try:
    edit_similarity_word_results = SearchResultList.load('word-edit-similarity', entities)
except (IOError, JSONDecodeError):
    edit_similarity_word_results = Search(WordEditSimilarityIndex(inexpensive_candidates)).search(entities)
    edit_similarity_word_results.save('word-edit-similarity')
print(edit_similarity_word_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/word-edit-similarity.json
Retrieved 131398 results for 21 entities (6257.05 on average, 7 at minimum) in a minute using WordEditSimilarityIndex.
%% Cell type:markdown id:aab2ceda-543b-4c0e-9bb3-fd641ea200f9 tags:
#### BERTScore
Next, we will also rerank the candidate results using the [BERT F-Score][1]. Although originally designed for neural machine translation evaluation, the BERT F₁-Score is a symmetric similarity measure. Unlike the edit distance, which still measures mostly syntactic similarity, the BERT F₁-Score offers excellent precision on the semantic text similarity task.
[1]: https://arxiv.org/abs/1904.09675
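In essence, BERTScore greedily matches each token of one text to its most similar token of the other and averages the cosine similarities. A sketch over precomputed token embeddings follows; the real measure uses contextual BERT embeddings and optional IDF weighting, so plain vectors stand in here:

``` python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def bert_f1(reference_vectors, candidate_vectors):
    """Greedy-matching F1 between two lists of token embeddings."""
    recall = sum(max(cosine(r, c) for c in candidate_vectors)
                 for r in reference_vectors) / len(reference_vectors)
    precision = sum(max(cosine(c, r) for r in reference_vectors)
                    for c in candidate_vectors) / len(candidate_vectors)
    return 2 * precision * recall / (precision + recall)
```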
%% Cell type:code id:cff2be7b-5756-4e40-b75f-6c663733124c tags:
``` python
from ahisto_named_entity_search.index import BERTScoreIndex
```
%% Cell type:code id:157a8125-1738-4cef-adba-1d5ded695908 tags:
``` python
try:
    bert_score_results = SearchResultList.load('bert-score', entities)
except (IOError, JSONDecodeError):
    import transformers
    transformers.logging.set_verbosity_error()
    bert_score_results = Search(BERTScoreIndex(inexpensive_candidates), num_workers=6).search(entities)
    bert_score_results.save('bert-score')
print(bert_score_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/bert-score.json
Retrieved 210000 results for 21 entities (10000.00 on average, 10000 at minimum) in 9 minutes using BERTScoreIndex.
%% Cell type:markdown id:c44ba491-a278-4bcf-8097-5d8ca3fb3d14 tags:
#### SentenceBERT
%% Cell type:markdown id:46bcdfd6-380a-443c-bb1d-461d3a87f478 tags:
Finally, we will also rerank the candidate results using the cosine similarity between [siamese BERT][1] embeddings. Like the BERT F₁-Score, the siamese BERT embeddings offer excellent precision on the semantic text similarity task.
[1]: https://arxiv.org/abs/1908.10084
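Conceptually, each text is pooled into a single fixed-size embedding and candidates are ranked by cosine similarity. Here is a minimal sketch with stand-in token vectors; the actual index embeds texts with a siamese BERT model rather than averaging arbitrary vectors:

``` python
import math

def mean_pool(token_vectors):
    """Average token embeddings into one fixed-size sentence embedding."""
    dim = len(token_vectors[0])
    return [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(dim)]

def cosine_similarity(u, v):
    """Cosine similarity of two sentence embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```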
%% Cell type:code id:693e1458-cb0f-44c7-8eec-b90155d7a6cb tags:
``` python
from ahisto_named_entity_search.index import SentenceBERTSimilarityIndex
```
%% Cell type:code id:ce91d87e-f148-4cd2-b621-8c0700e662f7 tags:
``` python
try:
    sentence_bert_similarity_results = SearchResultList.load('sbert-similarity', entities)
except (IOError, JSONDecodeError):
    logging.getLogger('sentence_transformers.SentenceTransformer').setLevel(logging.WARNING)
    sentence_bert_similarity_results = Search(SentenceBERTSimilarityIndex(inexpensive_candidates), num_workers=1).search(entities)
    sentence_bert_similarity_results.save('sbert-similarity')
print(sentence_bert_similarity_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/sbert-similarity.json
Retrieved 210000 results for 21 entities (10000.00 on average, 10000 at minimum) in 11 minutes using SentenceBERTSimilarityIndex.
%% Cell type:markdown id:ee7d17da-0818-4f67-ab76-e29ed04ef2db tags:
### Rank Fusion
Finally, we will use rank fusion to combine the results of all the above methods.
%% Cell type:code id:2eb19abd-b0c3-4196-b517-6f4781706689 tags:
``` python
expensive_candidates = [
edit_similarity_character_results,
edit_similarity_word_results,
bert_score_results,
sentence_bert_similarity_results,
]
```
%% Cell type:code id:350fa8a5-0fa7-43a8-826b-f2f4bbc709f2 tags:
``` python
candidates = inexpensive_candidates + expensive_candidates
```
%% Cell type:markdown id:c26f01d3-146d-4587-8c71-40df6796e8dc tags:
#### Reciprocal Rank Fusion
First, we will use [the reciprocal rank fusion (RRF)][1] to combine the results of all the above methods.
[1]: https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf
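RRF scores each document as $\sum_i 1/(k + r_i(d))$, where $r_i(d)$ is the document's rank in the $i$-th result list. A minimal sketch with the customary $k = 60$ follows; whether `ReciprocalRankFusionIndex` uses the same constant is an assumption:

``` python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked result lists: each document scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, document in enumerate(ranking, start=1):
            scores[document] = scores.get(document, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```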
%% Cell type:code id:d2bcccd7-2904-4d1b-adba-b574eeeeb81a tags:
``` python
from ahisto_named_entity_search.index import ReciprocalRankFusionIndex
```
%% Cell type:code id:498988f9-b0fd-4512-b8fe-804042398572 tags:
``` python
try:
    reciprocal_rank_fusion_results = SearchResultList.load('reciprocal-rank-fusion', entities)
except (IOError, JSONDecodeError):
    reciprocal_rank_fusion_results = Search(ReciprocalRankFusionIndex(candidates)).search(entities)
    reciprocal_rank_fusion_results.save('reciprocal-rank-fusion')
print(reciprocal_rank_fusion_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/reciprocal-rank-fusion.json
Retrieved 210000 results for 21 entities (10000.00 on average, 10000 at minimum) in 29 seconds using ReciprocalRankFusionIndex.
%% Cell type:markdown id:100197d8-f520-4cfc-a58a-752fa275fc6b tags:
#### Concatenation
%% Cell type:markdown id:4a145fef-fcbe-49f7-b8d7-b28df3043808 tags:
Next, we will concatenate the results of the highly precise fuzzy regexes and the results of the RRF. In other words: If there are results from the fuzzy regexes, we will return them. Otherwise, we will fall back to the results of the RRF, which can be more of a mixed bag.
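The fallback logic can be sketched per entity as follows; the dict-of-lists shape is an assumption for illustration, not the actual interface of `ConcatenatedIndex`:

``` python
def concatenated_results(fuzzy, fused):
    """Prefer fuzzy-regex results per entity; fall back to RRF results."""
    return {entity: fuzzy.get(entity) or fused.get(entity, [])
            for entity in set(fuzzy) | set(fused)}
```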
%% Cell type:code id:1d66a38e-3d32-4022-b2af-25d7c88bdb8a tags:
``` python
from ahisto_named_entity_search.index import ConcatenatedIndex
```
%% Cell type:code id:f4f504b8-6d68-422a-8b91-0478358f31a1 tags:
``` python
try:
    concatenated_index_results = SearchResultList.load('fuzzy-regex-and-reciprocal-rank-fusion', entities)
except (IOError, JSONDecodeError):
    concatenated_index_results = Search(ConcatenatedIndex([fuzzy_regex_results, reciprocal_rank_fusion_results])).search(entities)
    concatenated_index_results.save('fuzzy-regex-and-reciprocal-rank-fusion')
print(concatenated_index_results)
```
%% Output
Loaded /nlp/projekty/ahisto/public_html/named-entity-search/results/fuzzy-regex-and-reciprocal-rank-fusion.json
Retrieved 210000 results for 21 entities (10000.00 on average, 10000 at minimum) in 23 seconds using ConcatenatedIndex.
%% Cell type:markdown id:46b23d2c-911c-4113-a17f-98f1766a338f tags:
## Annotate the search results
After we have gathered all the search results, we will produce a spreadsheet and pass it to annotators.
%% Cell type:code id:9d9652c7-de6e-4400-a933-0617e066487e tags:
``` python
all_results = candidates + [reciprocal_rank_fusion_results, concatenated_index_results]
```
%% Cell type:code id:981463f1-ca46-41fa-b62a-7b11e617c3e0 tags:
``` python
from ahisto_named_entity_search.search import Annotations
```
%% Cell type:code id:99e73fca-9809-4c06-8881-4c57fd47001a tags:
``` python
Annotations.create_annotation_template('annotation-template.xlsx', entities, all_entities, all_results)
```
%% Cell type:markdown id:448dc212 tags:
## Evaluate the search results
Using the annotations, we can now determine how capable the methods are in terms of precision and recall. To pick the best search engine, we compute [the $F_\beta$-score][1], a weighted harmonic mean of precision and recall, with $\beta = 0.25$ so that precision is given four times more weight than recall.
[1]: https://en.wikipedia.org/wiki/F-score
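The score follows the standard definition $F_\beta = (1+\beta^2)\,PR/(\beta^2 P + R)$, which a small helper makes explicit:

``` python
def f_beta(precision, recall, beta=0.25):
    """F_beta score: beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```

With $\beta = 0.25$, a high-precision system outscores an equally unbalanced high-recall one.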
%% Cell type:code id:396a412d tags:
``` python
annotations = Annotations('annotations.xlsx', entities, documents)
```
%% Cell type:code id:1a472062 tags:
``` python
from collections import defaultdict
from typing import Dict, Tuple, List, Optional
from ahisto_named_entity_search.search import Evaluation
```
%% Cell type:code id:dab15c7b tags:
``` python
precisions: Dict[str, List[float]] = defaultdict(list)
recalls: Dict[str, List[float]] = defaultdict(list)
f_scores: Dict[str, List[float]] = defaultdict(list)
index: List[Tuple[str, Optional[str]]] = list()
for category in entity_categories:
    for subcategory in entity_categories[category]:
        index.append((category, subcategory))
        for result_list in all_results:
            evaluation = Evaluation(result_list, annotations, entity_categories[category][subcategory])
            precisions[result_list.index_name].append(evaluation.precision)
            recalls[result_list.index_name].append(evaluation.recall)
            f_scores[result_list.index_name].append(evaluation.f_score)
```
%% Cell type:code id:65c9af8e tags:
``` python
import pandas as pd
from pandas import DataFrame, MultiIndex
```
%% Cell type:code id:e7eaec9b tags:
``` python
columns = MultiIndex.from_tuples(index, names=('Category', 'Subcategory'))
```
%% Cell type:markdown id:add927f3-bb63-42db-b4b9-c2e32a38c5d6 tags:
### Precision
Manatee achieved the best precision in all categories except *Longest*, where it retrieved no results. It also achieved perfect 100% precision in all categories except *Longest* and *Shortest*. The second and third best systems on average in terms of precision are fuzzy regexes and Jaccard similarity.
%% Cell type:code id:694d6575-75d7-480d-9b1e-f8a456fcc789 tags:
``` python
precision_df = DataFrame.from_dict(precisions, orient='index', columns=columns)
precision_df = precision_df.sort_values(by=('All', None), ascending=False)
precision_df = precision_df.apply(lambda data: 100.0 * data)
precision_df.style.format('{:.2f}%')
```
%% Output
<pandas.io.formats.style.Styler at 0x7f9090780040>
%% Cell type:code id:a831c8e2-7b24-440a-bab1-e821ae0de252 tags:
``` python
_ = precision_df.T.plot.bar(title='Precision', rot=15, figsize=(15, 5), grid=True).legend(loc=(1.01, 0.44))
```
%% Output
%% Cell type:markdown id:c5aeedb2-7649-4580-8070-12e7991c92c5 tags:
### Recall
Edit distance achieved the best recall, closely followed by fuzzy regexes concatenated with reciprocal rank fusion, the BERT F₁-Score, and a tie between fuzzy regexes and SentenceBERT embeddings.
%% Cell type:code id:7d4bea8a-1216-4d4c-9263-92516741210e tags:
``` python
recall_df = DataFrame.from_dict(recalls, orient='index', columns=columns)
recall_df = recall_df.sort_values(by=('All', None), ascending=False)
recall_df = recall_df.apply(lambda data: 100.0 * data)
recall_df.style.format('{:.2f}%')
```
%% Output
<pandas.io.formats.style.Styler at 0x7f90888f1040>
%% Cell type:code id:b0b43a3f-ee45-4170-8425-f697e647d4ba tags:
``` python
_ = recall_df.T.plot.bar(title='Recall', rot=15, figsize=(15, 5), grid=True).legend(loc=(1.01, 0.44))
```
%% Output
%% Cell type:markdown id:947bddd4-a621-4361-8562-5c16197dfcb9 tags:
### $F_\beta$-score
Manatee achieved the best $F_\beta$-score, placing well ahead of the second-place fuzzy regexes, which were in turn closely followed by the edit distance.
%% Cell type:code id:efc953b4-d7de-4877-b2c9-183fa014b9e8 tags:
``` python
f_score_df = DataFrame.from_dict(f_scores, orient='index', columns=columns)
f_score_df = f_score_df.sort_values(by=('All', None), ascending=False)
f_score_df = f_score_df.apply(lambda data: 100.0 * data)
f_score_df.style.format('{:.2f}%')
```
%% Output
<pandas.io.formats.style.Styler at 0x7f90796a16d0>
%% Cell type:code id:a921ef67-303f-47e1-bd84-d5f01cf0fce6 tags:
``` python
_ = f_score_df.T.plot.bar(title=r'$F_\beta$-score', rot=15, figsize=(15, 5), grid=True).legend(loc=(1.01, 0.44))
```
%% Output
%% Cell type:markdown id:5ddc1a0d-baa4-445c-ae2c-b4e5225ec87d tags:
We decided to use Manatee to produce our dataset for training named entity recognition models. As our second choice, we will use fuzzy regexes due to their higher recall.
%% Cell type:code id:66d83848-4212-4201-8222-dbfca803c82b tags:
``` python
```