"We decided to use Manatee to produce our dataset for training named entity recognition models. As our second choice, we will use fuzzy regexes."
"We decided to use Manatee to produce our dataset for training named entity recognition models. As our second choice, we will use fuzzy regexes due to their higher recall."
In this notebook, we will evaluate how capable different methods are at finding named entities in OCR document texts. At the end of the notebook, we will select a method that is sufficiently fast and accurate.
In the evaluation, we exclude the entity "Iva" from the overall evaluation, because it produces a large number of results that are affected by OCR errors.
After we have selected a subset of entities, we will search for the entities in the OCR texts. We will try a number of methods, some of which will be inexpensive and will serve to select candidate results for the expensive methods that will produce the final results.
First, we will slide a window across the OCR texts and exhaustively compute the Jaccard index between each entity and each window. We will consider sets of character N-grams as well as sets of words. Due to the low number of OCR texts and the constant time complexity of the Jaccard index for a fixed window size, this is computationally feasible. This method accurately detects the position of an entity in a text, but the precision of the Jaccard index on the semantic text similarity task is low.
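A minimal sketch of the exhaustive sliding-window search follows; the N-gram order, the window stride, and the choice to size the window by the entity length are our own illustrative assumptions:

```python
def char_ngrams(text, n=3):
    """Return the set of character N-grams of a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_window(entity, ocr_text, stride=5):
    """Slide a window the size of the entity across the OCR text and return
    the (start, end, score) of the window with the highest Jaccard index."""
    size, entity_ngrams = len(entity), char_ngrams(entity)
    best = (0, 0, 0.0)
    for start in range(0, max(1, len(ocr_text) - size + 1), stride):
        score = jaccard(entity_ngrams, char_ngrams(ocr_text[start:start + size]))
        if score > best[2]:
            best = (start, start + size, score)
    return best
```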
Next, we will again slide a window across the OCR texts, and we will extract and index the passages using an Okapi BM25 vector space model. Due to the low number of OCR texts and the constant time complexity of querying a vector space model with a fixed passage size, this is computationally feasible. This method accurately detects the position of an entity in a text, but Okapi BM25 requires tokenization into words, which results in poor precision on OCR texts.
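A minimal sketch of the passage extraction and retrieval, assuming the `rank_bm25` package; the passage size and stride are illustrative choices:

```python
from rank_bm25 import BM25Okapi

def extract_passages(ocr_text, size=30, stride=15):
    """Extract overlapping word-level passages from an OCR text."""
    words = ocr_text.split()
    return [words[i:i + size]
            for i in range(0, max(1, len(words) - size + 1), stride)]

def best_passage(entity, ocr_text):
    """Index the passages of a document and return the passage with the
    highest Okapi BM25 score for the entity used as a query."""
    passages = extract_passages(ocr_text)
    bm25 = BM25Okapi(passages)
    scores = bm25.get_scores(entity.split())
    return max(zip(passages, scores), key=lambda pair: pair[1])[0]
```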
Next, we will try to find exact and almost-exact matches of the entities in the OCR texts using [fuzzy regexes][1]. Fuzzy regexes offer high precision but low recall, since only exact and almost-exact character-level matches will be found. Unlike the Jaccard index, fuzzy regexes don't use a sliding window: a properly-sized window with a match is automatically found in a full document. Almost-exact matches can be produced by allowing a small number of errors; allowing a larger number of errors is not computationally feasible, because the time complexity is quadratic in the number of errors. To ensure that almost-exact matches can be found even for long entities, we divide them into shorter spans of text and search for these spans separately.
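A minimal sketch of the fuzzy search, assuming the third-party `regex` package, which supports fuzzy matching; the budget of two errors is an illustrative choice:

```python
import regex

def fuzzy_find(entity, ocr_text, max_errors=2):
    """Return the span and text of the best almost-exact match of an entity,
    allowing up to `max_errors` character-level edits, or None if not found."""
    pattern = r'(?:%s){e<=%d}' % (regex.escape(entity), max_errors)
    match = regex.search(pattern, ocr_text, flags=regex.BESTMATCH)
    return (match.span(), match.group()) if match else None
```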
Finally, we will try to find exact and almost-exact matches of the entities in the OCR texts using the boolean retrieval search engine of [Manatee][1]. Manatee offers high precision, but low recall, since only exact and almost-exact matches will be found. Unlike the Jaccard index and like fuzzy regexes, Manatee does not use a sliding window: a properly-sized window with a match is automatically found in a full document using a positional inverted index. Almost-exact matches are produced by applying Czech lemmatization to both the queries and the documents.
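A minimal sketch of how such a query could be constructed; the `lemmatize` function is a hypothetical placeholder for a Czech lemmatizer, and we omit the Manatee API calls that would evaluate the resulting CQL query against an indexed corpus:

```python
def to_cql(entity, lemmatize):
    """Translate an entity into a CQL query that matches a sequence of
    corpus positions whose lemmas equal the lemmas of the entity's words.
    `lemmatize` is a hypothetical stand-in for a Czech lemmatizer."""
    return ' '.join('[lemma="%s"]' % lemmatize(word) for word in entity.split())

# A two-word entity "w1 w2" becomes the query '[lemma="l1"] [lemma="l2"]',
# where l1 and l2 are the lemmas of w1 and w2.
```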
Next, we will use the results produced by the inexpensive and less accurate methods as candidate results to be reranked by the expensive and accurate methods:
First, we will rerank the candidate results using the character and word error rates. This is necessary because the edit distance has time complexity quadratic in the text length, which makes it extremely *expensive*. If at least some of the candidate results are representative, the edit distance offers moderate precision on the semantic text similarity task.
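A minimal sketch of the criterion: a standard dynamic-programming edit distance, whose quadratic table makes the cost apparent; the word error rate is obtained analogously by applying the same distance to lists of words:

```python
def edit_distance(a, b):
    """Levenshtein distance computed with an O(len(a) * len(b)) table;
    also works on lists of words, which yields the word error rate."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                        # deletion
                               current[j - 1] + 1,                     # insertion
                               previous[j - 1] + (item_a != item_b)))  # substitution
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))
```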
Next, we will also rerank the candidate results using the [BERT F₁-Score][1]. Although originally designed for the evaluation of neural machine translation, the BERT F₁-Score is a symmetric similarity measure. Unlike the edit distance, which still measures mostly syntactic similarity, the BERT F₁-Score offers excellent precision on the semantic text similarity task.
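A minimal sketch of the reranking, assuming the `bert_score` package; delegating the model choice to its default for Czech (`lang="cs"`) is an illustrative decision:

```python
from bert_score import score

def rerank_with_bertscore(entity, candidates):
    """Order candidate results by their BERT F₁-Score against the entity."""
    _, _, f1 = score(candidates, [entity] * len(candidates), lang="cs")
    return sorted(zip(candidates, f1.tolist()), key=lambda pair: -pair[1])
```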
Finally, we will also rerank the candidate results using the cosine similarity between [siamese BERT][1] embeddings. Like the BERT F₁-Score, the siamese BERT embeddings offer excellent precision on the semantic text similarity task.
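A minimal sketch of the reranking, assuming the `sentence-transformers` package; the multilingual model name is an illustrative choice:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def rerank_with_embeddings(entity, candidates):
    """Order candidate results by the cosine similarity between their
    embeddings and the embedding of the entity."""
    embeddings = model.encode([entity] + candidates)
    similarities = util.cos_sim(embeddings[0], embeddings[1:])[0]
    return sorted(zip(candidates, similarities.tolist()),
                  key=lambda pair: -pair[1])
```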
Next, we will concatenate the results of the highly precise fuzzy regexes and the results of reciprocal rank fusion (RRF). In other words: if there are results from the fuzzy regexes, we will return them; otherwise, we will fall back to the results of the RRF, which can be more of a mixed bag.
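A minimal sketch of both the fusion and the fallback; the RRF constant $k = 60$ follows common practice:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings; each result scores the sum of 1 / (k + rank)
    over the rankings in which it appears."""
    scores = {}
    for ranking in rankings:
        for rank, result in enumerate(ranking, start=1):
            scores[result] = scores.get(result, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def combined_results(fuzzy_regex_results, rrf_results):
    """Return the highly precise fuzzy regex results if there are any;
    otherwise, fall back to the RRF results."""
    return fuzzy_regex_results if fuzzy_regex_results else rrf_results
```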
Using the annotations, we can now determine how capable the methods are in terms of precision and recall. To select the best method, we compute the weighted harmonic mean of precision and recall, the [$F_\beta$-score][1], where $\beta = 0.25$ so that precision is given four times the weight of recall.
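A minimal sketch of the measure:

```python
def f_beta(precision, recall, beta=0.25):
    """Weighted harmonic mean of precision and recall; beta < 1 favors
    precision (beta = 0.25 gives it four times the weight of recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```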
Manatee achieved the best precision in all categories except *Longest*, where it retrieved no results. It also achieved perfect 100% precision in all categories except *Longest* and *Shortest*. On average, the second and third most precise systems were fuzzy regexes and the Jaccard similarity.
The edit distance achieved the best recall, closely followed by fuzzy regexes concatenated with reciprocal rank fusion, the BERT F₁-Score, and a tie between fuzzy regexes and SentenceBERT embeddings.
We decided to use Manatee to produce our dataset for training named entity recognition models. As our second choice, we will use fuzzy regexes due to their higher recall.