# Alternative Second Term Project: ARQMath Collection, Answer Retrieval Task
“In a recent study, Mansouri et al. found that 20% of mathematical queries in a general-purpose search engine were expressed as well-formed questions, a rate ten times higher than that for all queries submitted. Results such as these and the presence of Community Question Answering sites such as Math Stack Exchange suggest there is interest in finding answers to mathematical questions posed in natural language, using both text and mathematical notation.” [1]
“[ARQMath](https://www.cs.rit.edu/~dprl/ARQMath/) is a co-operative evaluation exercise aiming to advance math-aware search and the semantic analysis of mathematical notation and texts. **ARQMath is being run for the second time at CLEF 2021.** An overview paper (including results) from ARQMath 2020 is available along with participant papers in the [CLEF 2020 working notes](http://ceur-ws.org/Vol-2696).” [2]
Your tasks, reviewed by your colleagues and the course instructors, are the following:
1. *Implement a supervised ranked retrieval system*, [3, Chapter 15] which will produce a list of answers from the ARQMath collection in descending order of relevance to a query from the answer retrieval task of ARQMath 2020. You SHOULD use training and validation relevance judgements from the ARQMath collection in your information retrieval system. Test judgements MUST only be used for the evaluation of your information retrieval system.
2. *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises. *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).
3. *Reach at least 1.2% mean average precision* [3, Section 8.4] with your system on the ARQMath collection. You are encouraged to use techniques for tokenization [3, Section 2.2], document representation [3, Section 6.4], tolerant retrieval [3, Chapter 3], relevance feedback and query expansion [3, Chapter 9], learning to rank [3, Chapter 15], and others discussed in the course.
4. _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).
The best student systems will enter the ARQMath competition and help develop the new search engine for [the Math StackExchange question answering forum](http://math.stackexchange.com/). This is not only useful, but also a nice reference for your CVs!
%% Cell type:markdown id: tags:
[1] Zanibbi, R. et al. [Overview of ARQMath 2020 (Updated Working Notes Version): CLEF Lab on Answer Retrieval for Questions on Math](http://ceur-ws.org/Vol-2696/paper_271.pdf). In: *Working Notes of CLEF 2020-Conference and Labs of the Evaluation Forum*. 2020.
[2] Zanibbi, R. et al. [*ARQMath: Answer Retrieval for Questions on Math*](https://www.cs.rit.edu/~dprl/ARQMath/index.html). Rochester Institute of Technology. 2021.
[3] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to information retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge university press, 2008.
%% Cell type:markdown id: tags:
## Loading the ARQMath collection
First, we will install [our library](https://gitlab.fi.muni.cz/xstefan3/pv211-utils) and load the ARQMath collection. If you are interested, you can take a peek at [how we preprocessed the raw ARQMath collection](https://colab.research.google.com/drive/1WiLLDIBaOteLevNdxVonQrYSW919ckwO) to the final dataset that we will be using.
The questions and answers from the ARQMath collection, and the queries from the answer retrieval task of ARQMath 2020, contain both text and mathematical formulae. We have prepared several encodings of the text and mathematical formulae, which you can choose from:
- `text` – Plain text, which contains no mathematical formulae. *Nice and easy*, but you lose all information about the math:
> Finding value of such that ...
- `text+latex` – Plain text with mathematical formulae in LaTeX, surrounded by dollar signs. Still quite nice to work with:
> Finding value of \$c\$ such that ...
- `text+prefix` – Plain text with mathematical formulae in [the prefix format][1]. Unlike LaTeX, which encodes how a mathematical formula looks, the prefix format encodes the semantic content of the formulae using [the Polish notation][2]:
> Finding value of V!𝑐 such that ...
- `xhtml+latex` – XHTML text with mathematical formulae in LaTeX, surrounded by `<span class="math-container">` tags:
> ``` html
> <p>Finding value of <span class="math-container">$c$</span> such that ...
> ```
- `xhtml+pmml` – XHTML text with mathematical formulae in the [Presentation MathML][4] XML format, which encodes how a mathematical formula looks:
> ``` html
> <p>Finding value of <math><mi>c</mi></math> such that ...
> ```
- `xhtml+cmml` – XHTML text with mathematical formulae in the [Content MathML][3] XML format, which encodes the semantic content of a formula. This format is *much more difficult to work with*, but it allows you to represent mathematical formulae structurally and use XML Retrieval [3, Chapter 10]:
> ``` html
> <p>Finding value of <math><ci>𝑐</ci></math> such that ...
> ```

%% Cell type:markdown id: tags:
### Loading the answers
Next, we will define a class named `Answer` that will represent a preprocessed answer from the ARQMath 2020 collection. Tokenization and preprocessing of the `body` attribute of the individual answers as well as the creative use of the `upvotes` and `is_accepted` attributes is left to your imagination and craftsmanship.
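For instance, if you choose the `text+latex` encoding, you could tokenize the `body` so that each formula survives as a single token. The sketch below is only an illustration; the regular expression and the token scheme are assumptions, not a required approach:

``` python
import re

def tokenize_text_latex(body):
    """Split a `text+latex` body into word tokens and whole-formula tokens."""
    tokens = []
    # re.split with a capturing group alternates between text and formulae:
    # even-numbered fields are plain text, odd-numbered fields are LaTeX.
    for index, field in enumerate(re.split(r'\$(.*?)\$', body)):
        if index % 2 == 1:
            tokens.append('$' + field + '$')  # keep each formula whole
        else:
            tokens.extend(field.lower().split())
    return tokens

print(tokenize_text_latex('Finding value of $c$ such that ...'))
# ['finding', 'value', 'of', '$c$', 'such', 'that', '...']
```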
%% Cell type:code id: tags:
```
from pv211_utils.arqmath.entities import ArqmathAnswerBase
class Answer(ArqmathAnswerBase):
    """A preprocessed answer from the ARQMath 2020 collection.

    Parameters
    ----------
    document_id : str
        A unique identifier of the answer among all questions and answers.
    body : str
        The text of the answer, including mathematical formulae.
    upvotes : int
        The number of upvotes for the answer.
    is_accepted : bool
        Whether the answer has been accepted by the poster of the question.

    """
    def __init__(self, document_id: str, body: str, upvotes: int, is_accepted: bool):
        super().__init__(document_id, body, upvotes, is_accepted)
```

%% Cell type:markdown id: tags:
We will load answers into the `answers` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each answer is an instance of the `Answer` class that we have just defined.
%% Cell type:code id: tags:
```
from pv211_utils.arqmath.loader import load_answers

# The exact signature of load_answers (e.g. the choice of text encoding) is
# assumed here by analogy with the other loaders; check pv211_utils for details.
answers = load_answers(Answer)
print('\n'.join(repr(answer) for answer in list(answers.values())[:3]))
print('...')
print('\n'.join(repr(answer) for answer in list(answers.values())[-3:]))
```
%% Cell type:markdown id: tags:
For a demonstration, we will load [the accepted answer from the image above][1].
[1]:https://math.stackexchange.com/a/30741
%% Cell type:code id: tags:
```
answer = answers['30741']
answer
```
%% Cell type:code id: tags:
```
print(answer.body)
```
%% Cell type:code id: tags:
```
print(answer.upvotes)
```
%% Cell type:code id: tags:
```
print(answer.is_accepted)
```
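%% Cell type:markdown id: tags:
One simple illustration of using the `upvotes` and `is_accepted` attributes is to rerank results by combining a base similarity score with them. The weighting scheme and the weights below are made-up assumptions, not tuned values:

``` python
def rerank_score(base_similarity, upvotes, is_accepted,
                 upvote_weight=0.01, accept_bonus=0.1):
    """Combine a retrieval score with answer metadata (illustrative weights)."""
    return base_similarity + upvote_weight * upvotes + (accept_bonus if is_accepted else 0.0)

# A well-upvoted, accepted answer gets a small boost over its base score of 0.5.
print(rerank_score(0.5, 10, True))
```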
%% Cell type:markdown id: tags:
### Loading the questions
Next, we will define a class named `Question` that will represent a preprocessed question from the ARQMath 2020 collection. Tokenization and preprocessing of the `title` and `body` attributes of the individual questions as well as the creative use of the `tags`, `upvotes`, `views`, and `answers` attributes is left to your imagination and craftsmanship.
We will not be returning these questions from our search engine, but we could use them for example to look up similar existing questions to a query and then return the answers to these existing questions.
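The idea above can be sketched with toy data: pick the most similar existing question and return its answers. `ToyQuestion` and the similarity measure are stand-ins for illustration only; a real system would use the `Question` objects and a proper retrieval model:

``` python
from difflib import SequenceMatcher

class ToyQuestion:
    """A stand-in with the `title` and `answers` attributes of `Question`."""
    def __init__(self, title, answers):
        self.title, self.answers = title, answers

toy_questions = {
    '1': ToyQuestion('Finding value of c such that the range is all reals', ['answer-1']),
    '2': ToyQuestion('Proving that a sequence converges', ['answer-2']),
}

def answers_for(query_title):
    # Route the query to the answers of the most similar existing question.
    best = max(toy_questions.values(),
               key=lambda q: SequenceMatcher(None, query_title.lower(), q.title.lower()).ratio())
    return best.answers

print(answers_for('finding the value of c'))  # ['answer-1']
```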
%% Cell type:code id: tags:
```
from typing import List
from pv211_utils.arqmath.entities import ArqmathQuestionBase
class Question(ArqmathQuestionBase):
    """A preprocessed question from the ARQMath 2020 collection.

    Parameters
    ----------
    document_id : str
        A unique identifier of the question among all questions and answers.
    title : str
        The title of the question, including mathematical formulae.
    body : str
        The text of the question, including mathematical formulae.
    tags : list of str
        The tags of the question.
    upvotes : int
        The number of upvotes for the question.
    views : int
        The number of views of the question.
    answers : list of Answer
        The answers to the question.

    """
    def __init__(self, document_id: str, title: str, body: str, tags: List[str],
                 upvotes: int, views: int, answers: List[Answer]):
        super().__init__(document_id, title, body, tags, upvotes, views, answers)
```

%% Cell type:markdown id: tags:
We will load questions into the `questions` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each question is an instance of the `Question` class that we have just defined.
%% Cell type:code id: tags:
```
from pv211_utils.arqmath.loader import load_questions

# The exact signature of load_questions is assumed here; the questions link to
# the answers that we have already loaded.
questions = load_questions(Question, answers)
print('\n'.join(repr(question) for question in list(questions.values())[:3]))
print('...')
print('\n'.join(repr(question) for question in list(questions.values())[-3:]))
```
%% Cell type:markdown id: tags:
For a demonstration, we will load [the question from the image above][1].
[1]:https://math.stackexchange.com/q/30732
%% Cell type:code id: tags:
```
question = questions['30732']
question
```
%% Cell type:code id: tags:
```
print(question.title)
```
%% Cell type:code id: tags:
```
print(question.body)
```
%% Cell type:code id: tags:
```
print(question.tags)
```
%% Cell type:code id: tags:
```
print(question.upvotes)
```
%% Cell type:code id: tags:
```
print(question.views)
```
%% Cell type:code id: tags:
```
print(question.answers)
```
%% Cell type:code id: tags:
```
print([answer for answer in question.answers if answer.is_accepted])
```
%% Cell type:markdown id: tags:
### Loading the queries
Next, we will define a class named `Query` that will represent a preprocessed query from the answer retrieval task of ARQMath 2020. Tokenization and preprocessing of the `title` and `body` attributes of the individual queries as well as the creative use of the `tags` attribute is left to your imagination and craftsmanship.
%% Cell type:code id: tags:
```
from pv211_utils.arqmath.entities import ArqmathQueryBase
class Query(ArqmathQueryBase):
    """A preprocessed query from the answer retrieval task of ARQMath 2020.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    title : str
        The title of the query, including mathematical formulae.
    body : str
        The text of the query, including mathematical formulae.
    tags : list of str
        The tags of the query.

    """
    def __init__(self, query_id: int, title: str, body: str, tags: List[str]):
        super().__init__(query_id, title, body, tags)
```

%% Cell type:markdown id: tags:
We will load queries into the `train_queries` and `validation_queries` [ordered dictionaries](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each query is an instance of the `Query` class that we have just defined. You should use `train_queries`, `validation_queries`, and *relevance judgements* (see the next section) for training your supervised information retrieval system.
%% Cell type:markdown id: tags:
If you are training just a single machine learning model without any early stopping or hyperparameter optimization, you can use `bigger_train_queries` as the input.
If you are training a single machine learning model with early stopping or hyperparameter optimization, you can use `train_queries` for training your model and `validation_queries` to stop early or to select the optimal hyperparameters for your model. You can then use `bigger_train_queries` to train the model with the best number of epochs or the best hyperparameters.
If you are training many machine learning models with early stopping or hyperparameter optimization, then you can split your train judgements to smaller training and validation sets. Then, you can use `smaller_train_queries` for training your models, `smaller_validation_queries` to stop early or to select the optimal hyperparameters for your models, and `validation_queries` to select the best model. You can then use `bigger_train_queries` to train the best model with the best number of epochs or the best hyperparameters.
%% Cell type:code id: tags:
```
print('\n'.join(repr(query) for query in list(train_queries.values())[:3]))
print('...')
print('\n'.join(repr(query) for query in list(train_queries.values())[-3:]))
```
%% Cell type:markdown id: tags:
For a demonstration, we will look at query number 5. This is a query that is relatively easy to answer using just the text of the query, not the mathematical formulae. The user is asking for a computational solution to an interesting puzzle.
%% Cell type:code id: tags:
```
query = train_queries[5]
query
```
%% Cell type:code id: tags:
```
print(query.title)
```
%% Cell type:code id: tags:
```
print(query.body)
```
%% Cell type:code id: tags:
```
print(query.tags)
```
%% Cell type:markdown id: tags:
### Loading the relevance judgements
Next, we will load the train and validation relevance judgements into the `train_judgements` and `validation_judgements` sets. Relevance judgements specify which answers are relevant to which queries. You should use relevance judgements for training your supervised information retrieval system.
%% Cell type:markdown id: tags:
If you are training just a single machine learning model without any early stopping or hyperparameter optimization, you can use `bigger_train_judgements` as the input.
If you are training a single machine learning model with early stopping or hyperparameter optimization, you can use `train_judgements` for training your model and `validation_judgements` to stop early or to select the optimal hyperparameters for your model. You can then use `bigger_train_judgements` to train the model with the best number of epochs or the best hyperparameters.
If you are training many machine learning models with early stopping or hyperparameter optimization, then you can split your train judgements to smaller training and validation sets. Then, you can use `smaller_train_judgements` for training your models, `smaller_validation_judgements` to stop early or to select the optimal hyperparameters for your models, and `validation_judgements` to select the best model. You can then use `bigger_train_judgements` to train the best model with the best number of epochs or the best hyperparameters.
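The split described above can be sketched as follows. The toy data below stands in for the real `train_queries` and `train_judgements` (which pair `Query` and `Answer` objects rather than plain identifiers), and the 80/20 ratio is an illustrative assumption:

``` python
import random

# Toy stand-ins keyed and paired by query identifier.
train_queries = {1: 'query 1', 2: 'query 2', 3: 'query 3', 4: 'query 4', 5: 'query 5'}
train_judgements = {(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')}

random.seed(42)  # make the split reproducible
query_ids = sorted(train_queries)
random.shuffle(query_ids)
split_point = int(0.8 * len(query_ids))
smaller_train_ids = set(query_ids[:split_point])

smaller_train_queries = {i: q for i, q in train_queries.items() if i in smaller_train_ids}
smaller_validation_queries = {i: q for i, q in train_queries.items() if i not in smaller_train_ids}
smaller_train_judgements = {(i, a) for i, a in train_judgements if i in smaller_train_ids}
smaller_validation_judgements = {(i, a) for i, a in train_judgements if i not in smaller_train_ids}

print(len(smaller_train_queries), len(smaller_validation_queries))  # 4 1
```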
For a demonstration, we will look at query number 5 and show a relevant answer to the query and a non-relevant answer to the query.
%% Cell type:code id: tags:
```
query = train_queries[5]
relevant_answer = answers['1037824']
irrelevant_answer = answers['432200']
```
%% Cell type:code id: tags:
```
query
```
%% Cell type:code id: tags:
```
relevant_answer
```
%% Cell type:code id: tags:
```
irrelevant_answer
```
%% Cell type:code id: tags:
```
(query, relevant_answer) in train_judgements
```
%% Cell type:code id: tags:
```
(query, irrelevant_answer) in train_judgements
```
%% Cell type:markdown id: tags:
## Implementation of your information retrieval system
Next, we will define a class named `IRSystem` that will represent your information retrieval system. Your class must define a method named `search` that takes a query and returns answers in descending order of relevance to the query.
The example implementation returns answers in decreasing order of the TF-IDF cosine similarity between the answer and the query. You can use the example implementation as a basis of your system, or you can replace it with your own implementation.
%% Cell type:code id: tags:
```
from multiprocessing import get_context
from typing import Iterable, Union, List, Tuple
from pv211_utils.arqmath.irsystem import ArqmathIRSystemBase
from gensim.corpora import Dictionary
from gensim.matutils import cossim
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity
from gensim.utils import simple_preprocess
from tqdm import tqdm
class IRSystem(ArqmathIRSystemBase):
    """
    A system that returns answers ordered by decreasing TF-IDF cosine similarity.

    """
    def __init__(self):
        # Preprocess the answer bodies, then build a dictionary, a TF-IDF model,
        # and a sparse similarity index over the TF-IDF answer vectors.
        answer_bodies = [simple_preprocess(answer.body) for answer in answers.values()]
        self.dictionary = Dictionary(answer_bodies)
        self.tfidf_model = TfidfModel(dictionary=self.dictionary)
        answer_vectors = (self.tfidf_model[self.dictionary.doc2bow(body)] for body in answer_bodies)
        answer_vectors = tqdm(answer_vectors, desc='Building the index', total=len(answers))
        self.index = SparseMatrixSimilarity(answer_vectors, num_docs=len(answers), num_terms=len(self.dictionary))
        self.index_to_answer = dict(enumerate(answers.values()))

    def search(self, query: Query) -> Iterable[Answer]:
        """The answers ordered by decreasing TF-IDF cosine similarity to the query."""
        query_vector = self.tfidf_model[self.dictionary.doc2bow(simple_preprocess(query.body))]
        similarities = enumerate(self.index[query_vector])
        for answer_number, _ in sorted(similarities, key=lambda item: item[1], reverse=True):
            yield self.index_to_answer[answer_number]
```

%% Cell type:markdown id: tags:
Finally, we will evaluate your information retrieval system using [the Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP) evaluation measure.
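To build intuition for what MAP measures, here is a minimal sketch of average precision for a single query; the actual scoring is performed by pv211_utils, which may differ in details such as tie-breaking and judgement pooling:

``` python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list (simplified sketch)."""
    hits = 0
    precisions = []
    for rank, document_id in enumerate(ranked_ids, start=1):
        if document_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

# Mean average precision (MAP) is the mean of per-query average precisions.
print(average_precision(['a', 'b', 'c', 'd'], {'a', 'c'}))  # (1/1 + 2/3) / 2 ≈ 0.83
```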
%% Cell type:code id: tags:
```
from pv211_utils.arqmath.leaderboard import ArqmathLeaderboard
from pv211_utils.arqmath.eval import ArqmathEvaluation
```

%% Cell type:markdown id: tags:
“The Cranfield collection [...] was the pioneering test collection in allowing precise quantitative measures of information retrieval effectiveness [...]. Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.” [1, Section 8.2]
Your tasks, reviewed by your colleagues and the course instructors, are the following:
1. *Implement an unsupervised ranked retrieval system*, [1, Chapter 6] which will produce a list of documents from the Cranfield collection in descending order of relevance to a query from the Cranfield collection. You MUST NOT use relevance judgements from the Cranfield collection in your information retrieval system. Relevance judgements MUST only be used for the evaluation of your information retrieval system.
2. *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises. *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).
3. *Reach at least 22% mean average precision* [1, Section 8.4] with your system on the Cranfield collection. You MUST record your score either in [the public leaderboard](https://docs.google.com/spreadsheets/d/e/2PACX-1vT0FoFzCptIYKDsbcv8LebhZDe_20GFeBAPmS-VyImlWbqET0T7I2iWy59p9SHbUe3LX1yJMhALPcCY/pubhtml) or in this Jupyter notebook. You are encouraged to use techniques for tokenization [1, Section 2.2], document representation [1, Section 6.4], tolerant retrieval [1, Chapter 3], relevance feedback and query expansion [1, Chapter 9], and others discussed in the course.
4. _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).
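As an illustration of the query expansion technique mentioned in task 3, a pseudo-relevance feedback step might append frequent terms from the top-ranked documents to the query. The function and its parameters are illustrative assumptions, not part of the assignment interface:

``` python
from collections import Counter

def expand_query(query_terms, top_documents, num_new_terms=2):
    """Append the most frequent unseen terms from the top-ranked documents."""
    counts = Counter(term
                     for document in top_documents
                     for term in document
                     if term not in query_terms)
    return list(query_terms) + [term for term, _ in counts.most_common(num_new_terms)]

print(expand_query(['lift', 'drag'],
                   [['boundary', 'layer', 'lift'], ['boundary', 'flow']]))
```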
[1] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to information retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge university press, 2008.
%% Cell type:markdown id: tags:
## Loading the Cranfield collection
First, we will install [our library](https://gitlab.fi.muni.cz/xstefan3/pv211-utils) and load the Cranfield collection.
Next, we will define a class named `Document` that will represent a preprocessed document from the Cranfield collection. Tokenization and preprocessing of the `title` and `body` attributes of the individual documents as well as the creative use of the `authors`, `bibliography`, and `title` attributes is left to your imagination and craftsmanship.
%% Cell type:code id: tags:
```
from typing import List
from pv211_utils.cranfield.entities import CranfieldDocumentBase
from gensim.utils import simple_preprocess
class Document(CranfieldDocumentBase):
    """
    A preprocessed Cranfield collection document.

    Parameters
    ----------
    document_id : str
        A unique identifier of the document.
    authors : list of str
        Unique identifiers of the authors of the document.
    bibliography : str
        The bibliographical entry of the document.
    title : str
        The title of the document.
    body : str
        The abstract of the document.

    """
    def __init__(self, document_id: str, authors: List[str], bibliography: str, title: str, body: str):
        super().__init__(document_id, authors, bibliography, title, body)
```

%% Cell type:markdown id: tags:
We will load documents into the `documents` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each document is an instance of the `Document` class that we have just defined.
%% Cell type:code id: tags:
```
from pv211_utils.cranfield.loader import load_documents
documents = load_documents(Document)
```
%% Cell type:code id: tags:
```
print('\n'.join(repr(document) for document in list(documents.values())[:3]))
print('...')
print('\n'.join(repr(document) for document in list(documents.values())[-3:]))
```
%% Cell type:code id: tags:
```
document = documents['14']
document
```
%% Cell type:code id: tags:
```
print(document.authors)
```
%% Cell type:code id: tags:
```
print(document.bibliography)
```
%% Cell type:code id: tags:
```
print(document.title)
```
%% Cell type:code id: tags:
```
print(document.body)
```
%% Cell type:markdown id: tags:
### Loading the queries
Next, we will define a class named `Query` that will represent a preprocessed query from the Cranfield collection. Tokenization and preprocessing of the `body` attribute of the individual queries is left to your craftsmanship.
%% Cell type:code id: tags:
```
from pv211_utils.cranfield.entities import CranfieldQueryBase
class Query(CranfieldQueryBase):
    """
    A preprocessed Cranfield collection query.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    body : str
        The text of the query.

    """
    def __init__(self, query_id: int, body: str):
        super().__init__(query_id, body)
```
%% Cell type:markdown id: tags:
We will load queries into the `queries` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each query is an instance of the `Query` class that we have just defined.
%% Cell type:code id: tags:
```
from pv211_utils.cranfield.loader import load_queries
queries = load_queries(Query)
```
%% Cell type:code id: tags:
```
print('\n'.join(repr(query) for query in list(queries.values())[:3]))
print('...')
print('\n'.join(repr(query) for query in list(queries.values())[-3:]))
```
%% Cell type:code id: tags:
```
query = queries[14]
query
```
%% Cell type:code id: tags:
```
print(query.body)
```
%% Cell type:markdown id: tags:
## Implementation of your information retrieval system
Next, we will define a class named `IRSystem` that will represent your information retrieval system. Your class must define a method named `search` that takes a query and returns documents in descending order of relevance to the query.

The example implementation returns documents in decreasing order of the bag-of-words cosine similarity between the document and the query. You can use the example implementation as a basis of your system, or you can replace it with your own implementation.
%% Cell type:code id: tags:
```
from typing import Iterable
from pv211_utils.cranfield.irsystem import CranfieldIRSystemBase
from gensim.corpora import Dictionary
from gensim.matutils import cossim
from gensim.similarities import SparseMatrixSimilarity
from gensim.utils import simple_preprocess
from tqdm import tqdm
class IRSystem(CranfieldIRSystemBase):
    """
    A system that returns documents ordered by decreasing cosine similarity.

    Attributes
    ----------
    dictionary : Dictionary
        The dictionary of the system.
    index : SparseMatrixSimilarity
        The indexed documents.
    index_to_document : dict of (int, Document)
        A mapping from indexed document numbers to documents.

    """
    def __init__(self):
        document_bodies = (simple_preprocess(document.body) for document in documents.values())
        document_bodies = tqdm(document_bodies, desc='Building the dictionary', total=len(documents))
        dictionary = Dictionary(document_bodies)
        document_vectors = (dictionary.doc2bow(simple_preprocess(document.body)) for document in documents.values())
        document_vectors = tqdm(document_vectors, desc='Building the index', total=len(documents))
        index = SparseMatrixSimilarity(document_vectors, num_docs=len(documents), num_terms=len(dictionary))

        self.dictionary = dictionary
        self.index = index
        self.index_to_document = dict(enumerate(documents.values()))

    def search(self, query: Query) -> Iterable[Document]:
        """The documents ordered by decreasing bag-of-words cosine similarity to the query."""
        query_vector = self.dictionary.doc2bow(simple_preprocess(query.body))
        similarities = enumerate(self.index[query_vector])
        for document_number, _ in sorted(similarities, key=lambda item: item[1], reverse=True):
            yield self.index_to_document[document_number]
```

%% Cell type:markdown id: tags:
Finally, we will evaluate your information retrieval system using [the Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP) evaluation measure.
%% Cell type:code id: tags:
```
from pv211_utils.cranfield.loader import load_judgements
from pv211_utils.cranfield.leaderboard import CranfieldLeaderboard
from pv211_utils.cranfield.eval import CranfieldEvaluation
```

%% Cell type:markdown id: tags:
“The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992. [...] TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles. [...] Because the test document collections are so large, there are no exhaustive relevance judgments.” [1, Section 8.2]
Your tasks, reviewed by your colleagues and the course instructors, are the following:
1. *Implement a supervised ranked retrieval system*, [1, Chapter 15] which will produce a list of documents from the TREC collection in descending order of relevance to a query from the TREC collection. You SHOULD use training and validation relevance judgements from the TREC collection in your information retrieval system. Test judgements MUST only be used for the evaluation of your information retrieval system.
2. *Document your code* in accordance with [PEP 257](https://www.python.org/dev/peps/pep-0257/), ideally using [the NumPy style guide](https://numpydoc.readthedocs.io/en/latest/format.html#docstring-standard) as seen in the code from exercises. *Stick to a consistent coding style* in accordance with [PEP 8](https://www.python.org/dev/peps/pep-0008/).
3. *Reach at least 13.5% mean average precision* [1, Section 8.4] with your system on the TREC collection. You are encouraged to use techniques for tokenization [1, Section 2.2], document representation [1, Section 6.4], tolerant retrieval [1, Chapter 3], relevance feedback and query expansion [1, Chapter 9], learning to rank [1, Chapter 15], and others discussed in the course.
4. _[Upload an .ipynb file](https://is.muni.cz/help/komunikace/spravcesouboru#k_ss_1) with this Jupyter notebook to the homework vault in IS MU._ You MAY also include a brief description of your information retrieval system and a link to an external service such as [Google Colaboratory](https://colab.research.google.com/), [DeepNote](https://deepnote.com/), or [JupyterHub](https://iirhub.cloud.e-infra.cz/).
[1] Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to information retrieval*](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Cambridge university press, 2008.
%% Cell type:markdown id: tags:
## Loading the TREC collection
First, we will install [our library](https://gitlab.fi.muni.cz/xstefan3/pv211-utils) and load the TREC collection. If you are interested, you can take a peek at [how we preprocessed the raw TREC collection](https://colab.research.google.com/drive/1vT4UwsFCsi1xEZckqFVRzLgQS1kVzTuO) to the final dataset that we will be using.
Next, we will define a class named `Document` that will represent a preprocessed document from the TREC collection. Tokenization and preprocessing of the `body` attribute of the individual documents is left to your imagination and craftsmanship.
%% Cell type:code id: tags:
```
from pv211_utils.trec.entities import TrecDocumentBase
class Document(TrecDocumentBase):
    """
    A preprocessed TREC collection document.

    Parameters
    ----------
    document_id : str
        A unique identifier of the document.
    body : str
        The abstract of the document.

    """
    def __init__(self, document_id: str, body: str):
        super().__init__(document_id, body)
```
%% Cell type:markdown id: tags:
We will load documents into the `documents` [ordered dictionary](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each document is an instance of the `Document` class that we have just defined.
%% Cell type:code id: tags:
```
from pv211_utils.trec.loader import load_documents

documents = load_documents(Document)
print('\n'.join(repr(document) for document in list(documents.values())[:3]))
print('...')
print('\n'.join(repr(document) for document in list(documents.values())[-3:]))
```
%% Cell type:code id: tags:
```
document = documents['FT911-3']
document
```
%% Cell type:code id: tags:
```
print(document.body)
```
%% Cell type:markdown id: tags:
### Loading the queries
Next, we will define a class named `Query` that will represent a preprocessed query from the TREC collection. Tokenization and preprocessing of the `body` attribute of the individual queries as well as the creative use of the `title` and `narrative` attributes is left to your imagination and craftsmanship.
%% Cell type:code id: tags:
```
from pv211_utils.trec.entities import TrecQueryBase
class Query(TrecQueryBase):
    """
    A preprocessed TREC collection query.

    Parameters
    ----------
    query_id : int
        A unique identifier of the query.
    title : str
        Up to three words that best describe the query.
    body : str
        A one-sentence description of the topic area.
    narrative : str
        A concise description of what makes a document relevant.

    """
    def __init__(self, query_id: int, title: str, body: str, narrative: str):
        super().__init__(query_id, title, body, narrative)
```

%% Cell type:markdown id: tags:
We will load queries into the `train_queries` and `validation_queries` [ordered dictionaries](https://docs.python.org/3.8/library/collections.html#collections.OrderedDict). Each query is an instance of the `Query` class that we have just defined. You should use `train_queries`, `validation_queries`, and *relevance judgements* (see the next section) for training your supervised information retrieval system.
%% Cell type:markdown id: tags:
If you are training just a single machine learning model without any early stopping or hyperparameter optimization, you can use `bigger_train_queries` as the input.
If you are training a single machine learning model with early stopping or hyperparameter optimization, you can use `train_queries` for training your model and `validation_queries` to stop early or to select the optimal hyperparameters for your model. You can then use `bigger_train_queries` to train the model with the best number of epochs or the best hyperparameters.
If you are training many machine learning models with early stopping or hyperparameter optimization, then you can split your train judgements to smaller training and validation sets. Then, you can use `smaller_train_queries` for training your models, `smaller_validation_queries` to stop early or to select the optimal hyperparameters for your models, and `validation_queries` to select the best model. You can then use `bigger_train_queries` to train the best model with the best number of epochs or the best hyperparameters.
%% Cell type:code id: tags:
```
print('\n'.join(repr(query) for query in list(train_queries.values())[:3]))
print('...')
print('\n'.join(repr(query) for query in list(train_queries.values())[-3:]))
```
%% Cell type:code id: tags:
```
query = train_queries[301]
query
```
%% Cell type:code id: tags:
```
print(query.title)
```
%% Cell type:code id: tags:
```
print(query.body)
```
%% Cell type:code id: tags:
```
print(query.narrative)
```
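%% Cell type:markdown id: tags:
One simple way to use the `title` and `narrative` attributes alongside `body` is to concatenate the fields into a single query string, repeating the title to boost its terms. The boost factor is a made-up assumption to be tuned on the validation queries:

``` python
def query_text(title, body, narrative, title_boost=2):
    """Concatenate the query fields, repeating the title to weight it higher."""
    return ' '.join([title] * title_boost + [body, narrative])

print(query_text('oil exploration',
                 'Find documents about oil exploration.',
                 'Relevant documents discuss drilling sites.'))
```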
%% Cell type:markdown id: tags:
### Loading the relevance judgements
Next, we will load the train and validation relevance judgements into the `train_judgements` and `validation_judgements` sets. Relevance judgements specify which documents are relevant to which queries. You should use relevance judgements for training your supervised information retrieval system.
%% Cell type:markdown id: tags:
If you are training just a single machine learning model without any early stopping or hyperparameter optimization, you can use `bigger_train_judgements` as the input.
If you are training a single machine learning model with early stopping or hyperparameter optimization, you can use `train_judgements` for training your model and `validation_judgements` to stop early or to select the optimal hyperparameters for your model. You can then use `bigger_train_judgements` to train the model with the best number of epochs or the best hyperparameters.
If you are training many machine learning models with early stopping or hyperparameter optimization, then you can split your train judgements to smaller training and validation sets. Then, you can use `smaller_train_judgements` for training your models, `smaller_validation_judgements` to stop early or to select the optimal hyperparameters for your models, and `validation_judgements` to select the best model. You can then use `bigger_train_judgements` to train the best model with the best number of epochs or the best hyperparameters.
%% Cell type:markdown id: tags:
## Implementation of your information retrieval system
Next, we will define a class named `IRSystem` that will represent your information retrieval system. Your class must define a method named `search` that takes a query and returns documents in descending order of relevance to the query.
The example implementation returns documents in decreasing order of the TF-IDF cosine similarity between the document and the query. You can use the example implementation as a basis of your system, or you can replace it with your own implementation.
%% Cell type:code id: tags:
```
from multiprocessing import get_context
from typing import Iterable, Union, List, Tuple
from pv211_utils.trec.irsystem import TrecIRSystemBase
from gensim.corpora import Dictionary
from gensim.matutils import cossim
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity
from gensim.utils import simple_preprocess
from tqdm import tqdm
class IRSystem(TrecIRSystemBase):
    """
    A system that returns documents ordered by decreasing TF-IDF cosine similarity.

    Attributes
    ----------
    dictionary : Dictionary
        The dictionary of the system.
    tfidf_model : TfidfModel
        The TF-IDF model of the system.
    index : SparseMatrixSimilarity
        The indexed TF-IDF documents.
    index_to_document : dict of (int, Document)
        A mapping from indexed document numbers to documents.

    """
    def __init__(self):
        document_bodies = [simple_preprocess(document.body) for document in documents.values()]
        self.dictionary = Dictionary(document_bodies)
        self.tfidf_model = TfidfModel(dictionary=self.dictionary)
        document_vectors = (self.tfidf_model[self.dictionary.doc2bow(body)] for body in document_bodies)
        document_vectors = tqdm(document_vectors, desc='Building the index', total=len(documents))
        self.index = SparseMatrixSimilarity(document_vectors, num_docs=len(documents), num_terms=len(self.dictionary))
        self.index_to_document = dict(enumerate(documents.values()))

    def search(self, query: Query) -> Iterable[Document]:
        """The documents ordered by decreasing TF-IDF cosine similarity to the query."""
        query_vector = self.tfidf_model[self.dictionary.doc2bow(simple_preprocess(query.body))]
        similarities = enumerate(self.index[query_vector])
        for document_number, _ in sorted(similarities, key=lambda item: item[1], reverse=True):
            yield self.index_to_document[document_number]
```

%% Cell type:markdown id: tags:
Finally, we will evaluate your information retrieval system using [the Mean Average Precision](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision) (MAP) evaluation measure.
%% Cell type:code id: tags:
```
from pv211_utils.trec.leaderboard import TrecLeaderboard
```