Unverified Commit 2eadf7ff authored by Vít Novotný

Add ARQMath-2 relevance judgements

parent 594d7850
@@ -5,11 +5,13 @@ on a number of *tasks*:
- `task1-example` – [ARQMath Task1][arqmath-task1] example dataset,
- `task1-votes` – [ARQMath Task1][arqmath-task1] Math StackExchange [user votes][],
-- `task1` – [ARQMath Task1][arqmath-task1] final dataset,
+- `task1`, `task1-2020` – [ARQMath Task1][arqmath-task1] final dataset,
+- `task1-2021` – [ARQMath-2 Task1][arqmath-task1] final dataset,
- `ntcir-11-math-2-main` – [NTCIR-11 Math-2 Task Main Subtask][ntcir-11-math-2],
- `ntcir-12-mathir-arxiv-main` – [NTCIR-12 MathIR Task ArXiv Main Subtask][ntcir-12-mathir], and
- `ntcir-12-mathir-math-wiki-formula` – [NTCIR-12 MathIR Task MathWikiFormula Subtask][ntcir-12-mathir].
-- `task2` – [ARQMath Task2][arqmath-task2] final dataset,
+- `task2`, `task2-2020` – [ARQMath Task2][arqmath-task2] final dataset,
+- `task2-2021` – [ARQMath-2 Task2][arqmath-task2] final dataset,
The main tasks are:
@@ -28,15 +30,34 @@ Each task comes with three *subsets*:
used at the end to compare the systems that performed best on the
validation set.
-The `task1` and `task2` tasks come also with the `all` subset, which contains
+The `task1` and `task2` tasks also come with the `all` subset, which contains
all relevance judgements. Use these to evaluate a system that has not been
trained using subsets of the `task1` and `task2` tasks.
+The `task1` and `task2` tasks also come with a different subset split used by
+the MIRMU and MSM teams in their ARQMath-2 competition submissions. This split
+is also used in [the pv211-utils library][pv211-utils]:
+
+- `train-pv211-utils` – The training set, which you can use for supervised
+  training of your system.
+- `validation-pv211-utils` – The validation set, which you can use for
+  hyperparameter optimization or model selection.
+  The training set is further split into the `smaller-train-pv211-utils` and
+  `smaller-validation-pv211-utils` subsets in case you need two validation
+  sets, e.g. one for hyperparameter optimization and one for model selection.
+  If you need neither hyperparameter optimization nor model selection, you
+  can use the `bigger-train-pv211-utils` subset, which combines the
+  `train-pv211-utils` and `validation-pv211-utils` subsets.
+- `test-pv211-utils` – The test set, which you should currently use only for
+  the final performance estimation of your system.
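The relationship between these splits can be sketched in plain Python. The qrel lines below are toy data in the [trec_eval format][treceval-format] (topic, iteration, document, relevance), not the real files, which ship with the `arqmath_eval` package:

```python
# Toy sketch of the split invariant: bigger-train-pv211-utils combines
# the topics of train-pv211-utils and validation-pv211-utils.
# The qrel lines below are made up for illustration.
def topics(qrel_lines):
    """Collect topic identifiers from qrel lines in trec_eval format."""
    return {line.split()[0] for line in qrel_lines}

train = ['A.1 0 doc1 2', 'A.2 0 doc7 1']
validation = ['A.3 0 doc2 3']
bigger_train = topics(train) | topics(validation)
print(sorted(bigger_train))  # ['A.1', 'A.2', 'A.3']
```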
### Examples
#### Using the `train` subset to train your supervised system
``` sh
-$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.20
+$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.21
$ python
>>> from arqmath_eval import get_topics, get_judged_documents, get_ndcg
>>>
@@ -65,7 +86,7 @@ Here is the documentation of the available evaluation functions:
#### Using the `validation` subset to compare various parameters of your system
``` sh
-$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.20
+$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.21
$ python
>>> from arqmath_eval import get_topics, get_judged_documents
>>>
@@ -96,8 +117,8 @@ $ git push # publish your new result and the upd
#### Using the `all` subset to compute the NDCG' score of an ARQMath submission
``` sh
-$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.20
-$ python -m arqmath_eval.evaluate MIRMU-task1-Ensemble-auto-both-A.tsv all
+$ pip install --force-reinstall git+https://github.com/MIR-MU/ARQMath-eval@0.0.21
+$ python -m arqmath_eval.evaluate MIRMU-task1-Ensemble-auto-both-A.tsv all 2020
0.238, 95% CI: [0.198; 0.278]
```
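The score above is reported with a 95% confidence interval over per-topic scores. As an illustration only (the package's exact method may differ), one common construction uses the normal approximation to the standard error of the mean. The per-topic scores below are made up:

```python
import math

# Illustration: a 95% confidence interval for a mean of per-topic scores
# via the normal approximation. Toy scores, not real NDCG' values.
scores = [0.1, 0.3, 0.2, 0.35, 0.25, 0.15, 0.3, 0.2]
mean = sum(scores) / len(scores)
# Sample variance (Bessel's correction), then the standard error of the mean.
var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
half_width = 1.96 * math.sqrt(var / len(scores))
print('%.3f, 95%% CI: [%.3f; %.3f]' % (mean, mean - half_width, mean + half_width))
```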
@@ -113,3 +134,4 @@ $ python -m arqmath_eval.evaluate MIRMU-task1-Ensemble-auto-both-A.tsv all
[ntcir-12-mathir]: https://www.cs.rit.edu/~rlaz/files/ntcir12-mathir.pdf (NTCIR-12 MathIR Task Overview)
[treceval-format]: https://stackoverflow.com/a/8175382/657401 (How to evaluate a search/retrieval engine using trec_eval?)
[user votes]: https://gitlab.fi.muni.cz/xnovot32/arqmath-data-preprocessing/-/blob/master/scripts/xml_to_qrels_tsv.py
+[pv211-utils]: https://gitlab.fi.muni.cz/xstefan3/pv211-utils (Utilities for PV211 project)
@@ -19,8 +19,10 @@ underscores (`_`) replaced with a comma and a space for improved readability.
'''.strip()

RELEVANCE_JUDGEMENTS = {
    'train': {
-        'task1': 'qrel_task1-train.tsv',
-        'task2': 'qrel_task2-train.tsv',
+        'task1': 'qrel_task1_2020-train.tsv',
+        'task1-2020': 'qrel_task1_2020-train.tsv',
+        'task2': 'qrel_task2_2020-train.tsv',
+        'task2-2020': 'qrel_task2_2020-train.tsv',
        'task1-example': 'qrel.V1.0-train.tsv',
        'task1-votes': 'votes-qrels-train.V1.0.tsv',
        'ntcir-11-math-2-main': 'NTCIR11_Math-qrels-train.dat',
@@ -28,20 +30,25 @@ RELEVANCE_JUDGEMENTS = {
        'ntcir-12-mathir-math-wiki-formula': 'NTCIR12_MathWikiFrm-qrels_agg-train.dat',
    },
    'train-pv211-utils': {
-        'task1': 'qrel_task1-train-pv211-utils.tsv',
+        'task1': 'qrel_task1_2020-train-pv211-utils.tsv',
+        'task1-2020': 'qrel_task1_2020-train-pv211-utils.tsv',
    },
    'smaller-train-pv211-utils': {
-        'task1': 'qrel_task1-smaller-train-pv211-utils.tsv',
+        'task1': 'qrel_task1_2020-smaller-train-pv211-utils.tsv',
+        'task1-2020': 'qrel_task1_2020-smaller-train-pv211-utils.tsv',
    },
    'bigger-train-pv211-utils': {
-        'task1': 'qrel_task1-bigger-train-pv211-utils.tsv',
+        'task1': 'qrel_task1_2020-bigger-train-pv211-utils.tsv',
+        'task1-2020': 'qrel_task1_2020-bigger-train-pv211-utils.tsv',
    },
    'small-validation': {
        'task1-votes': 'votes-qrels-small-validation.V1.0.tsv',
    },
    'validation': {
-        'task1': 'qrel_task1-validation.tsv',
-        'task2': 'qrel_task2-validation.tsv',
+        'task1': 'qrel_task1_2020-validation.tsv',
+        'task1-2020': 'qrel_task1_2020-validation.tsv',
+        'task2': 'qrel_task2_2020-validation.tsv',
+        'task2-2020': 'qrel_task2_2020-validation.tsv',
        'task1-example': 'qrel.V1.0-validation.tsv',
        'task1-votes': 'votes-qrels-validation.V1.0.tsv',
        'ntcir-11-math-2-main': 'NTCIR11_Math-qrels-validation.dat',
@@ -49,14 +56,18 @@ RELEVANCE_JUDGEMENTS = {
        'ntcir-12-mathir-math-wiki-formula': 'NTCIR12_MathWikiFrm-qrels_agg-validation.dat',
    },
    'validation-pv211-utils': {
-        'task1': 'qrel_task1-validation-pv211-utils.tsv',
+        'task1': 'qrel_task1_2020-validation-pv211-utils.tsv',
+        'task1-2020': 'qrel_task1_2020-validation-pv211-utils.tsv',
    },
    'smaller-validation-pv211-utils': {
-        'task1': 'qrel_task1-smaller-validation-pv211-utils.tsv',
+        'task1': 'qrel_task1_2020-smaller-validation-pv211-utils.tsv',
+        'task1-2020': 'qrel_task1_2020-smaller-validation-pv211-utils.tsv',
    },
    'test': {
-        'task1': 'qrel_task1-test.tsv',
-        'task2': 'qrel_task2-test.tsv',
+        'task1': 'qrel_task1_2020-test.tsv',
+        'task1-2020': 'qrel_task1_2020-test.tsv',
+        'task2': 'qrel_task2_2020-test.tsv',
+        'task2-2020': 'qrel_task2_2020-test.tsv',
        'task1-example': 'qrel.V1.0-test.tsv',
        'task1-votes': 'votes-qrels-test.V1.0.tsv',
        'ntcir-11-math-2-main': 'NTCIR11_Math-qrels-test.dat',
@@ -64,11 +75,16 @@ RELEVANCE_JUDGEMENTS = {
        'ntcir-12-mathir-math-wiki-formula': 'NTCIR12_MathWikiFrm-qrels_agg-test.dat',
    },
    'test-pv211-utils': {
-        'task1': 'qrel_task1-test-pv211-utils.tsv',
+        'task1': 'qrel_task1_2020-test-pv211-utils.tsv',
+        'task1-2020': 'qrel_task1_2020-test-pv211-utils.tsv',
    },
    'all': {
-        'task1': 'qrel_task1.tsv',
-        'task2': 'qrel_task2.tsv',
+        'task1': 'qrel_task1_2020.tsv',
+        'task1-2020': 'qrel_task1_2020.tsv',
+        'task1-2021': 'qrel_task1_2021.tsv',
+        'task2': 'qrel_task2_2020.tsv',
+        'task2-2020': 'qrel_task2_2020.tsv',
+        'task2-2021': 'qrel_task2_2021.tsv',
        'task1-votes.V1.2': 'votes-qrels.V1.2.tsv',
        'task2-topics-formula_ids.V.1.1': 'topics-formula_ids-qrels.V1.1.tsv',
    }
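The `RELEVANCE_JUDGEMENTS` mapping above resolves a subset name and a task name to a packaged qrel filename. A minimal sketch of that lookup, using a trimmed copy of the dictionary (the full mapping ships with the `arqmath_eval` package):

```python
# Trimmed copy of the RELEVANCE_JUDGEMENTS mapping, for illustration only.
RELEVANCE_JUDGEMENTS = {
    'all': {
        'task1-2020': 'qrel_task1_2020.tsv',
        'task1-2021': 'qrel_task1_2021.tsv',
    },
}

def qrel_filename(subset, task):
    """Resolve a subset and task to the packaged qrel filename."""
    return RELEVANCE_JUDGEMENTS[subset][task]

print(qrel_filename('all', 'task1-2021'))  # qrel_task1_2021.tsv
```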
@@ -66,15 +66,15 @@ def produce_leaderboards():
            f_readme.write('| %.4f | %s | %s |\n' % (ndcg, result_name, user_name))

-def evaluate_run(filename, subset, confidence=95.0):
+def evaluate_run(filename, subset, year, confidence=95.0):
    with open(filename, 'rt') as f:
        lines = [line.strip().split() for line in f]
    first_line = lines[0]
    n = len(first_line)
    if n == 5:
-        task = 'task1'
+        task = 'task1-{}'.format(year)
    elif n == 6:
-        task = 'task2'
+        task = 'task2-{}'.format(year)
    else:
        raise ValueError(
            'Expected lines as 5-tuples (Query_Id, Post_Id, Rank, Score, Run_Number) for task 1, '
@@ -99,8 +99,10 @@ if __name__ == '__main__':
    if len(sys.argv) == 1:
        produce_leaderboards()
    elif len(sys.argv) == 2:
-        evaluate_run(sys.argv[1], 'all')
+        evaluate_run(sys.argv[1], 'all', 2020)
    elif len(sys.argv) == 3:
-        evaluate_run(sys.argv[1], sys.argv[2])
+        evaluate_run(sys.argv[1], sys.argv[2], 2020)
+    elif len(sys.argv) == 4:
+        evaluate_run(sys.argv[1], sys.argv[2], int(sys.argv[3]))
    else:
-        raise ValueError("Usage: {} [TSV_FILE [SUBSET]]".format(sys.argv[0]))
+        raise ValueError("Usage: {} [TSV_FILE [SUBSET [YEAR]]]".format(sys.argv[0]))
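The dispatch inside `evaluate_run` can be illustrated in isolation: a run-file line with five whitespace-separated fields is treated as a Task 1 run, a line with six fields as a Task 2 run, and the new `year` parameter selects the qrel vintage. A toy reimplementation of just that logic (the field values below are made up):

```python
# Toy reimplementation of the task-detection logic in evaluate_run:
# the number of fields on a run-file line picks the task, and the
# year picks the qrel vintage (2020 or 2021).
def detect_task(line, year):
    n = len(line.split())
    if n == 5:
        return 'task1-{}'.format(year)
    elif n == 6:
        return 'task2-{}'.format(year)
    raise ValueError('expected 5 fields (task 1) or 6 fields (task 2), got {}'.format(n))

print(detect_task('A.1 post123 1 12.5 run0', 2021))  # task1-2021
```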
@@ -5,7 +5,7 @@ from setuptools import setup

setup(
    name='arqmath_eval',
-    version='0.0.20',
+    version='0.0.21',
    description='Evaluation of ARQMath systems',
    packages=['arqmath_eval'],
    package_dir={'arqmath_eval': 'scripts'},
@@ -35,20 +35,22 @@ setup(
            'votes-qrels-test.V1.0.tsv',
            'votes-qrels.V1.2.tsv',
            'topics-formula_ids-qrels.V1.1.tsv',
-            'qrel_task1-test.tsv',
-            'qrel_task1-train.tsv',
-            'qrel_task1.tsv',
-            'qrel_task1-validation.tsv',
-            'qrel_task2-test.tsv',
-            'qrel_task2-train.tsv',
-            'qrel_task2.tsv',
-            'qrel_task2-validation.tsv',
-            'qrel_task1-bigger-train-pv211-utils.tsv',
-            'qrel_task1-smaller-train-pv211-utils.tsv',
-            'qrel_task1-smaller-validation-pv211-utils.tsv',
-            'qrel_task1-test-pv211-utils.tsv',
-            'qrel_task1-train-pv211-utils.tsv',
-            'qrel_task1-validation-pv211-utils.tsv',
+            'qrel_task1_2020-test.tsv',
+            'qrel_task1_2020-train.tsv',
+            'qrel_task1_2020.tsv',
+            'qrel_task1_2020-validation.tsv',
+            'qrel_task2_2020-test.tsv',
+            'qrel_task2_2020-train.tsv',
+            'qrel_task2_2020.tsv',
+            'qrel_task2_2020-validation.tsv',
+            'qrel_task1_2020-bigger-train-pv211-utils.tsv',
+            'qrel_task1_2020-smaller-train-pv211-utils.tsv',
+            'qrel_task1_2020-smaller-validation-pv211-utils.tsv',
+            'qrel_task1_2020-test-pv211-utils.tsv',
+            'qrel_task1_2020-train-pv211-utils.tsv',
+            'qrel_task1_2020-validation-pv211-utils.tsv',
+            'qrel_task1_2021.tsv',
+            'qrel_task2_2021.tsv',
        ],
    },
    include_package_data=True,