**README.md** (+17 −3)

```` diff
 # ARQMath-eval

-Evaluation of the two methods of ARQMath 2020 competition:
-
-1. Answer retrieval
-2. Formula search
\ No newline at end of file
+This repository evaluates the performance of your information retrieval system on a number of *tasks*:
+
+- task1 -- [ARQMath Task1: Find Answers][arqmath-task1]
+
+Place your results in [the trec\_eval format][treceval-format] into your dedicated directory *task/user*.
+To evaluate and publish your results, execute the following commands:
+
+``` sh
+$ git add task/user/result.tsv     # track your new result with Git
+$ pip install -r requirements.txt
+$ python -m evaluate               # run the evaluation
+$ git add -u                       # add the updated leaderboard to Git
+$ git push                         # publish your new result and the updated leaderboard
+```
+
+[arqmath-task1]: https://www.cs.rit.edu/~dprl/ARQMath/Task1-answers.html (Task 1: Find Answers)
+[treceval-format]: https://stackoverflow.com/a/8175382/657401 (How to evaluate a search/retrieval engine using trec_eval?)
````
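For orientation, each line of a run in the trec\_eval format carries six whitespace-separated columns: topic id, an unused literal (conventionally `Q0`), retrieved document id, rank, similarity score, and run id. A minimal sketch of writing such a file follows; the topic and answer ids are made-up placeholders, not actual ARQMath topics:

``` python
# A minimal sketch of writing result.tsv in the trec_eval run format.
# The topic and answer ids below are made-up placeholders.
ranked_answers = {
    'A.1': [('1234', 12.5), ('5678', 11.0)],  # topic id -> [(answer id, score)]
    'A.2': [('4242', 9.7)],
}

with open('result.tsv', 'wt') as f:
    for topic_id, answers in ranked_answers.items():
        for rank, (answer_id, score) in enumerate(answers, start=1):
            # columns: topic id, unused literal, document id, rank, score, run id
            columns = (topic_id, 'Q0', answer_id, str(rank), str(score), 'my-run')
            f.write('\t'.join(columns) + '\n')
```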
**evaluate.py** (new file, mode 100644, +63 −0)

``` python
# -*- coding:utf-8 -*-
from glob import glob
import os.path
import re

import numpy as np
from pytrec_eval import RelevanceEvaluator, parse_qrel, parse_run

TASKS = ['task1']
RELEVANCE_JUDGEMENTS = {
    'task1': 'qrel.V0.1.tsv',
}
TASK_README_HEAD = r'''
This table contains the best result for every user.

| nDCG | User | Result name |
|:-----|------|:------------|
'''.strip()
USER_README_HEAD = r'''
This table contains all results for $USER in descending order of task performance. Result names are based on the filenames of the results with underscores (`_`) replaced with a comma and a space for improved readability.

| nDCG | Result name |
|------|:------------|
'''.strip()

if __name__ == '__main__':
    for task in TASKS:
        # Parse the relevance judgements for the task.
        with open(os.path.join(task, RELEVANCE_JUDGEMENTS[task]), 'rt') as f:
            parsed_relevance_judgements = parse_qrel(f)
        evaluator = RelevanceEvaluator(parsed_relevance_judgements, {'ndcg'})
        task_results = []
        for user in glob(os.path.join(task, '*', '')):
            user = os.path.normpath(user)
            user_name = os.path.basename(user)
            user_results = []
            # Evaluate every result of the user.
            for result in glob(os.path.join(user, '*.tsv')):
                result_name = re.sub('_', ', ', os.path.basename(result)[:-4])
                with open(result, 'rt') as f:
                    parsed_result = parse_run(f)
                evaluation = evaluator.evaluate(parsed_result)
                # Average nDCG over all topics in the run.
                ndcg = np.mean([
                    measures['ndcg']
                    for topic, measures in evaluation.items()
                ])
                user_results.append((ndcg, result_name))
            best_ndcg, best_result_name = max(user_results)
            task_results.append((best_ndcg, user_name, best_result_name))
            # Write the per-user leaderboard.
            with open(os.path.join(user, 'README.md'), 'wt') as f:
                f.write(USER_README_HEAD.replace('$USER', user_name))
                f.write('\n')
                for ndcg, result_name in sorted(user_results, reverse=True):
                    f.write('| %.4f | %s |\n' % (ndcg, result_name))
        # Write the summary leaderboard for the task.
        with open(os.path.join(task, 'README.md'), 'wt') as f:
            f.write(TASK_README_HEAD)
            f.write('\n')
            for ndcg, user_name, result_name in sorted(task_results, reverse=True):
                f.write('| %.4f | %s | %s |\n' % (ndcg, user_name, result_name))
```
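Since pytrec_eval evaluates plain nested dictionaries, the computation above is easy to sanity-check in isolation. Below is a minimal sketch with made-up relevance judgements and a made-up run; the topic and document ids are placeholders, not the actual qrel.V0.1.tsv data:

``` python
import numpy as np
from pytrec_eval import RelevanceEvaluator

# Made-up relevance judgements: topic id -> document id -> relevance grade.
relevance_judgements = {
    'A.1': {'1234': 2, '5678': 0},
    'A.2': {'4242': 1, '1234': 0},
}
# Made-up run: topic id -> document id -> retrieval score.
run = {
    'A.1': {'1234': 12.5, '5678': 11.0},
    'A.2': {'4242': 9.7},
}

evaluator = RelevanceEvaluator(relevance_judgements, {'ndcg'})
evaluation = evaluator.evaluate(run)  # topic id -> {'ndcg': score}

# Average nDCG over topics, as evaluate.py does for every result file.
ndcg = np.mean([measures['ndcg'] for measures in evaluation.values()])
print('%.4f' % ndcg)
```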
**evaluate.sh** (deleted, was mode 100755, +0 −64)

``` bash
#!/bin/bash
set -e
shopt -s nullglob

if [[ ! -e trec_eval ]]
then
    git clone https://github.com/usnistgov/trec_eval
    make -j -C trec_eval
fi

cd task1

# summary task 1 table header
cat > README-head.md << EOT
This table contains the best results for every user.

| User | nDCG | Result name |
|:-----|------|:------------|
EOT

for USER in */
do
    cd "$USER"
    # per-user task 1 table header
    cat > README-head.md << EOT
This table contains all results for $USER in descending order of task performance. Result names are based on the filenames of the results with underscores (\`_\`) replaced with a comma and a space for improved readability.

| nDCG | Result name |
|------|:------------|
EOT
    for RESULT in *.tsv
    do
        NDCG=$(../../trec_eval/trec_eval ../qrel.V0.1.tsv "$RESULT" -m ndcg | awk '{ print $3 }')
        # per-user task 1 table entries
        cat >> README-tail.md << EOT
| $NDCG | $(printf '%s\n' "${RESULT%.tsv}" | sed 's/_/, /g') |
EOT
    done
    (cat README-head.md && LC_ALL=C sort -k 2 -k 4 README-tail.md | tee >(
        # summary task 1 table entries
        head -1 | while read LINE
        do
            printf '%s%s\n' "| [${USER%/}]($USER) " "$LINE"
        done >> ../README-tail.md
    )) > README.md
    rm README-head.md README-tail.md
    git add README.md
    cd ..
done

(cat README-head.md && LC_ALL=C sort -k 4 -k 2 -k 6 README-tail.md) > README.md
rm README-head.md README-tail.md
git add README.md
cd ..

if ! git diff --staged --quiet
then
    git commit -m 'Update result tables' --quiet
    if ! git push --quiet
    then
        git fetch
        git rebase master
        printf 'Failed to git push\n' >&2
        exit 1
    fi
fi
```
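For context on the `awk '{ print $3 }'` extraction in the deleted script: trec_eval prints one line per measure in three whitespace-separated columns (measure name, query id or `all`, value), so the third field is the score. A minimal Python sketch of that pipeline, assuming the same locally built trec_eval binary; `trec_eval_ndcg` is a hypothetical helper, not part of the repository:

``` python
import subprocess

def trec_eval_ndcg(qrel_path, run_path, binary='trec_eval/trec_eval'):
    """Hypothetical helper mirroring the deleted shell pipeline."""
    # trec_eval prints e.g. 'ndcg    all    0.5181'; the third
    # whitespace-separated field is the score that the script
    # extracted with `awk '{ print $3 }'`.
    process = subprocess.run(
        [binary, qrel_path, run_path, '-m', 'ndcg'],
        capture_output=True, text=True, check=True,
    )
    return float(process.stdout.split()[2])
```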
**task1/README.md** (+6 −6)

``` diff
-This table contains the best results for every user.
+This table contains the best result for every user.

-| User | nDCG | Result name |
+| nDCG | User | Result name |
 |:-----|------|:------------|
-| [ayetiran](ayetiran/) | 0.5181 | example, key1=value1, key2=value2, etc |
-| [xluptak4](xluptak4/) | 0.5181 | example, key1=value1, key2=value2, etc |
-| [xnovot32](xnovot32/) | 0.5181 | example, key1=value1, key2=value2, etc |
-| [xstefan3](xstefan3/) | 0.5181 | example, key1=value1, key2=value2, etc |
+| 0.5181 | xstefan3 | example, key1=value1, key2=value2, etc |
+| 0.5181 | xnovot32 | example, key1=value1, key2=value2, etc |
+| 0.5181 | xluptak4 | example, key1=value1, key2=value2, etc |
+| 0.5181 | ayetiran | example, key1=value1, key2=value2, etc |
```
**task1/ayetiran/README.md** (+3 −2)

``` diff
-This table contains all results for ayetiran/ in descending order of task performance. Result names are based on the filenames of the results with underscores (`_`) replaced with a comma and a space for improved readability.
+This table contains all results for ayetiran in descending order of task performance. Result names are based on the filenames of the results with underscores (`_`) replaced with a comma and a space for improved readability.

 | nDCG | Result name |
 |------|:------------|
```