Unverified Commit aa66ec64 authored by Vít Starý Novotný's avatar Vít Starý Novotný

Reimplement in python

parent faa2041c
README.md

+17 −3
````diff
 # ARQMath-eval
-Evaluation of the two methods of ARQMath 2020 competition:
+This repository evaluates the performance of your information retrieval system
+on a number of *tasks*:
 
-1. Answer retrieval
-2. Formula search
\ No newline at end of file
+- task1 -- [ARQMath Task1: Find Answers][arqmath-task1]
+
+Place your results in [the trec_eval format][treceval-format] into your
+dedicated directory *task/user*. To evaluate and publish your results,
+execute the following commands:
+
+``` sh
+$ git add task/user/result.tsv     # track your new result with Git
+$ pip install -r requirements.txt  # run the evaluation
+$ python -m evaluate
+$ git add -u                       # add the updated leaderboard to Git
+$ git push                         # publish your new result and the updated leaderboard
+```
+
+ [arqmath-task1]:   https://www.cs.rit.edu/~dprl/ARQMath/Task1-answers.html (Task 1: Find Answers)
+ [treceval-format]: https://stackoverflow.com/a/8175382/657401 (How to evaluate a search/retrieval engine using trec_eval?)
````
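For reference, each line of a result file in the trec_eval run format has six whitespace-separated columns: topic id, the literal string `Q0`, document id, rank, score, and a run name. A hypothetical excerpt (topic and document ids are made up):

```
A.1	Q0	doc-42	1	12.5	my-run
A.1	Q0	doc-17	2	11.0	my-run
A.2	Q0	doc-99	1	9.75	my-run
```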

evaluate.py

0 → 100644
+63 −0
``` python
# -*- coding:utf-8 -*-

from glob import glob
import os.path
import re

import numpy as np
from pytrec_eval import RelevanceEvaluator, parse_qrel, parse_run


TASKS = ['task1']
RELEVANCE_JUDGEMENTS = {
    'task1': 'qrel.V0.1.tsv',
}
TASK_README_HEAD = r'''
This table contains the best result for every user.

| nDCG | User | Result name |
|:-----|------|:------------|
'''.strip()
USER_README_HEAD = r'''
This table contains all results for $USER in descending order of task
performance.  Result names are based on the filenames of the results with
underscores (`_`) replaced with a comma and a space for improved readability.

| nDCG | Result name |
|------|:------------|
'''.strip()


if __name__ == '__main__':
    for task in TASKS:
        with open(os.path.join(task, RELEVANCE_JUDGEMENTS[task]), 'rt') as f:
            parsed_relevance_judgements = parse_qrel(f)
        evaluator = RelevanceEvaluator(parsed_relevance_judgements, {'ndcg'})
        task_results = []
        for user in glob(os.path.join(task, '*', '')):
            user = os.path.normpath(user)
            user_name = os.path.basename(user)
            user_results = []
            for result in glob(os.path.join(user, '*.tsv')):
                result_name = re.sub('_', ', ', os.path.basename(result)[:-4])
                with open(result, 'rt') as f:
                    parsed_result = parse_run(f)
                evaluation = evaluator.evaluate(parsed_result)
                ndcg = np.mean([
                    measures['ndcg']
                    for topic, measures
                    in evaluation.items()
                ])
                user_results.append((ndcg, result_name))
            best_ndcg, best_result_name = max(user_results)
            task_results.append((best_ndcg, user_name, best_result_name))
            with open(os.path.join(user, 'README.md'), 'wt') as f:
                f.write(USER_README_HEAD.replace('$USER', user_name))
                f.write('\n')
                for ndcg, result_name in sorted(user_results, reverse=True):
                    f.write('| %.4f | %s |\n' % (ndcg, result_name))
        with open(os.path.join(task, 'README.md'), 'wt') as f:
            f.write(TASK_README_HEAD)
            f.write('\n')
            for ndcg, user_name, result_name in sorted(task_results, reverse=True):
                f.write('| %.4f | %s | %s |\n' % (ndcg, user_name, result_name))
```
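The `parse_run` helper imported above comes from pytrec_eval; its behaviour can be sketched as follows (a simplified re-implementation for illustration, not the library's actual code — the topic and document ids are hypothetical):

``` python
from io import StringIO

def parse_run_sketch(f):
    # Each run line has six columns: topic id, "Q0", document id, rank,
    # score, and run name. The result maps topic id -> {document id: score}.
    run = {}
    for line in f:
        topic_id, _, doc_id, _, score, _ = line.split()
        run.setdefault(topic_id, {})[doc_id] = float(score)
    return run

# A hypothetical run excerpt with two topics.
example = StringIO(
    'A.1 Q0 doc1 1 12.5 example\n'
    'A.1 Q0 doc2 2 11.0 example\n'
    'A.2 Q0 doc3 1 9.75 example\n'
)
run = parse_run_sketch(example)
```

`RelevanceEvaluator.evaluate` then consumes such a dictionary and returns one measure dictionary per topic, which `evaluate.py` averages into a single nDCG score.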

evaluate.sh

deleted 100755 → 0

+0 −64
``` sh
#!/bin/bash
set -e
shopt -s nullglob

if [[ ! -e trec_eval ]]
then
  git clone https://github.com/usnistgov/trec_eval
  make -j -C trec_eval
fi

cd task1
# summary task 1 table header
cat > README-head.md << EOT
This table contains the best results for every user.

| User | nDCG | Result name |
|:-----|------|:------------|
EOT
for USER in */
do
  cd $USER
  # per-user task 1 table header
  cat > README-head.md << EOT
This table contains all results for $USER in descending order of task performance.  
Result names are based on the filenames of the results with underscores (\`_\`) replaced with a comma and a space for improved readability.

| nDCG | Result name |
|------|:------------|
EOT
  for RESULT in *.tsv
  do
    NDCG=$(../../trec_eval/trec_eval ../qrel.V0.1.tsv "$RESULT" -m ndcg | awk '{ print $3 }')
    # per-user task 1 table entries
    cat >> README-tail.md << EOT
| $NDCG | $(printf '%s\n' "${RESULT%.tsv}" | sed 's/_/, /g') |
EOT
  done
  (cat README-head.md && LC_ALL=C sort -k 2 -k 4 README-tail.md | tee >(
    # summary task 1 table header
    head -1 | while read LINE
    do
      printf '%s%s\n' "| [${USER%/}]($USER) " "$LINE"
    done >> ../README-tail.md
  )) > README.md
  rm README-head.md README-tail.md
  git add README.md
  cd ..
done
(cat README-head.md && LC_ALL=C sort -k 4 -k 2 -k 6 README-tail.md) > README.md
rm README-head.md README-tail.md
git add README.md
cd ..

if ! git diff --staged --quiet
then
  git commit -m 'Update result tables' --quiet
  if ! git push --quiet
  then
    git fetch
    git rebase master
    printf 'Failed to git push\n' >&2
    exit 1
  fi
fi
```
task1/README.md

+6 −6
```diff
-This table contains the best results for every user.
+This table contains the best result for every user.
 
-| User | nDCG | Result name |
+| nDCG | User | Result name |
 |:-----|------|:------------|
-| [ayetiran](ayetiran/) | 0.5181 | example, key1=value1, key2=value2, etc |
-| [xluptak4](xluptak4/) | 0.5181 | example, key1=value1, key2=value2, etc |
-| [xnovot32](xnovot32/) | 0.5181 | example, key1=value1, key2=value2, etc |
-| [xstefan3](xstefan3/) | 0.5181 | example, key1=value1, key2=value2, etc |
+| 0.5181 | xstefan3 | example, key1=value1, key2=value2, etc |
+| 0.5181 | xnovot32 | example, key1=value1, key2=value2, etc |
+| 0.5181 | xluptak4 | example, key1=value1, key2=value2, etc |
+| 0.5181 | ayetiran | example, key1=value1, key2=value2, etc |
```
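The reversed ordering of the tied rows falls out of tuple sorting in the new `evaluate.py`: task results are `(nDCG, user, result name)` tuples sorted with `reverse=True`, so ties on nDCG are broken by user name in descending order. A minimal sketch with the values from the table:

``` python
# Hypothetical task results as collected by evaluate.py: (nDCG, user, result name).
task_results = [
    (0.5181, 'ayetiran', 'example, key1=value1, key2=value2, etc'),
    (0.5181, 'xluptak4', 'example, key1=value1, key2=value2, etc'),
    (0.5181, 'xnovot32', 'example, key1=value1, key2=value2, etc'),
    (0.5181, 'xstefan3', 'example, key1=value1, key2=value2, etc'),
]
# Tuples compare element by element, so equal nDCG scores fall back to the
# user name; reverse=True therefore yields descending nDCG, then user name.
ordered = sorted(task_results, reverse=True)
```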
task1/ayetiran/README.md

+3 −2
```diff
-This table contains all results for ayetiran/ in descending order of task performance.  
-Result names are based on the filenames of the results with underscores (`_`) replaced with a comma and a space for improved readability.
+This table contains all results for $USER in descending order of task
+performance.  Result names are based on the filenames of the results with
+underscores (`_`) replaced with a comma and a space for improved readability.
 
 | nDCG | Result name |
 |------|:------------|
```