`get_ndcg()` produces results far from `arqmath_eval.evaluate`
I expect to get somewhat similar results when running scripts 1. and 2. below:

- setup: download `results_dict.pkl` (attached)

Script 1.:

```python
import pickle

from arqmath_eval import get_ndcg

with open("results_dict.pkl", "rb") as f:  # from attachment
    results = pickle.load(f)

get_ndcg(results, task='task1-votes', subset='small-validation', topn=1000)
```

outputs 0.0.
Edit: 1.1

```python
get_ndcg(results, task='task1-votes', subset='validation', topn=1000)
```

on the full `validation` subset of `task='task1-votes'` also outputs 0.0.
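A sanity check I would run first (an assumption on my side: `get_ndcg` matches topics by the same ids that `get_topics` returns, which the snippet at the end of this issue suggests are strings):

```python
from arqmath_eval import get_topics

# Hypothesis, not verified against get_ndcg internals: the judgements are
# keyed by string topic ids, while `results` uses int keys, so get_ndcg may
# see an empty intersection of topics, which would explain the flat 0.0.
topics = set(get_topics(task='task1-votes', subset='small-validation'))
print(len(topics & set(results)))            # overlap with the keys as-is
print(len(topics & set(map(str, results))))  # overlap after casting keys to str
```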
Script 2.:

```python
def report_ndcg_results(result_tsv_name: str, results: dict):
    """Dump results into the TSV run format read by arqmath_eval.evaluate."""
    with open(result_tsv_name, 'wt') as f:
        for topic, documents in results.items():
            # Keep only the 1000 most similar documents per topic, best first.
            top_documents = sorted(documents.items(), key=lambda x: x[1], reverse=True)[:1000]
            for rank, (document, similarity_score) in enumerate(top_documents):
                line = '{}\txxx\t{}\t{}\t{}\txxx'.format(topic, document, rank + 1, similarity_score)
                print(line, file=f)

report_ndcg_results("task1-votes/xstefan3/sbert_small_validation_blank_v1.0.tsv", results)
```
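For reference, each emitted line has six tab-separated columns (topic id, an unused field, document id, rank, similarity score, run tag); an illustrative line with made-up ids:

```
42	xxx	1234567	1	0.8731	xxx
```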
then

```sh
git add task1-votes/xstefan3/sbert_small_validation_blank_v1.0.tsv
python -m arqmath_eval.evaluate
cat task1-votes/xstefan3/README.md | grep small
```

outputs

```
| 0.7655 | sbert, small, validation, blank, v1.0 |
```
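To rule out a difference in my `results` dict, the same TSV can be read back and fed to `get_ndcg`; a round-trip sketch (column layout taken from `report_ndcg_results` above):

```python
from arqmath_eval import get_ndcg

# Parse the run file back into a dict; note that all ids are strings here.
parsed = {}
with open("task1-votes/xstefan3/sbert_small_validation_blank_v1.0.tsv") as f:
    for line in f:
        topic, _, document, _, score, _ = line.rstrip('\n').split('\t')
        parsed.setdefault(topic, {})[document] = float(score)

print(get_ndcg(parsed, task='task1-votes', subset='small-validation', topn=1000))
```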
If relevant, I am producing the results like this (so you can see that I am using the correct `small-validation` subset):
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from arqmath_eval import get_judged_documents, get_topics

results = {}
all_questions_ids = get_topics(task=self.task, subset=self.subset)
all_questions = {int(qid): self.post_parser.map_questions[int(qid)]
                 for qid in all_questions_ids}
for qid, question in all_questions.items():
    results[qid] = {}
    judged_answer_ids = get_judged_documents(task=self.task, subset=self.subset, topic=str(qid))
    question_e = self.model.encode([question])
    answers_bodies = [self.post_parser.map_just_answers[int(aid)].body for aid in judged_answer_ids]
    if not answers_bodies:
        print("No evaluated answers for question %s, dtype %s" % (qid, str(type(qid))))
        continue
    answers_e = self.model.encode(answers_bodies, batch_size=8)
    # Cosine similarity between the question embedding and each answer embedding.
    answers_dists = cosine_similarity(np.array(question_e), np.array(answers_e))[0]
    for aid, answer_sim in sorted(zip(judged_answer_ids, answers_dists),
                                  key=lambda aid_dist: aid_dist[1], reverse=True):
        results[qid][aid] = float(answer_sim)
```
AFAIK, I am comparing the results of `validation` and `small-validation`, but that does not explain why `small-validation` (a subset of `validation`) does not work at all. I suspect an implementation flaw in `get_ndcg`, which I need for validation (and, for that matter, I have not been using Turnus03 for a while now).
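A quick way to test this suspicion (again assuming the judgements are keyed by strings) is to re-key the very same results with string ids and re-run `get_ndcg`:

```python
# If get_ndcg then agrees with arqmath_eval.evaluate, the flaw would be a
# silent id-type mismatch rather than the NDCG computation itself.
results_str = {str(topic): {str(document): score
                            for document, score in documents.items()}
               for topic, documents in results.items()}
print(get_ndcg(results_str, task='task1-votes', subset='small-validation', topn=1000))
```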