`get_ndcg()` produces results far from `arqmath_eval.evaluate`
I expect to get somewhat similar results when running scripts 1. and 2. below:

- setup: download `results_dict.pkl` (attached)

Script 1.:

```python
import pickle

from arqmath_eval import get_ndcg

with open("results_dict.pkl", "rb") as f:  # from attachment
    results = pickle.load(f)

get_ndcg(results, task='task1-votes', subset='small-validation', topn=1000)
```

outputs 0.0.
Edit: 1.1

```python
get_ndcg(results, task='task1-votes', subset='validation', topn=1000)
```

on the full `validation` subset of `task='task1-votes'` also outputs 0.0.
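A sanity check I would run first (an assumption on my side: `get_ndcg` matches topics by the same ids that `get_topics` returns, which the snippet at the end of this issue suggests are strings):

```python
from arqmath_eval import get_topics

# Hypothesis, not verified against get_ndcg internals: the judgements are
# keyed by string topic ids, while `results` uses int keys, so get_ndcg may
# see an empty intersection of topics, which would explain the flat 0.0.
topics = set(get_topics(task='task1-votes', subset='small-validation'))
print(len(topics & set(results)))            # overlap with the keys as-is
print(len(topics & set(map(str, results))))  # overlap after casting keys to str
```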
Script 2.:

```python
def report_ndcg_results(result_tsv_name: str, results: dict):
    """Dump results into the TSV run format read by arqmath_eval.evaluate."""
    with open(result_tsv_name, 'wt') as f:
        for topic, documents in results.items():
            # Keep only the 1000 most similar documents per topic, best first.
            top_documents = sorted(documents.items(), key=lambda x: x[1], reverse=True)[:1000]
            for rank, (document, similarity_score) in enumerate(top_documents):
                line = '{}\txxx\t{}\t{}\t{}\txxx'.format(topic, document, rank + 1, similarity_score)
                print(line, file=f)

report_ndcg_results("task1-votes/xstefan3/sbert_small_validation_blank_v1.0.tsv", results)
```
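For reference, each emitted line has six tab-separated columns (topic id, an unused field, document id, rank, similarity score, run tag); an illustrative line with made-up ids:

```
42	xxx	1234567	1	0.8731	xxx
```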
then

```sh
git add task1-votes/xstefan3/sbert_small_validation_blank_v1.0.tsv
python -m arqmath_eval.evaluate
cat task1-votes/xstefan3/README.md | grep small
```

outputs

```
| 0.7655 | sbert, small, validation, blank, v1.0 |
```
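To rule out a difference in my `results` dict, the same TSV can be read back and fed to `get_ndcg`; a round-trip sketch (column layout taken from `report_ndcg_results` above):

```python
from arqmath_eval import get_ndcg

# Parse the run file back into a dict; note that all ids are strings here.
parsed = {}
with open("task1-votes/xstefan3/sbert_small_validation_blank_v1.0.tsv") as f:
    for line in f:
        topic, _, document, _, score, _ = line.rstrip('\n').split('\t')
        parsed.setdefault(topic, {})[document] = float(score)

print(get_ndcg(parsed, task='task1-votes', subset='small-validation', topn=1000))
```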
If relevant, I am producing the results like this (so you can see that I am using the correct `small-validation` subset):
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from arqmath_eval import get_judged_documents, get_topics

results = {}
all_questions_ids = get_topics(task=self.task, subset=self.subset)
all_questions = {int(qid): self.post_parser.map_questions[int(qid)]
                 for qid in all_questions_ids}
for qid, question in all_questions.items():
    results[qid] = {}
    judged_answer_ids = get_judged_documents(task=self.task, subset=self.subset, topic=str(qid))
    question_e = self.model.encode([question])
    answers_bodies = [self.post_parser.map_just_answers[int(aid)].body for aid in judged_answer_ids]
    if not answers_bodies:
        print("No evaluated answers for question %s, dtype %s" % (qid, str(type(qid))))
        continue
    answers_e = self.model.encode(answers_bodies, batch_size=8)
    # Cosine similarity between the question embedding and each answer embedding.
    answers_dists = cosine_similarity(np.array(question_e), np.array(answers_e))[0]
    for aid, answer_sim in sorted(zip(judged_answer_ids, answers_dists),
                                  key=lambda aid_dist: aid_dist[1], reverse=True):
        results[qid][aid] = float(answer_sim)
```
AFAIK, I am comparing the results of `validation` and `small-validation`, but that does not explain why `small-validation` (a subset of `validation`) does not work at all. I suspect an implementation flaw in `get_ndcg`, which I need for validation (and, for that matter, I have not been using Turnus03 for a while now).
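A quick way to test this suspicion (again assuming the judgements are keyed by strings) is to re-key the very same results with string ids and re-run `get_ndcg`:

```python
# If get_ndcg then agrees with arqmath_eval.evaluate, the flaw would be a
# silent id-type mismatch rather than the NDCG computation itself.
results_str = {str(topic): {str(document): score
                            for document, score in documents.items()}
               for topic, documents in results.items()}
print(get_ndcg(results_str, task='task1-votes', subset='small-validation', topn=1000))
```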