Commit 97b2d1a0 authored by Vít Novotný

Evaluate math concatenation

parent 35b847ed
Pipeline #60694 canceled with stage
The system recognizes the following parameters:
The [SCM system][scm-at-arqmath] recognizes the following parameters:
- Dataset:
- arxmliv, 08, 2019, no-problem – the no\_problem subset (150,701 documents) of [the arXMLiv 08.2019 dataset][arxmliv-08-2019]
- phrases – whether phrases are modeled
- concat-math – whether adjacent math tokens are concatenated into mathematical expressions
- phrases – how many times [collocation detection][] and bigram merging are iteratively applied to the corpus:
- 0 – the text and math tokens in the corpus are unchanged,
- N – [collocation detection][] and bigram merging are iteratively applied to both text and math tokens in the corpus N times
- Math representation:
- opt – paths in operator tree
- slt – paths in syntax layout tree
@@ -16,13 +19,13 @@ The system recognizes the following parameters:
- iter – the number of epochs
- min-alpha – minimum learning rate
- min-n, max-n – the range of modeled subword sizes
- min-count – minimum term frequency
- min-count – the minimum term frequency
- negative – the number of negative samples
- sample – sampling threshold
- sg – the skipgram model
- size – vector dimensions
- window – window size
- workers – the number of threads used in HogWild
- workers – the number of threads used for [hogwild][]
- Soft Cosine Measure:
- dominant – whether the term similarity matrix will be strongly diagonally dominant
- nonzero-limit – the maximum number of non-zero elements outside the diagonal in a single column of the term similarity matrix
@@ -31,4 +34,7 @@ The system recognizes the following parameters:
- threshold – parameter *t* in the [term similarity matrix formula][]
[arxmliv-08-2019]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/
[collocation detection]: https://radimrehurek.com/gensim/models/phrases.html
[hogwild]: https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent
[scm-at-arqmath]: https://gitlab.fi.muni.cz/xnovot32/scm-at-arqmath (Soft Cosine Measure at ARQMath)
[term similarity matrix formula]: https://arxiv.org/pdf/2003.05019.pdf#page=4
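
To illustrate what the concat-math parameter describes, here is a minimal Python sketch that merges each run of adjacent math tokens into a single token. The `math:` prefix used to mark math tokens, the space used as a joiner, and the function name `concat_math` are assumptions made for this illustration; the commit does not show how the SCM system represents or joins math tokens.

```python
from itertools import groupby

MATH_PREFIX = "math:"  # hypothetical marker for math tokens


def is_math(token):
    """Return True if the token carries the (assumed) math prefix."""
    return token.startswith(MATH_PREFIX)


def concat_math(tokens):
    """Merge every run of adjacent math tokens into one math token."""
    merged = []
    for math_run, group in groupby(tokens, key=is_math):
        group = list(group)
        if math_run:
            # Strip the prefix from each token and join the run with spaces.
            body = " ".join(t[len(MATH_PREFIX):] for t in group)
            merged.append(MATH_PREFIX + body)
        else:
            # Text tokens are passed through unchanged.
            merged.extend(group)
    return merged
```

With concat-math=False, the token stream would be left as it is; with concat-math=True, a call such as `concat_math(["let", "math:x", "math:+", "math:y"])` would collapse the three math tokens into one.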
@@ -4,21 +4,25 @@ underscores (`_`) replaced with a comma and a space for improved readability.
| nDCG | Result name |
|------|:------------|
| 0.7613 | arxmliv, infix, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7612 | arxmliv, prefix, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7607 | arxmliv, slt, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7606 | arxmliv, opt, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7602 | arxmliv, latex, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7600 | arxmliv, nomath, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7613 | infix, concat-math=False, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7612 | prefix, concat-math=False, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7607 | slt, concat-math=False, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7606 | opt, concat-math=False, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7603 | infix, concat-math=True, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7602 | latex, concat-math=False, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7600 | nomath, concat-math=False, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| *0.7578* | *random* |
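
The result names in the table are comma-separated lists that mix bare labels (such as `infix`) with `key=value` settings. A minimal Python sketch of how such a name could be split into its parts follows; the function name `parse_result_name` is hypothetical, and values are kept as strings because the commit does not specify their types.

```python
def parse_result_name(name):
    """Split a result name into bare labels and key=value settings."""
    labels, params = [], {}
    for part in name.split(", "):
        key, sep, value = part.partition("=")
        if sep:
            # A setting such as "phrases=0"; the value stays a string.
            params[key] = value
        else:
            # A bare label such as "infix" or "no-problem".
            labels.append(part)
    return labels, params


labels, params = parse_result_name("infix, concat-math=False, phrases=0, alpha=0.05")
```

Grouping rows by `params["concat-math"]` would then make it easier to compare runs with and without math concatenation.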
## Legend
The system recognizes the following parameters:
The [SCM system][scm-at-arqmath] recognizes the following parameters:
- Dataset:
- arxmliv, 08, 2019, no-problem – the no\_problem subset (150,701 documents) of [the arXMLiv 08.2019 dataset][arxmliv-08-2019]
- phrases – whether phrases are modeled
- concat-math – whether adjacent math tokens are concatenated into mathematical expressions
- phrases – how many times [collocation detection][] and bigram merging are iteratively applied to the corpus:
- 0 – the text and math tokens in the corpus are unchanged,
- N – [collocation detection][] and bigram merging are iteratively applied to both text and math tokens in the corpus N times
- Math representation:
- opt – paths in operator tree
- slt – paths in syntax layout tree
@@ -32,13 +36,13 @@ The system recognizes the following parameters:
- iter – the number of epochs
- min-alpha – minimum learning rate
- min-n, max-n – the range of modeled subword sizes
- min-count – minimum term frequency
- min-count – the minimum term frequency
- negative – the number of negative samples
- sample – sampling threshold
- sg – the skipgram model
- size – vector dimensions
- window – window size
- workers – the number of threads used in HogWild
- workers – the number of threads used for [hogwild][]
- Soft Cosine Measure:
- dominant – whether the term similarity matrix will be strongly diagonally dominant
- nonzero-limit – the maximum number of non-zero elements outside the diagonal in a single column of the term similarity matrix
@@ -47,4 +51,7 @@ The system recognizes the following parameters:
- threshold – parameter *t* in the [term similarity matrix formula][]
[arxmliv-08-2019]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/
[collocation detection]: https://radimrehurek.com/gensim/models/phrases.html
[hogwild]: https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent
[scm-at-arqmath]: https://gitlab.fi.muni.cz/xnovot32/scm-at-arqmath (Soft Cosine Measure at ARQMath)
[term similarity matrix formula]: https://arxiv.org/pdf/2003.05019.pdf#page=4
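
For readers unfamiliar with the Soft Cosine Measure named in the legend, the sketch below computes it in plain Python from its standard definition, `x^T S y / (sqrt(x^T S x) * sqrt(y^T S y))`, where `S` is a term similarity matrix. The function names are hypothetical and any matrix passed in is the caller's own; the SCM system builds its matrix from word embeddings using the parameters listed above, and that construction is not reproduced here.

```python
import math


def bilinear(x, S, y):
    """Compute x^T S y for dense vectors x, y and a dense matrix S."""
    return sum(
        x[i] * S[i][j] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


def soft_cosine(x, y, S):
    """Soft cosine measure of x and y under the term similarity matrix S."""
    norm = math.sqrt(bilinear(x, S, x)) * math.sqrt(bilinear(y, S, y))
    # Return 0.0 for zero vectors instead of dividing by zero.
    return bilinear(x, S, y) / norm if norm else 0.0
```

With the identity matrix as `S`, this reduces to the ordinary cosine similarity; off-diagonal entries let documents that share no terms but use related terms still receive a non-zero score.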