Commit 08cbfc81 authored by stefanik12

xstefan3 readme merge

parents 9240fcd6 ae012ba8
+9 −4
The system recogizes the following parameters:
The [SCM system][scm-at-arqmath] recogizes the following parameters:


- Dataset:
- Dataset:
  - arxmliv, 08, 2019, no-problem – the no\_problem subset (150,701 documents) of [the arXMLiv 08.2019 dataset][arxmliv-08-2019]
  - arxmliv, 08, 2019, no-problem – the no\_problem subset (150,701 documents) of [the arXMLiv 08.2019 dataset][arxmliv-08-2019]
  - phrases – whether phrases are modeled
  - phrases – how many times [collocation detection][] and bigram merging are iteratively applied to the corpus:
    - 0 – the text and math tokens in the corpus are unchanged,
    - N –  [collocation detection][] and bigram merging are iteratively applied to both text and math tokens in the corpus N times
- Math representation:
- Math representation:
  - opt – paths in operator tree
  - opt – paths in operator tree
  - slt – paths in syntax layout tree
  - slt – paths in syntax layout tree
@@ -16,13 +18,13 @@ The system recogizes the following parameters:
  - iter – the number of epochs
  - iter – the number of epochs
  - min-alpha – minimum learning rate
  - min-alpha – minimum learning rate
  - min-n, max-n – the range of modeled subword sizes
  - min-n, max-n – the range of modeled subword sizes
  - min-count – minimum term frequency
  - min-count – the minimum term frequency
  - negative – the number of negative samples
  - negative – the number of negative samples
  - sample – sampling threshold
  - sample – sampling threshold
  - sg – the skipgram model
  - sg – the skipgram model
  - size – vector dimensions
  - size – vector dimensions
  - window – window size
  - window – window size
  - workers – the number of threads used in HogWild
  - workers – the number of threads used for [hogwild][]
- Soft Cosine Measure:
- Soft Cosine Measure:
  - dominant – whether the term similarity matrix will be strongly diagonally dominant
  - dominant – whether the term similarity matrix will be strongly diagonally dominant
  - nonzero-limit – the maximum number of non-zero elements outside the diagonal in a single column of the term similarity matrix
  - nonzero-limit – the maximum number of non-zero elements outside the diagonal in a single column of the term similarity matrix
@@ -31,4 +33,7 @@ The system recogizes the following parameters:
  - threshold – parameter *t* in the [term similarity matrix formula][]
  - threshold – parameter *t* in the [term similarity matrix formula][]


 [arxmliv-08-2019]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/
 [arxmliv-08-2019]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/
 [collocation detection]: https://radimrehurek.com/gensim/models/phrases.html
 [hogwild]: https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent
 [scm-at-arqmath]: https://gitlab.fi.muni.cz/xnovot32/scm-at-arqmath (Soft Cosine Measure at ARQMath)
 [term similarity matrix formula]: https://arxiv.org/pdf/2003.05019.pdf#page=4
 [term similarity matrix formula]: https://arxiv.org/pdf/2003.05019.pdf#page=4
+16 −10
@@ -4,21 +4,24 @@ underscores (`_`) replaced with a comma and a space for improved readability.


| nDCG | Result name |
| nDCG | Result name |
|------|:------------|
|------|:------------|
| 0.7613 | arxmliv, infix, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7614 | infix, phrases=1, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7612 | arxmliv, prefix, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7613 | infix, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7607 | arxmliv, slt, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7612 | prefix, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7606 | arxmliv, opt, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7607 | slt, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7602 | arxmliv, latex, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7606 | opt, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7600 | arxmliv, nomath, 08, 2019, no-problem, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7602 | latex, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| 0.7600 | nomath, phrases=0, alpha=0.05, bucket=2000000, iter=5, max-n=6, min-alpha=0, min-count=5, min-n=3, negative=5, sample=0.0001, sg=1, size=300, window=5, workers=64, dominant=True, nonzero-limit=100, symmetric=True, exponent=4.0, threshold=-1.0 |
| *0.7578* | *random* |
| *0.7578* | *random* |
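
Because each result name in the table is a comma-separated list of bare tokens and `key=value` pairs, it can be split back into a configuration mechanically. The following is a minimal sketch, not part of the repository; the function names `parse_result_name` and `_convert` are hypothetical and introduced only for illustration.

```python
def _convert(raw):
    """Convert a raw value string to bool, int, or float where possible."""
    if raw in ("True", "False"):
        return raw == "True"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # leave anything else as a string


def parse_result_name(name):
    """Split a result name into bare tokens and key=value parameters.

    Hypothetical helper: assumes tokens are separated by ", " as in the table.
    """
    positional, params = [], {}
    for token in name.split(", "):
        if "=" in token:
            key, raw = token.split("=", 1)
            params[key] = _convert(raw)
        else:
            positional.append(token)
    return positional, params


if __name__ == "__main__":
    tokens, params = parse_result_name(
        "infix, phrases=1, alpha=0.05, dominant=True, threshold=-1.0"
    )
    print(tokens)
    print(params)
```

Bare tokens such as `infix` or `08` are kept as strings, so the dataset and math representation labels can be matched against the legend unchanged.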


## Legend
## Legend


The system recogizes the following parameters:
The [SCM system][scm-at-arqmath] recogizes the following parameters:


- Dataset:
- Dataset:
  - arxmliv, 08, 2019, no-problem – the no\_problem subset (150,701 documents) of [the arXMLiv 08.2019 dataset][arxmliv-08-2019]
  - arxmliv, 08, 2019, no-problem – the no\_problem subset (150,701 documents) of [the arXMLiv 08.2019 dataset][arxmliv-08-2019]
  - phrases – whether phrases are modeled
  - phrases – how many times [collocation detection][] and bigram merging are iteratively applied to the corpus:
    - 0 – the text and math tokens in the corpus are unchanged,
    - N –  [collocation detection][] and bigram merging are iteratively applied to both text and math tokens in the corpus N times
- Math representation:
- Math representation:
  - opt – paths in operator tree
  - opt – paths in operator tree
  - slt – paths in syntax layout tree
  - slt – paths in syntax layout tree
@@ -32,13 +35,13 @@ The system recogizes the following parameters:
  - iter – the number of epochs
  - iter – the number of epochs
  - min-alpha – minimum learning rate
  - min-alpha – minimum learning rate
  - min-n, max-n – the range of modeled subword sizes
  - min-n, max-n – the range of modeled subword sizes
  - min-count – minimum term frequency
  - min-count – the minimum term frequency
  - negative – the number of negative samples
  - negative – the number of negative samples
  - sample – sampling threshold
  - sample – sampling threshold
  - sg – the skipgram model
  - sg – the skipgram model
  - size – vector dimensions
  - size – vector dimensions
  - window – window size
  - window – window size
  - workers – the number of threads used in HogWild
  - workers – the number of threads used for [hogwild][]
- Soft Cosine Measure:
- Soft Cosine Measure:
  - dominant – whether the term similarity matrix will be strongly diagonally dominant
  - dominant – whether the term similarity matrix will be strongly diagonally dominant
  - nonzero-limit – the maximum number of non-zero elements outside the diagonal in a single column of the term similarity matrix
  - nonzero-limit – the maximum number of non-zero elements outside the diagonal in a single column of the term similarity matrix
@@ -47,4 +50,7 @@ The system recogizes the following parameters:
  - threshold – parameter *t* in the [term similarity matrix formula][]
  - threshold – parameter *t* in the [term similarity matrix formula][]
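
To illustrate what the `phrases` parameter means, the sketch below applies bigram merging to a toy corpus N times. It is a simplified stand-in and not the system's implementation: instead of the statistical scoring used by the linked collocation detection, it merges any adjacent pair that occurs at least `min_count` times. The names `merge_bigrams_once` and `apply_phrases`, the `min_count` default, and the `_` delimiter are all assumptions made for this example.

```python
from collections import Counter


def merge_bigrams_once(sentences, min_count=2, delimiter="_"):
    """One pass: join adjacent token pairs seen at least min_count times.

    Simplification: a raw count threshold replaces real collocation scoring.
    """
    pair_counts = Counter(
        pair for sentence in sentences for pair in zip(sentence, sentence[1:])
    )
    merged_sentences = []
    for sentence in sentences:
        merged, i = [], 0
        while i < len(sentence):
            pair = tuple(sentence[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                merged.append(delimiter.join(pair))
                i += 2  # both tokens were consumed by the merge
            else:
                merged.append(sentence[i])
                i += 1
        merged_sentences.append(merged)
    return merged_sentences


def apply_phrases(sentences, n):
    """Apply the merging pass n times; n=0 leaves the corpus unchanged."""
    for _ in range(n):
        sentences = merge_bigrams_once(sentences)
    return sentences


if __name__ == "__main__":
    corpus = [["soft", "cosine", "measure"], ["soft", "cosine", "measure"]]
    for n in range(3):
        print(n, apply_phrases(corpus, n))
```

Each additional pass can join tokens that were themselves produced by an earlier merge, which is why repeated application can yield phrases longer than two original tokens.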


 [arxmliv-08-2019]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/
 [arxmliv-08-2019]: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082019/
 [collocation detection]: https://radimrehurek.com/gensim/models/phrases.html
 [hogwild]: https://papers.nips.cc/paper/4390-hogwild-a-lock-free-approach-to-parallelizing-stochastic-gradient-descent
 [scm-at-arqmath]: https://gitlab.fi.muni.cz/xnovot32/scm-at-arqmath (Soft Cosine Measure at ARQMath)
 [term similarity matrix formula]: https://arxiv.org/pdf/2003.05019.pdf#page=4
 [term similarity matrix formula]: https://arxiv.org/pdf/2003.05019.pdf#page=4
+0 −0

File moved.

+129381 −0

File added.

Preview size limit exceeded, changes collapsed.

+0 −0

File moved.
