Difference between revisions of "Vectors/metadata"

From Nordic Language Processing Laboratory
(Algorithms)
Line 1:
 
This page describes the fields in the [http://vectors.nlpl.eu/repository/ NLPL vector repository] catalogue. The catalogue itself is a <tt>JSON</tt> file, for example, <tt>20.json</tt> for the version 2.0 of the Repository.
 
  
All fields except "id" are optional.
+
All fields except <tt>"id"</tt> are optional.
  
 
== Algorithms ==
 
 +
This section lists the distributional algorithms.
 +
 
* "'''command'''": exact command which was run to train the model, for example, <tt>"word2vec -min-count 10 -size 100 -window 10 -negative 5 -iter 2 -threads 16 -cbow 0 -binary 0"</tt>
 
* "'''id'''": NLPL identifier of the algorithm (an integer)
+
* "'''id'''": NLPL identifier of the algorithm (integer)
 
* "'''name'''": human-readable name of the algorithm, for example, <tt>"Gensim Continuous Skipgram"</tt>
 
 
* "'''tool'''": tool used to train models with this algorithm, for example, <tt>"Gensim"</tt>
 
* "'''url'''": webpage of the tool used, for example, <tt>"https://github.com/RaRe-Technologies/gensim"</tt>
+
* "'''url'''": link to the tool used, for example, <tt>"https://github.com/RaRe-Technologies/gensim"</tt>
 
* "'''version'''": version of the tool used, for example, <tt>"3.6"</tt>
 
  
 
== Corpora ==
 
 +
This section lists the training corpora.
  
 
+
* "'''NER'''": whether multi-word named entities were detected and merged into one token (Boolean)
 +
* "'''case preserved'''": whether token case was left as is or lowered (Boolean)
 +
* "'''description'''": human-readable corpus description, for example, <tt>"Gigaword 5th Edition"</tt>
 +
* "'''id'''": NLPL identifier of the corpus (integer)
 +
* "'''language'''": 3-letter ISO language code, for example, <tt>"eng"</tt>
 +
* "'''lemmatized'''": whether the corpus is lemmatized (Boolean)
 +
* "'''public'''": whether the corpus is freely available (Boolean)
 +
* "'''stop words removal'''": if stop words were removed from the corpus, this field contains a human-readable description of the procedure (e.g., <tt>"all functional parts of speech"</tt>); otherwise, <tt>null</tt>
 +
* "'''tagger'''": human-readable description of the tagger/lemmatizer used (e.g., <tt>"Stanford Core NLP v. 3.6.0"</tt>); if no lemmatization was done, <tt>null</tt>
 +
* "'''tagset'''": human-readable description of the PoS tagset used (e.g., <tt>"UPoS"</tt>); if no tagging was done, <tt>null</tt>
 +
* "'''tokens'''": number of word tokens in the corpus (integer)
 +
* "'''tool'''": link to the tool used to generate the corpus, if any (e.g., <tt>"https://github.com/RaRe-Technologies/gensim/blob/master/gensim/scripts/segment_wiki.py"</tt>); otherwise, <tt>null</tt>
 +
* "'''url'''": link to the corpus, for example, <tt>"https://catalog.ldc.upenn.edu/LDC2011T07"</tt>
  
 
== Models ==
 

Revision as of 00:15, 23 December 2019

This page describes the fields in the NLPL vector repository catalogue. The catalogue itself is a JSON file, for example, 20.json for version 2.0 of the repository.

All fields except "id" are optional.
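
Because only "id" is guaranteed, code that reads the catalogue should treat every other field as possibly missing. The following is a minimal Python sketch of that idea. It assumes the catalogue is a JSON object whose top-level key "algorithms" holds a list of entries; that key name, the helper `field`, and the sample data are assumptions made for illustration and are not specified on this page.

```python
import json

# Hypothetical miniature catalogue; the top-level key name is an assumption.
SAMPLE = """
{
  "algorithms": [
    {"id": 0, "name": "Gensim Continuous Skipgram", "tool": "Gensim"},
    {"id": 1}
  ]
}
"""

def field(entry, key, default=None):
    """Return entry[key], or default when an optional field is absent."""
    if key == "id":
        # "id" is the only mandatory field, so a missing one raises KeyError.
        return entry["id"]
    return entry.get(key, default)

catalogue = json.loads(SAMPLE)
names = [field(a, "name", "(unnamed)") for a in catalogue["algorithms"]]
print(names)
```

Reading optional fields through a single helper keeps the "may be absent" rule in one place instead of scattering `KeyError` handling across the calling code.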

Algorithms

This section lists the distributional algorithms.

  • "command": exact command which was run to train the model, for example, "word2vec -min-count 10 -size 100 -window 10 -negative 5 -iter 2 -threads 16 -cbow 0 -binary 0"
  • "id": NLPL identifier of the algorithm (integer)
  • "name": human-readable name of the algorithm, for example, "Gensim Continuous Skipgram"
  • "tool": tool used to train models with this algorithm, for example, "Gensim"
  • "url": link to the tool used, for example, "https://github.com/RaRe-Technologies/gensim"
  • "version": version of the tool used, for example, "3.6"
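
The field list above can be turned into a simple type check. This is only a sketch: the function `check_algorithm` and the table `ALGORITHM_FIELDS` are hypothetical names, the "id" value 0 is made up, and the remaining values are the per-field examples from this page combined into one entry for illustration.

```python
# Expected Python type for each algorithm field described above.
ALGORITHM_FIELDS = {
    "command": str,
    "id": int,
    "name": str,
    "tool": str,
    "url": str,
    "version": str,  # a string such as "3.6", not a number
}

def check_algorithm(entry):
    """Return a list of problems found in one algorithm entry."""
    problems = []
    if "id" not in entry:
        problems.append("missing mandatory field: id")
    for key, value in entry.items():
        expected = ALGORITHM_FIELDS.get(key)
        if expected is None:
            problems.append(f"unknown field: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif not isinstance(value, expected) or isinstance(value, bool):
            problems.append(f"{key}: expected {expected.__name__}")
    return problems

good = {
    "id": 0,  # made-up identifier
    "name": "Gensim Continuous Skipgram",
    "tool": "Gensim",
    "url": "https://github.com/RaRe-Technologies/gensim",
    "version": "3.6",
}
bad = {"name": "Gensim Continuous Skipgram", "version": 3.6}

print(check_algorithm(good))
print(check_algorithm(bad))
```

The first entry omits "command", which is allowed because the field is optional; the second is reported for its missing "id" and for a numeric "version".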

Corpora

This section lists the training corpora.

  • "NER": whether multi-word named entities were detected and merged into one token (Boolean)
  • "case preserved": whether token case was left as is rather than lowercased (Boolean)
  • "description": human-readable corpus description, for example, "Gigaword 5th Edition"
  • "id": NLPL identifier of the corpus (integer)
  • "language": 3-letter ISO language code, for example, "eng"
  • "lemmatized": whether the corpus is lemmatized (Boolean)
  • "public": whether the corpus is freely available (Boolean)
  • "stop words removal": if stop words were removed from the corpus, this field contains a human-readable description of the procedure (e.g., "all functional parts of speech"); otherwise, null
  • "tagger": human-readable description of the tagger/lemmatizer used (e.g., "Stanford Core NLP v. 3.6.0"); if no lemmatization was done, null
  • "tagset": human-readable description of the PoS tagset used (e.g., "UPoS"); if no tagging was done, null
  • "tokens": number of word tokens in the corpus (integer)
  • "tool": link to the tool used to generate the corpus, if any (e.g., "https://github.com/RaRe-Technologies/gensim/blob/master/gensim/scripts/segment_wiki.py"); otherwise, null
  • "url": link to the corpus, for example, "https://catalog.ldc.upenn.edu/LDC2011T07"
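
Several corpus fields hold either a description or null, so a reader has to tell "step not applied" apart from a real value. The sketch below shows one way to do that in Python. The function `describe` is a hypothetical helper, the two entries are invented sample data (string values are this page's examples, while the identifiers and Boolean flags are made up), and a missing field is treated the same as null.

```python
import json

# Invented sample entries; JSON null becomes None after parsing.
SAMPLE_CORPORA = json.loads("""
[
  {"id": 0, "description": "Gigaword 5th Edition", "language": "eng",
   "lemmatized": true, "NER": false,
   "tagger": "Stanford Core NLP v. 3.6.0", "tagset": "UPoS",
   "stop words removal": "all functional parts of speech",
   "tool": null,
   "url": "https://catalog.ldc.upenn.edu/LDC2011T07"},
  {"id": 1, "language": "eng", "lemmatized": false,
   "tagger": null, "tagset": null, "stop words removal": null}
]
""")

def describe(corpus):
    """Summarise the preprocessing recorded for one corpus entry."""
    steps = []
    if corpus.get("lemmatized"):
        steps.append("lemmatized")
    if corpus.get("tagset") is not None:
        steps.append("tagged with " + corpus["tagset"])
    if corpus.get("stop words removal") is not None:
        steps.append("stop words removed")
    if corpus.get("NER"):
        steps.append("named entities merged")
    return steps or ["no recorded preprocessing"]

for corpus in SAMPLE_CORPORA:
    print(corpus["id"], describe(corpus))
```

Using `dict.get` means an absent optional field and an explicit null are handled by the same branch, which matches the rule that every field except "id" may be omitted.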

Models