Vectors/metadata
Latest revision as of 21:35, 28 December 2019
This page describes the fields in the catalogue of the NLPL vector repository (http://vectors.nlpl.eu/repository/). The catalogue itself is a JSON file; for example, 20.json is the catalogue for version 2.0 of the repository.
All fields except "id" are optional.
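As a minimal sketch of how this rule could be handled when reading the catalogue in Python: the snippet below parses a small inline JSON string rather than the real 20.json. It assumes that the catalogue holds three top-level lists named "algorithms", "corpora" and "models"; that layout, the helper name entries_missing_id and all sample values are assumptions made for illustration, not taken from this page.

```python
import json

# Hypothetical miniature catalogue. The top-level layout (three lists keyed
# "algorithms", "corpora", "models") is an assumption, and so are all values.
CATALOGUE_TEXT = """
{
  "algorithms": [{"id": 0, "name": "Gensim Continuous Skipgram"}],
  "corpora": [{"id": 0, "language": "eng"}, {"description": "entry without id"}],
  "models": [{"id": 1, "algorithm": 0, "corpus": [0]}]
}
"""


def entries_missing_id(catalogue):
    """Return (section, index) pairs for entries that lack the mandatory "id"."""
    missing = []
    for section, entries in catalogue.items():
        for index, entry in enumerate(entries):
            if "id" not in entry:
                missing.append((section, index))
    return missing


catalogue = json.loads(CATALOGUE_TEXT)
missing = entries_missing_id(catalogue)

# Every other field is optional, so read it with .get() instead of indexing.
tokens = catalogue["corpora"][0].get("tokens")  # None: the field is absent

print(missing, tokens)
```

Reading optional fields with dict.get keeps the code from raising KeyError on entries that omit them.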
Algorithms
This section lists the distributional algorithms.
- "command": exact command which was run to train the model, for example, "word2vec -min-count 10 -size 100 -window 10 -negative 5 -iter 2 -threads 16 -cbow 0 -binary 0"
- "id": NLPL identifier of the algorithm (integer)
- "name": human-readable name of the algorithm, for example, "Gensim Continuous Skipgram"
- "tool": tool used to train models with this algorithm, for example, "Gensim"
- "url": link to the tool used, for example, "https://github.com/RaRe-Technologies/gensim"
- "version": version of the tool used, for example, "3.6"
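The "command" field can be turned back into individual hyperparameters. The following is only a sketch: it assumes that every flag is followed by exactly one value, as in the word2vec example above, which need not hold for commands of other tools. The helper name parse_command and the "id" value are hypothetical.

```python
import shlex


def parse_command(command):
    """Split a training command into the tool name and a flag -> value dict.

    Assumes each flag is followed by exactly one value (true for the word2vec
    example on this page, not guaranteed for other tools).
    """
    tokens = shlex.split(command)
    tool, args = tokens[0], tokens[1:]
    flags = {}
    for flag, value in zip(args[0::2], args[1::2]):
        flags[flag.lstrip("-")] = value
    return tool, flags


# Hypothetical algorithm entry; only the command string is taken from the page.
algorithm = {
    "id": 0,
    "command": "word2vec -min-count 10 -size 100 -window 10 -negative 5 "
               "-iter 2 -threads 16 -cbow 0 -binary 0",
}

tool, flags = parse_command(algorithm["command"])
print(tool, flags["size"], flags["window"])
```

The values stay strings on purpose, since the catalogue does not say which flags are numeric.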
Corpora
This section lists the training corpora.
- "NER": whether multi-word named entities were detected and merged into one token (Boolean)
- "case preserved": whether token case was kept as is rather than lowercased (Boolean)
- "description": human-readable corpus description, for example, "Gigaword 5th Edition"
- "id": NLPL identifier of the corpus (integer)
- "language": 3-letter ISO language code, for example, "eng"
- "lemmatized": whether the corpus is lemmatized (Boolean)
- "public": whether the corpus is freely available (Boolean)
- "stop words removal": if stop words were removed from the corpus, this field contains a human-readable description of the procedure (e.g., "all functional parts of speech"); otherwise, null
- "tagger": human-readable description of the tagger/lemmatizer used (e.g., "Stanford Core NLP v. 3.6.0"); if no lemmatization was done, null
- "tagset": human-readable description of the PoS tagset used (e.g., "UPoS"); if no tagging was done, null
- "tokens": number of word tokens in the corpus (integer)
- "tool": link to the tool used to generate the corpus, if any (e.g., "https://github.com/RaRe-Technologies/gensim/blob/master/gensim/scripts/segment_wiki.py"); otherwise, null
- "url": link to the corpus, for example, "https://catalog.ldc.upenn.edu/LDC2011T07"
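The corpus fields lend themselves to simple filtering. The sketch below selects corpora by language and lemmatization status and shows that a JSON null (for example in "tagset") arrives in Python as None, which is distinct from a field that is absent altogether. All entries and the helper name select_corpora are hypothetical; only the field names come from the list above.

```python
import json

# Hypothetical corpus entries; only the field names come from the list above.
CORPORA_TEXT = """
[
  {"id": 0, "language": "eng", "lemmatized": true,
   "tagger": "Stanford Core NLP v. 3.6.0", "tagset": "UPoS"},
  {"id": 1, "language": "eng", "lemmatized": false,
   "tagger": null, "tagset": null, "tokens": 1000},
  {"id": 2, "language": "nob"}
]
"""


def select_corpora(corpora, language, lemmatized=None):
    """Return ids of corpora in `language`, optionally filtered on lemmatization.

    Entries that omit "lemmatized" are skipped when a filter is requested,
    because an absent optional field gives no information either way.
    """
    selected = []
    for corpus in corpora:
        if corpus.get("language") != language:
            continue
        if lemmatized is not None and corpus.get("lemmatized") is not lemmatized:
            continue
        selected.append(corpus["id"])
    return selected


corpora = json.loads(CORPORA_TEXT)
english = select_corpora(corpora, "eng")
english_raw = select_corpora(corpora, "eng", lemmatized=False)

# "tagset": null means no tagging was done; a missing key means "not stated".
untagged = [c["id"] for c in corpora if "tagset" in c and c["tagset"] is None]

print(english, english_raw, untagged)
```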
Models
This section lists the distributional models themselves.
- "algorithm": NLPL identifier of the training algorithm used (integer)
- "contents": a list of files stored in the model ZIP archive; each file is described as a dictionary with the filename and format fields
- "corpus": a list of integer NLPL identifiers of the training corpora, for example, [0, 3]
- "creators": a list of persons who trained the model; each person is described as a dictionary with the email and name fields
- "dimensions": dimensionality of the word representations in the model (integer)
- "documentation": usually a link to the details about the model or to the code to run it, for example, "https://github.com/ltgoslo/simple_elmo"
- "handle": persistent NLPL handle of the model, for example, "http://vectors.nlpl.eu/repository/20/1.zip"
- "id": persistent NLPL identifier of the model (integer)
- "iterations": number of epochs (passes over the corpus) made during training (integer)
- "vocabulary size": the number of words in the model vocabulary, if applicable (integer)
- "window": context window size used during training, if applicable (integer)
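Because a model refers to its algorithm and corpora by identifier, a reader of the catalogue typically has to join the three sections. A minimal sketch of such a join follows; it again assumes top-level lists named "algorithms", "corpora" and "models", and the helper name describe_model, the file name and all field values are hypothetical.

```python
# Hypothetical catalogue fragment; the top-level key names and all values
# are assumptions for illustration.
catalogue = {
    "algorithms": [{"id": 0, "name": "Gensim Continuous Skipgram"}],
    "corpora": [
        {"id": 0, "description": "Gigaword 5th Edition"},
        {"id": 3, "description": "hypothetical second corpus"},
    ],
    "models": [
        {
            "id": 1,
            "algorithm": 0,
            "corpus": [0, 3],
            "dimensions": 100,
            "handle": "http://vectors.nlpl.eu/repository/20/1.zip",
            "contents": [{"filename": "model.txt", "format": "text"}],
        }
    ],
}


def describe_model(model, catalogue):
    """Resolve the algorithm and corpus identifiers of a model entry."""
    algorithms = {a["id"]: a for a in catalogue.get("algorithms", [])}
    corpora = {c["id"]: c for c in catalogue.get("corpora", [])}
    algorithm = algorithms.get(model.get("algorithm"), {})
    return {
        "id": model["id"],
        "algorithm": algorithm.get("name"),
        # Unknown corpus identifiers are silently dropped in this sketch.
        "corpora": [
            corpora[i].get("description")
            for i in model.get("corpus", [])
            if i in corpora
        ],
        "files": [f.get("filename") for f in model.get("contents", [])],
        "handle": model.get("handle"),
    }


summary = describe_model(catalogue["models"][0], catalogue)
print(summary)
```

Indexing the algorithms and corpora by "id" first keeps each lookup constant-time, and the .get() calls reflect that every field other than "id" may be missing.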