Vectors/metadata
This page describes the fields of the NLPL vector repository catalogue. The catalogue itself is a JSON file, for example, 20.json for version 2.0 of the repository.
All fields except "id" are optional.
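Because only "id" is mandatory, a consumer of the catalogue may want to check for it before relying on an entry. The following is a minimal sketch in Python; the miniature catalogue, its top-level keys ("algorithms", "corpora") and the list-of-objects layout are assumptions made for illustration and are not taken from the real 20.json. The helper name entries_missing_id is likewise hypothetical.

```python
import json

# Hypothetical miniature catalogue. The top-level keys and the list layout
# are assumptions for illustration, not the documented structure of 20.json.
CATALOGUE_JSON = """
{
  "algorithms": [{"id": 0, "name": "Gensim Continuous Skipgram"}],
  "corpora": [{"id": 0, "language": "eng"}, {"description": "entry without id"}]
}
"""


def entries_missing_id(catalogue):
    """Return (section, position) pairs for entries lacking the mandatory "id"."""
    missing = []
    for section, entries in catalogue.items():
        for position, entry in enumerate(entries):
            if "id" not in entry:
                missing.append((section, position))
    return missing


catalogue = json.loads(CATALOGUE_JSON)
print(entries_missing_id(catalogue))  # → [('corpora', 1)]
```

All other fields should be read defensively, for example with dict.get, since any of them may be absent.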
Algorithms
This section lists the distributional algorithms.
- "command": exact command which was run to train the model, for example, "word2vec -min-count 10 -size 100 -window 10 -negative 5 -iter 2 -threads 16 -cbow 0 -binary 0"
- "id": NLPL identifier of the algorithm (integer)
- "name": human-readable name of the algorithm, for example, "Gensim Continuous Skipgram"
- "tool": tool used to train models with this algorithm, for example, "Gensim"
- "url": link to the tool used, for example, "https://github.com/RaRe-Technologies/gensim"
- "version": version of the tool used, for example, "3.6"
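Since "command" stores the exact training command, its hyperparameters can be recovered programmatically. The sketch below splits the example command from this section into a program name and a flag-to-value mapping. It assumes that every flag starts with "-" and takes exactly one value, which holds for the example above but may not hold for every entry in the catalogue; parse_command is a hypothetical helper, not part of any NLPL tooling.

```python
import shlex

# Command string copied from the "command" example in this section.
COMMAND = (
    "word2vec -min-count 10 -size 100 -window 10 -negative 5 "
    "-iter 2 -threads 16 -cbow 0 -binary 0"
)


def parse_command(command):
    """Split a training command into the program name and a flag -> value dict.

    Assumption: every flag starts with "-" and is followed by exactly one value.
    """
    program, *args = shlex.split(command)
    if len(args) % 2 != 0:
        raise ValueError("expected flag/value pairs")
    options = {}
    for flag, value in zip(args[0::2], args[1::2]):
        if not flag.startswith("-"):
            raise ValueError(f"unexpected token {flag!r}")
        options[flag.lstrip("-")] = value
    return program, options


program, options = parse_command(COMMAND)
print(program, options["size"], options["window"])  # → word2vec 100 10
```

Values are kept as strings here; converting them to integers is left to the caller, because the catalogue does not specify flag types.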
Corpora
This section lists the training corpora.
- "NER": whether multi-word named entities were detected and merged into one token (Boolean)
- "case preserved": whether token case was left as is (true) or the tokens were lowercased (false) (Boolean)
- "description": human-readable corpus description, for example, "Gigaword 5th Edition"
- "id": NLPL identifier of the corpus (integer)
- "language": 3-letter ISO language code, for example, "eng"
- "lemmatized": whether the corpus is lemmatized (Boolean)
- "public": whether the corpus is freely available (Boolean)
- "stop words removal": if stop words were removed from the corpus, this field contains a human-readable description of the procedure (e.g., "all functional parts of speech"); otherwise, null
- "tagger": human-readable description of the tagger/lemmatizer used (e.g., "Stanford Core NLP v. 3.6.0"); if no lemmatization was done, null
- "tagset": human-readable description of the PoS tagset used (e.g., "UPoS"); if no tagging was done, null
- "tokens": number of word tokens in the corpus (integer)
- "tool": link to the tool used to generate the corpus, if any (e.g., "https://github.com/RaRe-Technologies/gensim/blob/master/gensim/scripts/segment_wiki.py"); otherwise, null
- "url": link to the corpus, for example, "https://catalog.ldc.upenn.edu/LDC2011T07"
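Several corpus fields use null to signal that a processing step was not applied, while others are plain Booleans, so the two kinds need slightly different handling. The following sketch shows one way to summarize the preprocessing of a corpus entry. The entry is hypothetical: the string values come from the examples above, but the "id", "tokens" and Boolean values are invented for illustration, and summarize_preprocessing is not part of any NLPL tooling.

```python
import json

# Hypothetical corpus entry. String values are taken from the examples above;
# "id", "tokens" and the Boolean values are invented for illustration.
CORPUS_JSON = """
{
  "id": 1,
  "description": "Gigaword 5th Edition",
  "language": "eng",
  "lemmatized": true,
  "case preserved": false,
  "NER": false,
  "public": false,
  "stop words removal": null,
  "tagger": "Stanford Core NLP v. 3.6.0",
  "tagset": "UPoS",
  "tokens": 1000000,
  "tool": null,
  "url": "https://catalog.ldc.upenn.edu/LDC2011T07"
}
"""


def summarize_preprocessing(corpus):
    """List the preprocessing steps recorded for a corpus entry."""
    steps = []
    # Boolean fields: absent fields are treated as "unknown" and skipped.
    if corpus.get("lemmatized"):
        steps.append("lemmatized")
    if corpus.get("case preserved") is False:
        steps.append("lowercased")
    if corpus.get("NER"):
        steps.append("named entities merged")
    # Nullable description fields: null (None) means the step was not applied.
    if corpus.get("stop words removal") is not None:
        steps.append("stop words removed: " + corpus["stop words removal"])
    if corpus.get("tagset") is not None:
        steps.append("PoS-tagged with " + corpus["tagset"])
    return steps


corpus = json.loads(CORPUS_JSON)
steps = summarize_preprocessing(corpus)
print(steps)  # → ['lemmatized', 'lowercased', 'PoS-tagged with UPoS']
```

Note that "case preserved" is compared with False explicitly, so that an entry omitting the field is not reported as lowercased.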