Vectors/home
Revision as of 08:56, 11 March 2018
Background
The purpose of the NLPL repository of word vectors (which can comprise both ‘classic’ count-based and ‘modern’ dense models) is to make available a large and carefully curated collection of large-scale distributional models for many languages. For general background, please see Fares et al. (2017).
For interactive exploration and download of the repository, there is an on-line explorer. The underlying data is stored in the NLPL project directory below /projects/nlpl/data/vectors/ (on Abel) and /proj/nlpl/data/vectors/ (on Taito). The repository is versioned, in the sense that distinct release numbers are assigned to different stages of repository construction. Each repository entry is thus assigned a unique and persistent identifier; once published, a repository entry will never change (to aid replicability). The initial release (comprising some two dozen models) appeared in May 2017 as version 1.0. In March 2018, version 1.1 supersedes this initial release, adding a large number of models and languages and re-packaging the models from the original release in a more standardized format (see below).
Repository Contents
The on-line browser dynamically presents parts of the information encoded for programmatic access in the repository catalogue, which is represented as a JSON file in the top-level repository directory, with one catalogue file per repository version, e.g. /projects/nlpl/data/vectors/10.json for the initial repository release.
The catalogue contains three top-level sections, one each for corpora (data sources), algorithms (model creation tools), and models (resulting sets of word vectors). NLPL users with access to Abel (or Taito, once the repository is replicated there) can read the catalogue file directly from the project directory, for example when executing a series of experiments that make use of different pre-trained sets of word vectors.
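The three catalogue sections can be inspected programmatically with Python's standard json module. The sketch below uses a made-up miniature catalogue for illustration; the field names inside each section are assumptions, not the repository's actual schema.

```python
import json

# A minimal, made-up catalogue mirroring the three top-level sections
# described above; the per-entry fields are hypothetical.
sample = """
{
  "corpora":    [{"id": "1", "description": "hypothetical corpus entry"}],
  "algorithms": [{"id": "2", "description": "hypothetical tool entry"}],
  "models":     [{"id": "30", "corpus": "1", "algorithm": "2"}]
}
"""

catalogue = json.loads(sample)

# Iterate over the three sections named in the text.
for section in ("corpora", "algorithms", "models"):
    print(section, len(catalogue[section]))
```

On Abel, the same code would read the real catalogue with json.load() on, e.g., /projects/nlpl/data/vectors/11.json instead of parsing the sample string.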
Using NLPL Models In-Situ
To avoid data duplication, it is recommended to load models from the NLPL repository directly from the NLPL project directory when working on Abel (or Taito, once the repository is replicated there). Repository entries are uniformly packaged as .zip compressed archives, but the uniform naming scheme makes it possible to read one or more of the model files directly from the archive.
In Python, for example, something along the following lines should work to iterate over all of the entries in the model file:
  import zipfile
  import gensim

  repository = "/projects/nlpl/data/vectors/11"
  with zipfile.ZipFile(repository + "/30.zip", "r") as archive:
      stream = archive.open("model.txt")
      for line in stream:
          ...
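The model.txt read this way is in the textual word2vec format: a header line holding the vocabulary size and vector dimensionality, followed by one line per token with its vector components. A minimal sketch of parsing such lines by hand (the sample data is made up for illustration):

```python
# Hand-rolled parsing of the textual word2vec format; the byte strings
# below stand in for lines read from an archive member as sketched above.
lines = [
    b"2 3\n",                # header: 2 words, 3 dimensions
    b"hello 0.1 0.2 0.3\n",
    b"world 0.4 0.5 0.6\n",
]

vocab_size, dimensions = map(int, lines[0].split())
vectors = {}
for line in lines[1:]:
    parts = line.rstrip().split(b" ")
    word = parts[0].decode("utf-8")          # first field is the token
    vectors[word] = [float(x) for x in parts[1:]]  # the rest is its vector

print(vocab_size, dimensions, sorted(vectors))
```

In practice one would rarely parse the format by hand; the gensim loader shown next does the same work, but the structure above is what it reads.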
Alternatively, when working in a framework like gensim, the open stream can be handed to the model loader:
  model = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=False)