Vectors/home

Background

The purpose of the NLPL repository of word vectors (which can comprise both ‘classic’ count-based and ‘modern’ dense models) is to make available a large, carefully curated collection of large-scale distributional models for many languages. For general background, please see Fares et al. (2017).

For interactive exploration and download of the repository, there is an on-line explorer (http://vectors.nlpl.eu/repository/). The underlying data is stored in the NLPL project directory below /projects/nlpl/data/vectors/ (on Abel) and /proj/nlpl/data/vectors/ (on Taito). The repository is versioned, in the sense of assigning release numbers to different stages of repository construction. Each repository entry is thus assigned a unique and persistent identifier; once published, a repository entry will never change (to aid replicability). The initial release (providing some two dozen models) was published in May 2017 as version 1.0. In March 2018, version 1.1 superseded this initial release, adding a large number of models and languages (including those from the UD parsing task, http://hdl.handle.net/11234/1-1989) and re-packaging the models from the original release in a more standardized format (see below).

Repository Contents

The on-line browser dynamically presents parts of the information encoded for programmatic access in the repository catalogue, which is represented as a JSON file in the top-level repository directory; there is one catalogue file per repository version, e.g. /projects/nlpl/data/vectors/10.json (on Abel) for the initial repository release.

The catalogue contains three top-level sections, one each for corpora (data sources), algorithms (model creation tools), and models (resulting sets of word vectors). NLPL users with access to Abel and Taito can read the catalogue file directly from the project directory, for example when executing a series of experiments that make use of different pre-trained sets of word vectors.
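
For example, a minimal sketch of such programmatic access in Python, using the version 1.0 catalogue file on Abel named above; the three top-level section names are as documented here, but the structure of individual entries is not shown:

import json

# read the catalogue for repository version 1.0 (on Abel)
with open("/projects/nlpl/data/vectors/10.json") as stream:
  catalogue = json.load(stream)
# the three top-level sections: corpora, algorithms, and models
for section in ("corpora", "algorithms", "models"):
  print(section, len(catalogue[section]))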

Each repository entry (i.e. a set of word vectors, or ‘model’) is packaged in the form of a .zip archive, with uniform conventions for file naming inside the archive, using the entry model.txt for the actual vectors. Each archive includes the relevant excerpts from the catalogue as a file meta.json to help identify the specific contents; a README file included with each model entry provides a life-time unique identifier, e.g. http://vectors.nlpl.eu/repository/11/3.zip for model #3 in the 1.1 release of the repository.
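
By way of illustration, the following sketch inspects the archive for model #3 from release 1.1, assuming the local copy under the project directory on Abel mirrors the URL above:

import json
import zipfile

# model #3 in the 1.1 release, stored locally on Abel
with zipfile.ZipFile("/projects/nlpl/data/vectors/11/3.zip", "r") as archive:
  print(archive.namelist())  # includes model.txt, meta.json, and README
  with archive.open("meta.json") as stream:
    meta = json.load(stream)  # the catalogue excerpt for this model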

Using NLPL Models In-Situ

To avoid data duplication, it is recommended to load models directly from the NLPL project directory when working on Abel or Taito. Repository entries are uniformly packaged as .zip compressed archives, but the uniform naming scheme makes it possible to read one or more of the model files directly from the archive.

In Python, for example, something along the following lines should work to iterate over all of the entries in a model:

import zipfile

# version 1.1 of the repository, as stored on Abel
repository = "/projects/nlpl/data/vectors/11"
with zipfile.ZipFile(repository + "/30.zip", "r") as archive:
  # by the uniform naming conventions, the vectors are in the entry model.txt
  with archive.open("model.txt") as stream:
    for line in stream:
      ...

Alternatively, when working in a framework like gensim, the open stream can be handed directly to the model loader, in place of the line-by-line loop above:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=False)
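
Once loaded, the vectors can be queried through the usual gensim interface. A purely illustrative sketch (the actual vocabulary entries depend on how the individual model was preprocessed, e.g. some models use lemmatized or part-of-speech-tagged forms):

# five nearest neighbours; the query term is hypothetical and assumes
# plain word forms in the model vocabulary
print(model.most_similar("house", topn=5))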

Future Work

  1. The life-time handle for each model should be included in the JSON catalogue (in addition to being listed in the README file).
  2. Each corpus should be listed as a separate entry; corpus combinations go into the array-valued corpus property on models.
  3. Where applicable, there should be an array-valued documentation field (of strings, typically URLs) on corpora and models.
  4. The maintainers property may be over-promising, as third-party models are often unmaintained in practice; maybe rename it to creator?