Difference between revisions of "Vectors/home"

From Nordic Language Processing Laboratory

Revision as of 21:08, 23 January 2018

Background

The purpose of the NLPL repository of word vectors (which can comprise both ‘classic’, count-based and ‘modern’, dense models) is to make available a large and carefully curated collection of large-scale distributional models for many languages. For general background, please see Fares et al. (2017).

For interactive exploration and download of the repository, there is an on-line explorer (http://vectors.nlpl.eu/repository/). The underlying data is stored in the NLPL project directory below /projects/nlpl/data/vectors/. The repository is versioned, in the sense that release numbers are assigned to different stages of repository construction. The initial release (providing some two dozen models) was published in May 2017 as version 1.0. In early 2018, version 1.1 supersedes this initial release, adding a large number of models and languages and re-packaging the models from the original release in a more standardized format (see below).

Repository Contents

Using NLPL Models In-Situ

To avoid data duplication, it is recommended to load models directly from the NLPL project directory when working on Abel (or Taito, once the repository is replicated there). Repository entries are uniformly packaged as .zip compressed archives, but the uniform naming scheme makes it possible to read one or more of the model files directly from the archive, without unpacking it first.

In Python, for example, something along the following lines should work:

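import zipfile

repository = "/projects/nlpl/data/vectors/11"
# Open the archive read-only and stream one model file without unpacking it.
with zipfile.ZipFile(repository + "/30.zip", "r") as archive:
    with archive.open("model.txt") as stream:
        for entry in stream:
            # Each entry is one line of the model file, as bytes.
            pass  # process the line here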
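
To make the in-situ pattern above more concrete, here is a minimal sketch that lists the files in an archive and parses a text model into a dictionary. The helper names list_members and read_vectors are hypothetical. The sketch assumes that model.txt follows the common word2vec text layout (an optional header line with two integers, then one word followed by its vector components per line) and that it is UTF-8 encoded. Neither assumption is confirmed by this page.

```python
import io
import zipfile


def list_members(archive_path):
    """Return the names of all files stored in a repository archive."""
    with zipfile.ZipFile(archive_path, "r") as archive:
        return archive.namelist()


def read_vectors(archive_path, member="model.txt", limit=None, encoding="utf-8"):
    """Read up to `limit` word vectors from a text model inside a .zip archive.

    Assumed layout (not confirmed by the page): an optional
    "<count> <dimensions>" header, then "<word> <float> <float> ..." lines.
    """
    vectors = {}
    with zipfile.ZipFile(archive_path, "r") as archive:
        with archive.open(member) as stream:
            # archive.open() yields bytes, so wrap the stream to decode lines.
            text = io.TextIOWrapper(stream, encoding=encoding)
            for number, line in enumerate(text):
                fields = line.rstrip().split(" ")
                is_header = (
                    number == 0
                    and len(fields) == 2
                    and fields[0].isdigit()
                    and fields[1].isdigit()
                )
                if is_header or len(fields) < 2:
                    continue  # skip the header and any blank lines
                vectors[fields[0]] = [float(value) for value in fields[1:]]
                if limit is not None and len(vectors) >= limit:
                    break  # stop early; the rest of the file is never read
    return vectors
```

On Abel, a call such as read_vectors("/projects/nlpl/data/vectors/11/30.zip", limit=5) would then read only the first five entries. Because the lines are streamed from the archive, the limit keeps memory use small for quick inspection of large models.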