Background
The purpose of the NLPL repository of word vectors (or embeddings, which can comprise both ‘classic’, count-based and ‘modern’, dense models, including contextualized ones) is to make available a large and carefully curated collection of large-scale distributional semantic models for many languages. For general background, please see Fares et al. (2017) (http://www.ep.liu.se/ecp/article.asp?issue=131&article=037).
For interactive exploration and download of the repository, there is an on-line explorer at http://vectors.nlpl.eu/repository/. The underlying data is stored in the NLPL community directory below /cluster/shared/nlpl/data/vectors/ (on Saga) and /projappl/nlpl/data/vectors/ (on Puhti). The repository is versioned, in the sense of assigning release numbers to different stages of repository construction. Each repository entry is thus assigned a unique and persistent identifier; once published, a repository entry will never change (to aid replicability). The initial release (providing some two dozen models) was published in May 2017 as version 1.0. In March 2018, version 1.1 superseded this initial release, adding a large number of models and languages (including those from the 2017 UD parsing shared task, http://hdl.handle.net/11234/1-1989) and re-packaging the models from the original release in a more standardized format (see below). In December 2019, version 2.0 was released, adding BERT and ELMo models (including large monolingual models for Norwegian, see http://wiki.nlpl.eu/Vectors/norlm), making the metadata more consistent, and ensuring that binary-format models are always provided (these load faster than models stored as plain text).
Repository Contents
The on-line explorer dynamically presents parts of the information encoded for programmatic access in the repository catalogue, which is represented as a JSON file in the top-level repository directory, with one catalogue file per repository version, e.g. /cluster/shared/nlpl/data/vectors/20.json (on Saga) for the current repository release.
The catalogue contains three top-level sections, one each for corpora (data sources), algorithms (model creation tools), and models (resulting sets of word vectors). NLPL users with access to Saga and Puhti can read the catalogue file directly from the NLPL community directory, for example when executing a series of experiments that make use of different pre-trained sets of word vectors. Further documentation of the catalogue metadata is available on a separate page: http://wiki.nlpl.eu/index.php/Vectors/metadata.
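Since the catalogue is plain JSON, it can also be inspected programmatically. A minimal sketch, assuming only the three section names given above (the internal structure of each section is not specified here and may differ):

    import json

    # Catalogue for repository release 2.0 (Saga path)
    with open("/cluster/shared/nlpl/data/vectors/20.json") as stream:
        catalogue = json.load(stream)

    # The three documented top-level sections
    for section in ("corpora", "algorithms", "models"):
        print(section, len(catalogue[section]))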
Each repository entry (i.e. set of word vectors, or ‘model’) is packaged in the form of a .zip archive, with uniform conventions for file naming inside the archive, using the model.txt and model.bin entries for the actual vectors. Each archive includes the relevant excerpts from the catalogue as a file meta.json to help identify the specific contents; a README file included with each model entry provides a life-time unique identifier, e.g. http://vectors.nlpl.eu/repository/20/3.zip for model #3 in the 2.0 release of the repository.
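Because of this uniform packaging, the per-model metadata can be read without unpacking the archive. A small sketch, using model #3 of release 2.0 as the example entry:

    import json
    import zipfile

    # Model #3 in release 2.0, read in place from the community directory (Saga path)
    with zipfile.ZipFile("/cluster/shared/nlpl/data/vectors/20/3.zip", "r") as archive:
        # meta.json carries the catalogue excerpts for this entry
        with archive.open("meta.json") as stream:
            meta = json.load(stream)
    print(meta)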
Using NLPL Models In-Situ
To avoid data duplication, it is recommended, when working on Saga or Puhti, to load models directly from the NLPL community directory. Repository entries are uniformly packaged as .zip compressed archives, but the uniform naming scheme makes it possible to directly read one or more of the model files from the archive.
In Python, for example, something along the following lines should work to iterate over all of the entries in the model:

    import zipfile
    import gensim

    # Repository release 2.0, read in place from the community directory (Saga path)
    repository = "/cluster/shared/nlpl/data/vectors/20"
    with zipfile.ZipFile(repository + "/30.zip", "r") as archive:
        # model.txt stores the vectors in plain-text (word2vec) format
        stream = archive.open("model.txt")
        for line in stream:
            ...
Alternatively, when working in a framework like gensim, the same stream can be passed directly to the model loader:

    model = gensim.models.KeyedVectors.load_word2vec_format(stream, binary=False, unicode_errors='replace')
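Once loaded, the model supports the usual gensim KeyedVectors operations; a brief usage sketch (the query word is purely illustrative and must be present in the model's vocabulary):

    # Nearest neighbours by cosine similarity
    print(model.most_similar("oslo", topn=5))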
Binary fastText models (stored as parameters.bin files) should first be extracted from the .zip archive and then loaded with:

    model = gensim.models.fasttext.load_facebook_vectors("parameters.bin")
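A minimal end-to-end sketch of that extraction step (the entry number NN is a placeholder; substitute the number of an actual fastText model):

    import zipfile
    import gensim

    repository = "/cluster/shared/nlpl/data/vectors/20"
    # Placeholder entry number; parameters.bin lands in the current working directory
    with zipfile.ZipFile(repository + "/NN.zip", "r") as archive:
        archive.extract("parameters.bin")
    model = gensim.models.fasttext.load_facebook_vectors("parameters.bin")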
Future Work
- Prepare version 2.0, now with /cluster/shared/nlpl/data/vectors/ as the master copy (DONE).
- The life-time handle for each model should be included in the JSON catalogue (in addition to being listed in the README file) (DONE; see the sketch after this list).
- For classic models, redundantly add a binary model.bin for faster loading (DONE).
- Each corpus should be listed as a separate entry; corpus combinations go into the array-valued corpus property on models (DONE).
- Where applicable, there should be an array-valued documentation field (of strings, typically URLs) on corpora and models (DONE).
- The maintainers property may be over-promising, as third-party models are often unmaintained in practice; maybe rename to creator (DONE).
- Document (and possibly re-design) the metadata scheme; maybe invent a fourth category for a process applied prior to model training.
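Several of the completed items above should now be observable in the catalogue itself. A rough sketch (the property names handle, corpus, and documentation come from the list above, but treating the models section as a list of objects is an assumption about the JSON shape):

    import json

    with open("/cluster/shared/nlpl/data/vectors/20.json") as stream:
        catalogue = json.load(stream)

    # handle, corpus, and documentation are the properties discussed above;
    # the list-of-objects shape of the models section is assumed
    for model in catalogue["models"]:
        print(model.get("handle"), model.get("corpus"), model.get("documentation"))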