Revision as of 19:11, 15 November 2019

Background

ELMo is a family of contextualized word embeddings first introduced in [Peters et al. 2018].

Using pre-trained models

Pre-trained ELMo models are available from the NLPL Word Embeddings repository.

Python code to infer contextualized word vectors from any input text, given a pre-trained model:

Training ELMo on Saga

There are currently two options for training ELMo on Saga with GPU-enabled TensorFlow: using a system TensorFlow module, or using Anaconda.

In both cases, one should use the code from https://github.com/akutuzov/bilm-tf to train a model. It boils down to running the command

```python3 bin/train_elmo.py --train_prefix $DATA --size $SIZE --vocab_file $VOCAB --save_dir $OUT```

where

$DATA is a path to the directory containing any number of (possibly gzipped) plain text files: your training corpus.

$SIZE is the number of word tokens in $DATA (necessary to properly construct and log batches).

$VOCAB is a (possibly gzipped) one-word-per-line vocabulary file to be used for language modeling; it should always contain at least <S>, </S> and <UNK>.

$OUT is a directory where the TensorFlow checkpoints will be saved.
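The values for $SIZE and $VOCAB can be derived from the corpus itself. The following is a minimal sketch, not part of the bilm-tf code: it assumes that tokens are separated by whitespace and that every file in $DATA is UTF-8 text (plain or gzipped). The function names and the max_words parameter are hypothetical and chosen only for illustration.

```python
import gzip
from collections import Counter
from pathlib import Path

# Special tokens that the vocabulary file should always contain.
SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def open_text(path):
    """Open a plain or gzipped text file for reading as UTF-8."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def count_words(data_dir):
    """Count whitespace-separated tokens over all files in data_dir.

    Assumption: the corpus is already tokenized, so splitting on
    whitespace yields the word tokens.
    """
    counts = Counter()
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file():
            continue
        with open_text(path) as handle:
            for line in handle:
                counts.update(line.split())
    return counts


def write_vocab(counts, vocab_path, max_words=None):
    """Write special tokens first, then words by descending frequency.

    Returns the number of lines written. max_words limits how many of the
    most frequent corpus words are considered (None means all of them).
    """
    words = [w for w, _ in counts.most_common(max_words)
             if w not in SPECIAL_TOKENS]
    with open(vocab_path, "w", encoding="utf-8") as out:
        for word in SPECIAL_TOKENS + words:
            out.write(word + "\n")
    return len(SPECIAL_TOKENS) + len(words)
```

With these helpers, sum(count_words(DATA).values()) gives a token count that can be passed as $SIZE, and write_vocab(count_words(DATA), VOCAB) produces a one-word-per-line file that starts with the three special tokens.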

Using system TensorFlow

Using Anaconda

If using Anaconda, you should install the tensorflow-gpu Python package locally.

Example SLURM file:

#!/bin/bash
#SBATCH --job-name=elmo
#SBATCH --mail-type=FAIL
#SBATCH --account=nn9447k  # Use your project number
#SBATCH --partition=accel    # To use the accelerator nodes
#SBATCH --gres=gpu:2         # To specify how many GPUs to use
#SBATCH --time=10:00:00      # Max walltime is 14 days.
#SBATCH --mem-per-cpu=6G
#SBATCH --ntasks=8
set -o errexit  # Recommended for easier debugging
module purge   # Recommended for reproducibility
module load Anaconda3/2019.03
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/cluster/software/Anaconda3/2019.03/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
   eval "$__conda_setup"
else
   if [ -f "/cluster/software/Anaconda3/2019.03/etc/profile.d/conda.sh" ]; then
       . "/cluster/software/Anaconda3/2019.03/etc/profile.d/conda.sh"
   else
       export PATH="/cluster/software/Anaconda3/2019.03/bin:$PATH"
   fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate python3.6
python3 bin/train_elmo.py --train_prefix $DATA --size $SIZE --vocab_file $VOCAB --save_dir $OUT
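
Since a queued job may wait a long time before it starts, it can help to check the training inputs before submitting. The sketch below is an illustration only and is not part of bilm-tf; check_training_inputs and read_vocab are hypothetical helper names. It checks only what is stated above: $DATA is a non-empty directory, $SIZE is a positive number, and $VOCAB contains the three special tokens.

```python
import gzip
import os

# Tokens the vocabulary file should always contain.
REQUIRED_TOKENS = {"<S>", "</S>", "<UNK>"}


def read_vocab(vocab_path):
    """Read a (possibly gzipped) one-word-per-line vocabulary into a set."""
    opener = gzip.open if vocab_path.endswith(".gz") else open
    with opener(vocab_path, "rt", encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def check_training_inputs(data_dir, size, vocab_path):
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    if not os.path.isdir(data_dir) or not os.listdir(data_dir):
        problems.append("data directory is missing or empty: " + data_dir)
    if int(size) <= 0:
        problems.append("size must be a positive number of tokens")
    if not os.path.isfile(vocab_path):
        problems.append("vocabulary file not found: " + vocab_path)
    else:
        missing = REQUIRED_TOKENS - read_vocab(vocab_path)
        if missing:
            problems.append("vocabulary lacks: " + ", ".join(sorted(missing)))
    return problems
```

An empty result only means that none of these basic checks failed; it does not guarantee that training will succeed.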

Difference between revisions of "Vectors/elmo/tutorial"
