Vectors/elmo/tutorial
Latest revision as of 00:12, 3 February 2020
Background
ELMo (https://allennlp.org/elmo) is a family of contextualized word embeddings first introduced in [Peters et al. 2018].
Using pre-trained models
Pre-trained ELMo models are available from the NLPL Word Embeddings repository (http://vectors.nlpl.eu/repository/).
Python code to infer contextualized word vectors from any input text, given a pre-trained model, is available at:
https://github.com/ltgoslo/simple_elmo
Training ELMo on Saga
We recommend using the code from https://github.com/ltgoslo/simple_elmo_training to train an ELMo model with TensorFlow.
After cloning the repository and installing the dependencies, training boils down to running the following command:
python3 bin/train_elmo.py --train_prefix $DATA --size $SIZE --vocab_file $VOCAB --save_dir $OUT
where
$DATA is a path to the directory containing any number of (possibly gzipped) plain text files: your training corpus.
$SIZE is the number of word tokens in $DATA (necessary to properly construct and log batches).
$VOCAB is a (possibly gzipped) one-word-per-line vocabulary file to be used for language modeling; it should always contain at least <S>, </S> and <UNK>.
$OUT is a directory where the TensorFlow checkpoints will be saved.
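The values for $SIZE and $VOCAB have to be prepared before training. The following is a minimal sketch of one way to derive both from the corpus directory; it is not part of simple_elmo_training. It assumes whitespace tokenization and UTF-8 text, and the function names (open_text, count_tokens, write_vocab) are hypothetical. Writing the three special tokens first and the remaining words by descending frequency is a choice made here, not a requirement stated above.

```python
import gzip
from collections import Counter
from pathlib import Path

# Tokens that the vocabulary file should always contain (see above).
SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def open_text(path):
    """Open a plain or gzipped text file for reading as UTF-8 (assumed encoding)."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def count_tokens(data_dir):
    """Return (total token count, word frequencies) over all files in data_dir.

    Assumption: tokens are separated by whitespace.
    """
    freqs = Counter()
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_file():
            continue
        with open_text(path) as handle:
            for line in handle:
                freqs.update(line.split())
    return sum(freqs.values()), freqs


def write_vocab(freqs, vocab_path, max_words=None):
    """Write a one-word-per-line vocabulary file and return its number of lines.

    The special tokens come first, followed by up to max_words corpus words
    in order of descending frequency (all words if max_words is None).
    """
    words = [w for w, _ in freqs.most_common(max_words) if w not in SPECIAL_TOKENS]
    with open(vocab_path, "w", encoding="utf-8") as out:
        for word in SPECIAL_TOKENS + words:
            out.write(word + "\n")
    return len(SPECIAL_TOKENS) + len(words)
```

With this sketch, the first element returned by count_tokens on the $DATA directory is the number to pass as --size, and the file produced by write_vocab can be passed as --vocab_file.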
There are currently three options for training ELMo on Saga with GPU-enabled TensorFlow:
- using the NLPL environment (recommended)
- using a system TensorFlow module
- using Anaconda.
Training speed is comparable across the three options: one epoch over 100 million word tokens takes about 3 hours with 2 NVIDIA P100 GPUs and batch size 192.
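The figure above can be turned into a rough walltime estimate for the --time line of a SLURM file. The sketch below assumes that training time scales linearly with the number of tokens and epochs on the same hardware and batch size, which is only an approximation. The function names and the 20% safety margin are hypothetical choices for illustration.

```python
import math

# Reference figure quoted above: about 3 hours per epoch over
# 100 million word tokens (2 NVIDIA P100 GPUs, batch size 192).
REFERENCE_TOKENS = 100_000_000
REFERENCE_HOURS = 3.0


def estimate_hours(n_tokens, n_epochs=1):
    """Estimate training hours, assuming linear scaling in tokens and epochs."""
    return REFERENCE_HOURS * (n_tokens / REFERENCE_TOKENS) * n_epochs


def slurm_time(hours, safety_margin=1.2):
    """Format a duration as D-HH:MM:SS, rounded up to whole minutes.

    safety_margin is a hypothetical multiplier that leaves some headroom.
    """
    # Round before taking the ceiling to avoid floating-point noise.
    minutes = math.ceil(round(hours * safety_margin * 60, 6))
    days, rem = divmod(minutes, 24 * 60)
    hh, mm = divmod(rem, 60)
    return f"{days}-{hh:02d}:{mm:02d}:00"
```

For example, estimate_hours(250_000_000, n_epochs=3) gives 22.5 hours, so the 10:00:00 limit used in the example SLURM files would be too short for such a run.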
Using the NLPL environment
You will need to load the NLPL-provided modules nlpl-python-candy/201912/3.7 and nlpl-tensorflow/1.15.2/3.7.
Example SLURM file:
#!/bin/bash
#SBATCH --job-name=ELMo
#SBATCH --mail-type=FAIL
#SBATCH --account=nn9447k # Use your project number
#SBATCH --partition=accel # To use the accelerator nodes
#SBATCH --gres=gpu:2 # To specify how many GPUs to use
#SBATCH --time=10:00:00 # Max walltime is 14 days.
#SBATCH --mem-per-cpu=6G
#SBATCH --ntasks=8
set -o errexit # Recommended for easier debugging
## Load your modules
module purge # Recommended for reproducibility
module use -a /cluster/shared/nlpl/software/modules/etc
module load nlpl-python-candy/201912/3.7 nlpl-tensorflow/1.15.2/3.7
python3 bin/train_elmo.py --train_prefix $DATA --size $SIZE --vocab_file $VOCAB --save_dir $OUT
Using system TensorFlow
If you use the system TensorFlow module (TensorFlow/1.13.1-fosscuda-2018b-Python-3.6.6), you do not have to install TensorFlow locally.
Example SLURM file:
#!/bin/bash
#SBATCH --job-name=elmo
#SBATCH --mail-type=FAIL
#SBATCH --account=nn9447k # Use your project number
#SBATCH --partition=accel # To use the accelerator nodes
#SBATCH --gres=gpu:2 # To specify how many GPUs to use
#SBATCH --time=10:00:00 # Max walltime is 14 days.
#SBATCH --mem-per-cpu=6G
#SBATCH --ntasks=8
set -o errexit # Recommended for easier debugging
## Load your modules
module purge # Recommended for reproducibility
module load TensorFlow/1.13.1-fosscuda-2018b-Python-3.6.6
python3 bin/train_elmo.py --train_prefix $DATA --size $SIZE --vocab_file $VOCAB --save_dir $OUT
Using Anaconda
If you use Anaconda, you should install the tensorflow-gpu Python package locally. The advantage is that you can choose (to some extent) the version of TensorFlow.
Example SLURM file:
#!/bin/bash
#SBATCH --job-name=elmo
#SBATCH --mail-type=FAIL
#SBATCH --account=nn9447k # Use your project number
#SBATCH --partition=accel # To use the accelerator nodes
#SBATCH --gres=gpu:2 # To specify how many GPUs to use
#SBATCH --time=10:00:00 # Max walltime is 14 days.
#SBATCH --mem-per-cpu=6G
#SBATCH --ntasks=8
set -o errexit # Recommended for easier debugging
module purge # Recommended for reproducibility
module load Anaconda3/2019.03
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/cluster/software/Anaconda3/2019.03/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/cluster/software/Anaconda3/2019.03/etc/profile.d/conda.sh" ]; then
        . "/cluster/software/Anaconda3/2019.03/etc/profile.d/conda.sh"
    else
        export PATH="/cluster/software/Anaconda3/2019.03/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
conda activate python3.6
python3 bin/train_elmo.py --train_prefix $DATA --size $SIZE --vocab_file $VOCAB --save_dir $OUT