
Hello, I have been trying to extract contextual word embeddings using the novel XLNet, but without luck.

I am running on Google Colab with a TPU.

I would like to note that I get this error when I use a TPU, so I switched to a GPU to avoid it:

xlnet_config = xlnet.XLNetConfig(json_path=FLAGS.model_config_path)

AttributeError: module 'xlnet' has no attribute 'XLNetConfig'
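
For reference, XLNetConfig is defined in xlnet.py inside the cloned repository, so this error generally means the name xlnet is resolving to something other than that module (for example, the bare directory created by git clone). A minimal check, assuming the repo was cloned into ./xlnet (the path is my assumption):

import sys

# Make sure the cloned repo directory is searched first, so that `import xlnet`
# picks up xlnet/xlnet.py rather than an empty namespace package.
sys.path.insert(0, "xlnet")

import xlnet
print(xlnet.__file__)                 # shows which file actually got imported
print(hasattr(xlnet, "XLNetConfig"))  # True if the repo's xlnet.py was imported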

However, I get another error when I use a GPU:

run_config = xlnet.create_run_config(is_training=True, is_finetune=True, FLAGS=FLAGS)

AttributeError: use_tpu
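
For context, create_run_config reads its settings from the absl FLAGS object, and absl raises AttributeError: use_tpu when no flag with that name has been defined. Below is a minimal sketch of defining and parsing flags before the call; it is illustrative only, and the real xlnet.py expects more flags than the ones shown (only use_tpu is confirmed by the error message; the other names and defaults are my assumption):

import sys
from absl import flags

FLAGS = flags.FLAGS

# Illustrative placeholders: every flag create_run_config touches must be
# defined, and the flags must be parsed, before FLAGS.<name> is accessed.
flags.DEFINE_bool("use_tpu", False, "Whether to run on TPU.")
flags.DEFINE_float("dropout", 0.1, "Dropout rate.")
flags.DEFINE_float("dropatt", 0.1, "Attention dropout rate.")

FLAGS(sys.argv[:1])  # parse; passing only argv[0] avoids the notebook kernel's extra arguments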

I will post the whole code below. I am using a small sentence as input until it works; then I will switch to the full data.

Main Code:

import sentencepiece as spm
import numpy as np
import tensorflow as tf
from prepro_utils import preprocess_text, encode_ids
import xlnet

text = "The metamorphic rocks of western Crete form a series some 9000 to 10,000 ft."
sp_model = spm.SentencePieceProcessor()
sp_model.Load("/content/xlnet_cased_L-24_H-1024_A-16/spiece.model")

text = preprocess_text(text) 
ids = encode_ids(sp_model, text)

#print('ids',ids)

# some code omitted here...
# initialize FLAGS
# initialize instances of tf.Tensor, including input_ids, seg_ids, and input_mask

# XLNetConfig contains hyperparameters that are specific to a model checkpoint.
xlnet_config = xlnet.XLNetConfig(json_path=FLAGS.model_config_path)  # <-- ERROR 1 HERE
from absl import flags
import sys

FLAGS = flags.FLAGS
# RunConfig contains hyperparameters that could be different between pretraining and finetuning.
run_config = xlnet.create_run_config(is_training=True, is_finetune=True, FLAGS=FLAGS)  # <-- ERROR 2 HERE
xp = []
xp.append(ids)
input_ids = np.asarray(xp)
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=run_config,
    input_ids=input_ids,
    seg_ids=None,
    input_mask=None)
embed1 = tf.train.load_variable('../data/xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt', 'model/transformer/word_embedding/lookup_table:0')
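
As a side note, the variable name in that last line can be checked against the checkpoint before loading it. A small sketch using standard TF1 checkpoint utilities (the checkpoint path is my assumption; adjust it to wherever the model was unzipped):

import tensorflow as tf

ckpt = "xlnet_cased_L-24_H-1024_A-16/xlnet_model.ckpt"

# List the variables stored in the checkpoint, then load the word-embedding
# lookup table once its exact name is confirmed.
for name, shape in tf.train.list_variables(ckpt):
    if "word_embedding" in name:
        print(name, shape)

embed_table = tf.train.load_variable(ckpt, "model/transformer/word_embedding/lookup_table")
print(embed_table.shape)  # expected to be (vocab_size, 1024) for this checkpoint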

Before the main code I clone XLNet from GitHub, download the pretrained model, and so on (posted below as well):

! pip install sentencepiece
# Download the pretrained XLNet model and unzip it (only needs to be done once)
! wget https://storage.googleapis.com/xlnet/released_models/cased_L-24_H-1024_A-16.zip
! unzip cased_L-24_H-1024_A-16.zip
! git clone https://github.com/zihangdai/xlnet.git

SCRIPTS_DIR = 'xlnet' #@param {type:"string"}
DATA_DIR = 'aclImdb' #@param {type:"string"}
OUTPUT_DIR = 'proc_data/imdb' #@param {type:"string"}
PRETRAINED_MODEL_DIR = 'xlnet_cased_L-24_H-1024_A-16' #@param {type:"string"}
CHECKPOINT_DIR = 'exp/imdb' #@param {type:"string"}

train_command = "python xlnet/run_classifier.py \
  --do_train=True \
  --do_eval=True \
  --eval_all_ckpt=True \
  --task_name=imdb \
  --data_dir="+DATA_DIR+" \
  --output_dir="+OUTPUT_DIR+" \
  --model_dir="+CHECKPOINT_DIR+" \
  --uncased=False \
  --spiece_model_file="+PRETRAINED_MODEL_DIR+"/spiece.model \
  --model_config_path="+PRETRAINED_MODEL_DIR+"/xlnet_config.json \
  --init_checkpoint="+PRETRAINED_MODEL_DIR+"/xlnet_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --learning_rate=2e-5 \
  --train_steps=4000 \
  --warmup_steps=500 \
  --save_steps=500 \
  --iterations=500"

! {train_command}
IS92

1 Answer


Check this gist out.

We have made it really easy to get token-level embeddings from XLNet.

Update: the gist has been updated.

For detailed documentation and more examples, check GitHub.
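
Based on the import and the constructor call that appear in the comment thread below, usage looks roughly like this; treat it as a sketch and defer to the gist and the GitHub docs for the exact API (the token-level call without pooling is my assumption):

from embedding_as_service.text.encode import Encoder

# Rough sketch assembled from the snippets in the comments below.
encoder = Encoder(embedding='xlnet', model='xlnet_large_cased', max_seq_length=256)

texts = ["The metamorphic rocks of western Crete form a series some 9000 to 10,000 ft."]

# Pooled, sentence-level vectors, as used in the comment thread.
sentence_vecs = encoder.encode(texts=texts, pooling='reduce_mean')

# Token-level embeddings (what the answer refers to), presumably obtained by
# omitting the pooling argument; check the gist for the exact call.
token_vecs = encoder.encode(texts=texts)

print(sentence_vecs.shape, token_vecs.shape)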

ashutosh singh
  • Yes, XLNet gives you contextual embeddings through language modeling. Happy to help... please accept the answer and do give a star to our repo, it would be a great help. – ashutosh singh Aug 24 '19 at 11:44
  • Update: from embedding_as_service.text.encode import Encoder instead of: from models.encode import Encoder – IS92 Sep 04 '19 at 10:32
  • Yes we have made a new release. In the next release we will have most of these models as trainable layers!! – ashutosh singh Sep 05 '19 at 20:30
  • I am currently using your existing one in a project. I hope that this version remains unchanged so my project continues working. – IS92 Sep 08 '19 at 10:31
  • You don't have to worry – ashutosh singh Sep 08 '19 at 14:16
  • Thanks Ashutosh, does it output word embeddings of the last layer or layers? if it is just the last layer, can we also get word embedding for the hidden layers as well (i.e. last four layers)? – Sade Dec 17 '19 at 10:24
  • Yes, this output is from the last layer. It's possible to get outputs from intermediate layers as well; it is not implemented there yet, though. Which model are you talking about? – ashutosh singh Dec 17 '19 at 12:56
  • I'm using xlnet and bert. – Sade Dec 18 '19 at 19:08
  • Check this [gist](https://colab.research.google.com/gist/ashutoshsingh0223/d6d673a942dd15546fc28e9fce875b51/embedding-as-service.ipynb). Look for title `Get outputs from intermediate layers from BERT` – ashutosh singh Dec 19 '19 at 06:12
  • Hi Ashutosh, I'm trying to get similarities of sentences using your module, but it seems like it's not working well. I tested it even for single words like 'USA' and 'America'. Similar sentences exhibit no difference from non-similar ones. Is your xlnet model pre-trained or not? Thanks. – E_learner May 03 '20 at 09:29
  • Yes, it is the pre-trained one. Can you post a snippet of your code here? – ashutosh singh May 03 '20 at 09:54
  • `from embedding_as_service.text.encode import Encoder xlnet_en = Encoder(embedding='xlnet', model='xlnet_large_cased', max_seq_length=256) s1 = "He repeatedly spoke about the benefits of spending the summer" s2 = "He repeatedly spoke about the advantages of spending the summer" v1 = xlnet_en.encode(texts=[s1], pooling='reduce_mean') v2 = xlnet_en.encode(texts=[s2], pooling='reduce_mean') from sklearn.metrics.pairwise import cosine_similarity dist = cosine_similarity(v1, v2)` – E_learner May 03 '20 at 10:45
  • Let me check and comeback – ashutosh singh May 03 '20 at 10:47
  • hi Ashutosh, did you have the chance to check my code? – E_learner May 03 '20 at 13:34
  • Hi. You are right, somehow the pre-trained weights are not loading up. Fixing this up. Sorry for the inconvenience. – ashutosh singh May 05 '20 at 19:15