
I'm using this code to get embeddings for the sentences in my dataset (with my pretrained model):

python extract_features.py \
  --input_file=/tmp/input.txt \
  --output_file=/tmp/output.jsonl \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4 \
  --max_seq_length=128 \
  --batch_size=32

But there is a problem: is there a way to get the embeddings faster? For 2,000 sentences it took 6 hours; my dataset contains 20,000 sentences, and 60 hours would be too long for Colab. Thanks.

1 Answer


I resolved it. I wrote all of the sentences to input.txt, and after that I used this code to load the embeddings:

import jsonlines
import pandas as pd

rows = []
with jsonlines.open('/content/tmp/output.jsonl') as f:
    for line in f.iter():
        # Vector of the first token ([CLS]) from the first requested layer (-1)
        s = line['features'][0]['layers'][0]['values']
        rows.append(s)

# Build the dataframe once; appending row by row is much slower
df_emb = pd.DataFrame(rows)

After that I saved the dataframe to a CSV file.
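
For reference, a minimal sketch of that last step (the file name embeddings.csv is only an example):

# Write one embedding per row; the path is only an example
df_emb.to_csv('embeddings.csv', index=False)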