I have trained and saved a Doc2Vec model in Colab as follows:

model = gensim.models.Doc2Vec(vector_size=size_of_vector, window=10, min_count=5, workers=16, alpha=0.025, min_alpha=0.025, epochs=40)
model.build_vocab(allXs)
model.train(allXs, epochs=model.epochs, total_examples=model.corpus_count)

The model is saved in a folder that is not accessible from my Drive, but whose contents I can list with:

from os import listdir
from os.path import isfile, getsize
from operator import itemgetter

files = [(f, getsize(f)) for f in listdir('.') if isfile(f)]
files.sort(key=itemgetter(1), reverse=True)

for f, size in files:
    print ('{} {}'.format(size, f))
print ('({} files {} total size)'.format(len(files), sum(f[1] for f in files)))

The output is:

79434928 Model_after_train.docvecs.vectors_docs.npy
9155086 Model_after_train
1024 .rnd
(3 files 88591038 total size)

To move the two files into the same shared directory as the notebook, I use:

folder_id = FolderID

for f, size in files:
  # Match the model files listed above (main file and its .npy companion)
  if 'Model_after_train' in f:
    file = drive.CreateFile({'parents':[{u'id': folder_id}]})
    file.SetContentFile(f)
    file.Upload()

I am now facing two problems:

1) gensim creates two files when saving the model. Which one should I load?

2) When I try to load either file with:

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

from googleapiclient.discovery import build
drive_service = build('drive', 'v3')

file_id = FileID


import io
from googleapiclient.http import MediaIoBaseDownload

request = drive_service.files().get_media(fileId=file_id)
downloaded = io.BytesIO()
downloader = MediaIoBaseDownload(downloaded, request)
done = False
while done is False:
  _, done = downloader.next_chunk()
model = doc2vec.Doc2Vec.load(downloaded.read())

I am not able to load the model and get the error:

TypeError: file() argument 1 must be encoded string without null bytes, not str

Any suggestions?

Valerio D. Ciotti

1 Answer

I've never used gensim, but from a look at the docs, here's what I think is going on:

  1. You're getting two files because you passed separately=True to save, which is saving large numpy arrays in the output as separate files. You'll want to copy both files around.

  2. Based on the load docs, you want to pass a filename, not the contents of the file. So when fetching the file from Drive, save to a file, and pass mmap='r' to load.
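A minimal sketch of both points, under some assumptions: the helper names (companion_files, write_buffer_to_file, load_doc2vec) are hypothetical and made up for illustration, and I'm assuming gensim looks for the separate array file next to the main file under the same name prefix, so both files need to sit in one directory with their original names. The buffer argument stands in for the io.BytesIO object filled by MediaIoBaseDownload in the question.

```python
import io
import os


def companion_files(directory, model_name):
    """Return the main model file plus any side files saved next to it.

    Side files are assumed to be named '<model_name>.<something>',
    as in 'Model_after_train.docvecs.vectors_docs.npy'.
    """
    return sorted(
        name for name in os.listdir(directory)
        if name == model_name or name.startswith(model_name + ".")
    )


def write_buffer_to_file(buffer, path):
    """Write the full contents of an in-memory download buffer to disk."""
    with open(path, "wb") as handle:
        # getvalue() returns everything written to the buffer,
        # regardless of the current stream position.
        handle.write(buffer.getvalue())
    return path


def load_doc2vec(path):
    """Load a Doc2Vec model from a filename (not from the file's bytes)."""
    # Imported lazily so the helpers above work without gensim installed.
    from gensim.models.doc2vec import Doc2Vec
    return Doc2Vec.load(path, mmap="r")
```

In the question's setup, each of the two Drive files would be downloaded into its own buffer, written out with write_buffer_to_file under its original name, and then load_doc2vec('Model_after_train') would be called on the main file only.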

If that doesn't get you up and running, it'd be helpful to see a complete example (e.g., with fake data).

Craig Citro
  • The default value for `separately` is `None`. However, according to the documentation, "if None - automatically detect large numpy/scipy.sparse arrays in the object being stored, and store them into separate files", which is quite counterintuitive. The other possibility is to pass a `list of str`, which again seems to create two separate files: "if list of str - this attributes will be stored in separate files, the automatic check is not performed in this case". However, when I try `separately='list of str'` I get `TypeError: cannot concatenate 'str' and 'list' objects` – Valerio D. Ciotti Mar 26 '18 at 09:50