
I am trying to use the preprocessing script contained in the flowers-sample (I saw that it was modified today and is no longer deprecated). However, after installing the required packages, the pipeline fails with these error logs

(caeb3b0a930d0a6): Workflow failed. Causes: (caeb3b0a930d587): S01:Save to disk/Write/WriteImpl/InitializeWrite failed.

and

(d50acb0dd46c44c6): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 666, in run
    self._load_main_session(self.local_staging_directory)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 411, in _load_main_session
    pickler.load_session(session_file)
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 230, in load_session
    return dill.load_session(file_path)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 363, in load_session
    module = unpickler.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named util

I get the same errors running the process from two different Google Compute Engine instances where I have installed the packages listed in requirements.txt.

Does it refer to the util.py file in the trainer directory, or are there additional packages I should install to avoid this error?

EffePi
  • Actually what's going on is that the util.py file is not shipped with the rest of the file into the dataflow containers. Will look into it. – Elmer Mar 30 '17 at 15:57
  • Can you post the command that you ran? – Pablo Mar 30 '17 at 18:12
  • This is the command I ran: python trainer/preprocess.py --input_dict "gs://path_to_files/dict.txt" --input_path "gs://path_to_files/train_data.csv" --output_path "gs://path_to_files/preproc/train" --cloud – EffePi Mar 31 '17 at 12:57

1 Answer


I have found a workaround: in preprocess.py I replaced the import of the util package with the definition of get_cloud_project() that is contained in util.py.

I don't know whether the issue is caused by using a local package in a Dataflow pipeline. I doubt it, because get_cloud_project() is not called inside the pipeline definition, but this is the first time I have used Dataflow.
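For reference, a hypothetical sketch of what inlining such a helper could look like — this assumes the project id can be read from the active gcloud configuration, which may differ from the sample's actual util.py implementation:

```python
import subprocess


def get_cloud_project():
  """Hypothetical inlined replacement for the helper imported from util.

  Assumes the current project id is available via the local gcloud config;
  the flowers sample's real implementation may differ.
  """
  cmd = ['gcloud', 'config', 'list', 'project',
         '--format=value(core.project)']
  # check_output raises CalledProcessError if gcloud is missing or fails.
  return subprocess.check_output(cmd).decode('utf-8').strip()
```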

If someone else knows whether it is possible to make the code work without modifying it, please tell me!
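One avenue worth checking, sketched below: Apache Beam documents a `--setup_file` pipeline option that builds and stages a local package (including modules like trainer/util.py) to the Dataflow workers. This assumes preprocess.py forwards unrecognized flags to the Beam pipeline options, which may not hold for this sample:

```python
# The command from the comments above, as an argument list.
base_cmd = [
    "python", "trainer/preprocess.py",
    "--input_dict", "gs://path_to_files/dict.txt",
    "--input_path", "gs://path_to_files/train_data.csv",
    "--output_path", "gs://path_to_files/preproc/train",
    "--cloud",
]

# Beam's --setup_file option points at a setup.py describing the local
# package, so the runner ships an sdist of it with the job and workers
# can import the util module.
cmd = base_cmd + ["--setup_file", "./setup.py"]
print(" ".join(cmd))
```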

EffePi
  • The code was updated back to do this, since the util module doesn't get pushed to the dataflow container. Thanks for the feedback. – Elmer Mar 31 '17 at 16:34