I tried to install dask on Google Composer (Airflow). I used PyPI (the GCP UI) to add dask and the required packages below (not sure if all the Google ones are needed; I couldn't find a requirements.txt):

 dask
 toolz
 partd
 cloudpickle
 google-cloud
 google-cloud-storage
 google-auth
 google-auth-oauthlib
 decorator
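
For context, the DAG task boils down to a dd.read_csv call against a GCS path; a minimal sketch of it (the bucket and file names here are placeholders, not my real ones):

import dask.dataframe as dd

def read_from_gcs():
    # dask resolves gs:// paths through the gcsfs library
    df = dd.read_csv('gs://some-bucket/some-file.csv')
    return df.head()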

When I run my DAG, which calls dd.read_csv on a GCS bucket path, it shows the error below in the Airflow log:

    [2018-10-24 22:25:12,729] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 350, in get_fs_token_paths
    [2018-10-24 22:25:12,733] {base_task_runner.py:98} INFO - Subtask:     fs, fs_token = get_fs(protocol, options)
    [2018-10-24 22:25:12,735] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/bytes/core.py", line 473, in get_fs
    [2018-10-24 22:25:12,740] {base_task_runner.py:98} INFO - Subtask:     "Need to install `gcsfs` library for Google Cloud Storage support\n"
    [2018-10-24 22:25:12,741] {base_task_runner.py:98} INFO - Subtask:   File "/usr/local/lib/python2.7/site-packages/dask/utils.py", line 94, in import_required
    [2018-10-24 22:25:12,748] {base_task_runner.py:98} INFO - Subtask:     raise RuntimeError(error_msg)
    [2018-10-24 22:25:12,751] {base_task_runner.py:98} INFO - Subtask: RuntimeError: Need to install `gcsfs` library for Google Cloud Storage support
    [2018-10-24 22:25:12,756] {base_task_runner.py:98} INFO - Subtask:     conda install gcsfs -c conda-forge
    [2018-10-24 22:25:12,758] {base_task_runner.py:98} INFO - Subtask:     or
    [2018-10-24 22:25:12,762] {base_task_runner.py:98} INFO - Subtask:     pip install gcsfs

So I tried to install gcsfs using PyPI, but got the Airflow error below:

{
  insertId:  "17ks763f726w1i"  
  logName:  "projects/xxxxxxxxx/logs/airflow-worker"  
  receiveTimestamp:  "2018-10-25T15:42:24.935880717Z"  
  resource: {…}  
  severity:  "ERROR"  
  textPayload:  "Traceback (most recent call last):
    File "/usr/local/bin/gcsfuse", line 7, in <module>
      from gcsfs.cli.gcsfuse import main
    File "/usr/local/lib/python2.7/site-packages/gcsfs/cli/gcsfuse.py", line 3, in <module>
      from fuse import FUSE
  ImportError: No module named fuse
  "  
  timestamp:  "2018-10-25T15:41:53Z"  
}

It seems I'm trapped in a loop of required packages! Not sure if I missed anything here. Any thoughts?

MT467
  • ??? why neg point?? – MT467 Nov 01 '18 at 16:16
  • This seems familiar.... what is the command that leads to the error shown? – mdurant Nov 01 '18 at 19:12
  • For reference, fresh environment with py2 or 3, `pip install gcsfs` works fine, without need to explicitly install requirements first. – mdurant Nov 01 '18 at 19:22
  • @mdurant I posted a more general question related to dask: how to use pip on Google Composer? I don't want to directly install it on the Google Composer VM though – MT467 Nov 01 '18 at 21:33
  • Right, but we can't help you if we don't know what command is causing the error, and so we can't reproduce it ourselves. – mdurant Nov 01 '18 at 21:37
  • @mdurant I used the Google Composer UI (PyPI) to install gcsfs. Google Composer couldn't install it and threw an error! As simple as it seems, it's not working for me. I am using composer-1.0.0-airflow-1.9.0. – MT467 Nov 05 '18 at 16:43
  • "we don't know what command is causing the error" - still don't know. `pip install gcsfs` does not lead to `ImportError: No module named fuse`. – mdurant Nov 05 '18 at 16:45

1 Answer


You don't need to add google-cloud-storage to your PyPI packages; it's already installed. I ran a DAG (image-version: composer-1.3.0-airflow-1.10.0) logging the version of the pre-installed package, and it is 1.13.0.
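One way to log that version from inside a task, as a hedged sketch (pkg_resources is my choice here, not necessarily what the original check used):

import logging
import pkg_resources

# Log the installed version of the pre-installed google-cloud-storage package
logging.info(pkg_resources.get_distribution('google-cloud-storage').version)

I also added the following to my DAG, in order to replicate your case: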

import logging

import dask.dataframe as dd

def read_csv_dask():
    df = dd.read_csv('gs://gcs_path/data.csv')
    logging.info("csv from gs://gcs_path/ read alright")

Before anything else, I added the following dependencies via the UI:

dask==0.20.0
toolz==0.9.0
partd==0.3.9
cloudpickle==0.6.1

The corresponding task failed with the same message as yours ("Need to install gcsfs library for Google Cloud Storage support"), at which point I returned to the UI and attempted to add gcsfs==0.1.2. This never succeeded; however, I did not get the error you did. Instead, it repeatedly failed with "Composer Backend timed out".

At this point, you could consider the following alternatives:

1) Install gcsfs with pip in a BashOperator. This is not optimal, as you will be installing gcsfs every time the DAG runs; a sketch follows.
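A minimal sketch of that approach, assuming Airflow 1.x import paths (the DAG name, start date, and schedule here are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('install_gcsfs_example',
          start_date=datetime(2018, 10, 1),
          schedule_interval=None)

# Install gcsfs on the worker; downstream tasks that need it should depend on this task.
install_gcsfs = BashOperator(
    task_id='install_gcsfs',
    bash_command='pip install --user gcsfs',
    dag=dag)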

2) Use another library. What are you doing with this CSV? If you upload it to the gs://composer_gcs_bucket/data/ directory (check here), you can then read it using e.g. the standard csv library, like so:

import csv

def read_csv():
    # /home/airflow/gcs/data/ mirrors gs://<composer_gcs_bucket>/data/ on the workers
    with open('/home/airflow/gcs/data/data.csv', 'rb') as f:
        reader = csv.reader(f)
        rows = list(reader)
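
Wiring that function into a DAG could then look like this (a sketch assuming the standard PythonOperator; the dag object would be defined as in the snippet under option 1):

from airflow.operators.python_operator import PythonOperator

read_csv_task = PythonOperator(
    task_id='read_csv',
    python_callable=read_csv,
    dag=dag)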
Lefteris S