1

I am a new user of Google's Datalab. I am evaluating the possibility of using Datalab for a production data pipeline, which means I want to run my data crunching as a Python program, not as an interactive notebook. Datalab seems to be designed for interactive Jupyter notebook usage, but I remember seeing a screen during the signup process that said users can run their entire data pipelines. The documentation, however, only talks about interactive data analysis and makes no mention of programmatic use. I also want to run the script periodically from a cron job. So I am looking for two things: 1) How do I run a Python script on Datalab? 2) How can I run it periodically as a cron job?

It would be very helpful if anyone could shed some light on this. Thanks in advance!

  • It is a broad area. Please ask about a specific problem you have tried to solve or a specific issue you are running into. – SkyWalker Mar 10 '16 at 02:26

2 Answers

1

Just because something is (technically) possible does not make it a good idea.

As @Anthonios mentioned:

it is not possible to customize datalab with extra python modules through a supported method. 

Your requirements can be easily achieved by combining other (Google) cloud 'building blocks'.

Example 1, for streaming data:

  • PubSub > DataFlow[1] > Cloud Storage or BigQuery > DataLab[2]
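
For reference, here is a minimal sketch of what the Dataflow step could look like once the Python SDK (now Apache Beam) is available; the project, topic and table names are placeholders, and the pipeline assumes the BigQuery table already exists:

# Streaming pipeline sketch: read JSON messages from Pub/Sub and append them to BigQuery.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions(
    project='my-project',              # placeholder project id
    runner='DataflowRunner',
    temp_location='gs://my-bucket/tmp',
)
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
     | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
           'my-project:analytics.events',
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))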

Example 2, scheduled batch processing:

  • Start Docker Container (scheduled) > Container runs your processing scripts & stops when finished > Cloud Storage, Cloud SQL or BigTable > DataLab[2]
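
As a sketch of the batch variant, the container's entry point could be a plain Python script that does the processing, writes its output to Cloud Storage and exits; this assumes the google-cloud-storage client library, and the bucket/object names are placeholders:

# Batch job sketch: crunch data, write the result to Cloud Storage, then exit
# so the (scheduled) container can stop.
import datetime
import json

from google.cloud import storage


def run_batch_job():
    # ... your actual data crunching goes here ...
    results = {'processed_at': datetime.datetime.utcnow().isoformat(),
               'row_count': 0}

    client = storage.Client()
    bucket = client.bucket('my-results-bucket')          # placeholder bucket
    blob = bucket.blob('daily/results.json')
    blob.upload_from_string(json.dumps(results), content_type='application/json')


if __name__ == '__main__':
    run_batch_job()

Datalab (or anything else) can then read the output from Cloud Storage when someone actually needs to do interactive analysis on it.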

There is no single solution when drawing a cloud architecture; it all depends on your use case.

But your current approach (although possible) looks like an abuse of DataLab, unless you have a REALLY strong argument to do so.


  1. Python API in the making

  2. DataLab is only needed if an end user needs to perform interactive data analysis.

0

As answered in this Stack Overflow post, it is not possible to customize datalab with extra python modules through a supported method. My suggestion would be to install the python script/cron job on another system outside of datalab, as you would with any python script that you want to run unrelated to datalab.
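
For the scheduling part, a plain crontab entry on that other system is usually enough, e.g. something like 0 3 * * * /usr/bin/python /opt/pipeline/nightly_job.py (path and schedule are placeholders). The script itself can use the standard google-cloud client libraries instead of the datalab-specific modules; a minimal sketch, assuming google-cloud-bigquery and a placeholder table:

# nightly_job.py (hypothetical name): standalone script suitable for cron,
# using the regular BigQuery client library rather than datalab's modules.
from google.cloud import bigquery


def main():
    client = bigquery.Client()
    query = 'SELECT COUNT(*) AS n FROM `my-project.analytics.events`'
    for row in client.query(query).result():
        print('row count:', row.n)


if __name__ == '__main__':
    main()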

Really Long Side Note: If you have to run the program within the datalab container because you want to make use of the datalab-specific gcp libraries, then I propose the following unsupported (yet creative) setup that has worked for me. Note, however, that it involves running a local datalab container as well as a cloud datalab container.

  1. Install datalab locally
  2. Append the following to the Dockerfile.in file at

$REPO_DIR/containers/datalab/Dockerfile.in

# Add a custom script which calls a custom program (python file)
ADD mycustomprogram.sh /usr/local/bin/mycustomprogram.sh

# Allow the script to be executed
RUN chmod +x /usr/local/bin/mycustomprogram.sh
  3. Modify the ENTRYPOINT variable in $REPO_DIR/containers/datalab/run.sh to point to your custom script

Now you have a custom script running inside the datalab local container.

With the local setup, you can still commit to the same Google-hosted git repository using any git client from your host machine. gcloud has a simple prompt that will guide you through cloning that repository.

Simply run gcloud init.

After signing in, you should see the following prompt which asks you whether you want to use a Google hosted repository:

Do you want to use Google's source hosting (Y/n)?

IMPORTANT NOTE: This is only a temporary workaround while we wait for additional datalab customization options. I would much prefer to edit the cloud Dockerfile.in file, rather than deploy a local datalab instance, in order to install a custom python program.

Anthonios Partheniou