
I am trying to load gigabytes of data from Google Cloud Storage or Google BigQuery into a pandas DataFrame so that I can attempt to run scikit-learn's OneClassSVM and Isolation Forest (or any other unary or PU classification). So I tried pandas-gbq, but attempting to run

pd.read_gbq(query, 'my-super-project', dialect='standard')

gets SIGKILLed by my machine when it's only 30% complete. I also can't load the data locally: my machine does not have enough disk space, nor does that sound reasonably efficient.

I have also tried

from google.cloud import storage
client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
# Then do other things...
blob = bucket.get_blob('remote/path/to/file.txt')
print(blob.download_as_string())

with which I can load 1/10 or 1/5 of my available data, but then my machine eventually runs out of memory.

TLDR: Is there a way that I can run my custom code (with numpy, pandas, and even TensorFlow) in the cloud, or on some faraway supercomputer, where I can easily and efficiently load data from Google Cloud Storage or Google BigQuery?

Flair
  • Unfortunately, I can't find a `read_gbq()` function, however if you download the file locally, you could try [dask.dataframe](http://dask.pydata.org/en/latest/dataframe.html), which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue. – David Duffrin Aug 11 '17 at 19:35
  • @DavidDuffrin I can't download because my machine does not have enough hard drive space. – Flair Aug 11 '17 at 20:04
  • Could you upload the file(s) to [AWS's EMR](https://aws.amazon.com/emr/) and manipulate the data with something like [PySpark](https://spark.apache.org/docs/0.9.0/python-programming-guide.html)? I have used Hadoop in the past for similar "big data" applications. – David Duffrin Aug 11 '17 at 20:08
  • Is [Cloud Dataflow](https://cloud.google.com/dataflow/) an option? Trying to ship the data to AWS doesn't sound like a good solution. – Elliott Brossard Aug 11 '17 at 20:42
  • How big is the data? If it's 20GBs, you can start a GCE machine with lots of memory and download it there. If it's 1TB, you need to consider a different option from loading the whole thing into memory (which pandas requires) – Maximilian Aug 11 '17 at 22:35
  • @ElliottBrossard that can be an option – Flair Aug 13 '17 at 01:24
  • @Maximilian the data is more like 100s of gbs – Flair Aug 13 '17 at 01:24
  • Then you need to use something other than pandas / in memory store, unless you want to deal with material complexity – Maximilian Aug 13 '17 at 01:28
  • @Maximilian Do you have any specific recommendations? – Flair Aug 13 '17 at 02:18
  • I don't know those algos; you need to find implementations that can operate on large datasets out of memory – Maximilian Aug 13 '17 at 18:04

1 Answer


I don't think you're going in quite the right direction. I'll try to explain how I usually work with data, and hopefully that gives you some insight.

I first tend to work with small datasets, either by applying some sampling technique or by querying for fewer days. At this stage, it's fine to use pandas or other tools developed for small data to build models, compute some statistics, find moments, and so on.
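For example, a rough sketch of that sampling step (the dataset and table names are placeholders for your own, and the 1% `RAND()` filter is just one way to sample):

import pandas as pd

# Pull a small random sample (~1% of rows) into pandas for exploration.
# Replace the table reference with your own dataset/table.
query = """
SELECT *
FROM `my-super-project.my_dataset.my_table`
WHERE RAND() < 0.01
"""
sample_df = pd.read_gbq(query, project_id='my-super-project', dialect='standard')
print(sample_df.describe())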

Once I've gotten acquainted with the data, I start working with big-data tools.

Specifically, I have a very small Dataproc cluster where I've already set up a Jupyter notebook to run PySpark code.

The total memory of your cluster will have to surpass the total size of the data you are using as input.

Moving between pandas and Spark DataFrames should be straightforward for you; as you can see in this blog post by Databricks, Spark already offers this feature.
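A minimal sketch of that interop, assuming a running SparkSession named `spark` (which Dataproc notebooks typically provide) and that anything you convert back to pandas is small enough for the driver:

import pandas as pd

# pandas -> Spark: distribute a small local DataFrame across the cluster
pandas_df = pd.DataFrame({'sku': ['a', 'b'], 'score': [0.1, 0.9]})
spark_df = spark.createDataFrame(pandas_df)

# Spark -> pandas: only do this on small (sampled or aggregated) results,
# since toPandas() collects everything onto the driver
small_pdf = spark_df.limit(1000).toPandas()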

After that comes implementing the algorithms. Spark already offers some built-in algorithms out of the box that you can play around with.
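As far as I know there is no built-in one-class SVM or isolation forest in Spark MLlib, but the built-ins all follow the same pattern. Here is a minimal sketch with a hypothetical Spark DataFrame `df` holding numeric columns `x1` and `x2`; k-means is just a stand-in (e.g. distance to cluster centers as a crude anomaly score):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# pyspark.ml estimators expect a single vector column of features
assembler = VectorAssembler(inputCols=['x1', 'x2'], outputCol='features')
features_df = assembler.transform(df)

# Fit one of the built-in algorithms on the distributed data
kmeans = KMeans(k=10, seed=42, featuresCol='features')
model = kmeans.fit(features_df)

# transform() adds a "prediction" column with the assigned cluster
clustered_df = model.transform(features_df)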

If the algorithms you want are not available, you can either file a request in their repository or build them yourself (you can use the Python implementation in SciPy or scikit-learn as a guide and transpose it to the Spark environment).

Here's an example of how I load data for one of the algorithms I use to build a recommender system for our company:

from pyspark.sql import functions as sfunc
from pyspark.sql import types as stypes

# Define the schema of the CSV columns up front instead of inferring it
schema = (stypes.StructType()
          .add("fv", stypes.StringType())
          .add("sku", stypes.StringType())
          .add("score", stypes.FloatType()))

# Read gzipped CSV files straight from Google Cloud Storage into a
# distributed DataFrame
train_df = spark.read.csv('gs://bucket_name/pyspark/train_data*.gz',
                          header=True, schema=schema)

Spark will automatically distribute this data across the different workers available in your cluster. After that I mainly run queries and map/reduce steps to get correlations between SKUs.
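One of those aggregation steps might look like this (the specific grouping is just an illustration; `sfunc` is the alias imported above):

# Summary statistics per SKU, computed across the cluster
per_sku = (train_df
           .groupBy('sku')
           .agg(sfunc.avg('score').alias('avg_score'),
                sfunc.count('fv').alias('n_rows')))
per_sku.show(10)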

As for keeping your current code, it probably won't scale to big data as it is. You can nevertheless find lots of resources for combining the power of numpy with Spark, as in this example for instance.
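One common pattern (again, just a sketch) is to push the NumPy work down to the workers with mapPartitions, so only small results ever reach the driver; the z-score here is only a placeholder computation:

import numpy as np

def zscore_partition(rows):
    # Run NumPy over one partition's worth of rows at a time
    scores = np.array([row['score'] for row in rows if row['score'] is not None],
                      dtype=float)
    if scores.size == 0:
        return []
    return ((scores - scores.mean()) / (scores.std() + 1e-9)).tolist()

z_rdd = train_df.rdd.mapPartitions(zscore_partition)
print(z_rdd.take(10))  # only pull a small sample back to the driver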

Willian Fuks