I am trying to load gigabytes of data from Google Cloud Storage or Google BigQuery into a pandas DataFrame so that I can run scikit-learn's OneClassSVM and Isolation Forest (or any other unary or PU classification). So I tried pandas-gbq,
but attempting to run
pd.read_gbq(query, project_id='my-super-project', dialect='standard')
causes my machine to SIGKILL the process when it is only about 30% complete. Loading the data locally first is not an option either: my machine does not have enough space, nor does that sound reasonably efficient.
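What I was hoping for is something that reads the query result in batches instead of all at once. Below is a rough sketch of what I mean (the dataset/table name and page size are placeholders, and I am not sure converting rows one by one is efficient), but I do not know if this is even the right direction:

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client(project='my-super-project')

# Placeholder query; the real one selects gigabytes of rows
query = 'SELECT * FROM `my-super-project.my_dataset.my_table`'

# Fetch the result one page at a time instead of materializing everything
rows = client.query(query).result(page_size=50000)

for page in rows.pages:
    # Each page becomes a small DataFrame that fits in memory
    chunk = pd.DataFrame([dict(row) for row in page])
    # ...incrementally process / train on the chunk here...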
I have also tried
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
blob = bucket.get_blob('remote/path/to/file.txt')
# download_as_string() pulls the entire object into memory at once
print(blob.download_as_string())
With this I can load about 1/10 or 1/5 of my data, but then my machine eventually tells me that it ran out of memory.
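Ideally I would stream the blob instead of downloading it in one go. If I understand correctly, newer versions of google-cloud-storage expose blob.open() as a file-like object, so something like the sketch below might work for a CSV (the file path and chunk size are placeholders), but even then I still need somewhere to actually run the training:

from google.cloud import storage
import pandas as pd

client = storage.Client()
bucket = client.get_bucket('bucket-id-here')
blob = bucket.get_blob('remote/path/to/file.csv')  # placeholder CSV path

# Stream the blob as text and let pandas read it in bounded chunks
with blob.open('r') as f:
    for chunk in pd.read_csv(f, chunksize=100000):
        # Each chunk is a DataFrame small enough to fit in memory
        pass  # ...incremental processing here...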
TL;DR: Is there a way to run my custom code (using NumPy, pandas, and even TensorFlow) in the cloud or on some faraway supercomputer, where I can easily and efficiently load data from Google Cloud Storage or Google BigQuery?