
I'm using Google Cloud Datalab for my ML project. One of my datasets is in a BigQuery table with millions of records (text data) and many columns. I created a pandas dataframe from the BigQuery table, converted it to a Dask dataframe (with 5 partitions), and performed data wrangling.

Now I have this Dask dataframe and I want to either store it in BigQuery or convert it into Parquet files and store them in my GCP storage bucket. It would be great to hear options from the community. Thanks.
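For context, a minimal sketch of the pipeline described above, assuming the table is read with pandas-gbq (the project and table names are placeholders; Datalab's own BigQuery client would work just as well):

    import pandas as pd
    import dask.dataframe as dd

    # Pull the BigQuery table into pandas (one common route is pandas-gbq);
    # 'my-project' and 'mydataset.mytable' are placeholder names.
    pdf = pd.read_gbq("SELECT * FROM `mydataset.mytable`", project_id="my-project")

    # Convert to a Dask dataframe with 5 partitions for the wrangling step.
    ddf = dd.from_pandas(pdf, npartitions=5)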

  • Could you elaborate on where you are stuck? As I understand your question, you have a big dataframe and want to store it in the cloud or in files; I think this falls under an opinion-based question. – Mohamed Thasin ah Feb 14 '19 at 10:12
  • Yes, you are right. It is a big (Dask) dataframe and I want to store it in Google Cloud Storage. I'm new to asking questions in this community, so I'm not sure what counts as an opinion-based question. – Eswara Babu Feb 14 '19 at 10:52
  • Stack Overflow is for developers who get stuck in any programming language. We are here to help solve or give guidance on specific programming issues. If you tried to upload your file to the cloud and got stuck at some stage, you can ask a question explaining your problem, what you have tried, where you got stuck, etc. Before asking, read https://stackoverflow.com/help/how-to-ask; it will be helpful to you. Have a good day :) – Mohamed Thasin ah Feb 14 '19 at 11:01

1 Answer


As the comments mention, this is too much of a "how do I..." question.

However, the simple answer is

df.to_parquet('gcs://mybucket/mypath/output.parquet')

You will need one of the Parquet backends installed (fastparquet or pyarrow) as well as gcsfs. Additional parameters for gcsfs may be required to get the right permissions; these can be passed via the keyword storage_options={...} (see the gcsfs docs).
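For example, a minimal sketch of passing credentials explicitly, assuming authentication with a service-account key file; the bucket name, output path, and key-file path are placeholders:

    # 'df' is the Dask dataframe to be written (as in the line above).
    # The token value is a placeholder path to a GCP service-account key;
    # gcsfs also accepts other token values such as 'cloud' or 'browser'.
    df.to_parquet(
        'gcs://mybucket/mypath/output.parquet',
        engine='pyarrow',
        storage_options={'token': '/path/to/service-account-key.json'},
    )

If the Datalab environment is already authenticated with default Google credentials, gcsfs may pick them up automatically, in which case storage_options can often be omitted.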

General information: http://docs.dask.org/en/latest/remote-data-services.html

mdurant