
I have been trying to run a Cloud Dataprep flow that takes files from Google Cloud Storage.

The files on Google Cloud Storage get updated daily, and there are more than 1,000 files in the bucket right now. However, I am not able to fetch more than 1,000 files from the bucket.

Is there any way to get all of the data from Cloud Storage? If not, is there an alternative way to achieve this?

    How are you fetching those files? – Miguel Nov 19 '19 at 11:06
  • 1
    After some search, it seems this is a known limitation. I don't know to pass through. – guillaume blaquiere Nov 19 '19 at 12:17
  • It is indeed, if he's using the XML API. I looked into a [similar case](https://stackoverflow.com/a/58058079/8905352) a couple of months ago, and the only workaround right now would be using the JSON API, as I explained in that answer. – Miguel Nov 19 '19 at 13:07
  • I am just fetching these files through the dataset import page in the Google Dataprep UI. @Miguel thanks, but could you please elaborate on how I can use these APIs exactly? – Abhinav Kumar Nov 19 '19 at 14:59
  • [Here](https://cloud.google.com/storage/docs/apis) you'll find the documentation for both APIs. However, if you have more than 1,000 objects in GCS, I recommend using the JSON API (see the listing sketch after these comments). – Miguel Nov 20 '19 at 15:33
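The comments above suggest the JSON API because the XML API listing is where the 1,000-object ceiling tends to bite. As a hedged illustration (not part of the original question), here is a minimal Python sketch using the `google-cloud-storage` client, which talks to the JSON API and follows page tokens automatically; the bucket name `my-bucket` is a placeholder:

```python
# Minimal sketch: list every object in a bucket via the JSON API.
# The client iterator fetches subsequent pages (pageToken) under the hood,
# so listings of more than 1,000 objects are handled transparently.
from google.cloud import storage


def list_all_objects(bucket_name, prefix=None):
    """Yield the name of every object in the bucket, optionally under a prefix."""
    client = storage.Client()
    for blob in client.list_blobs(bucket_name, prefix=prefix):
        yield blob.name


if __name__ == "__main__":
    names = list(list_all_objects("my-bucket"))  # placeholder bucket name
    print(f"Found {len(names)} objects")
```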

1 Answer


You can load a large number of files using the + button next to a folder in the file browser. This will load all the files in that folder (or, more precisely, under that prefix) when running a job on Dataflow.

[Screenshot: Create dataset button]

There is, however, a limit when browsing or using the parameterization feature. Some users might have millions of files, and searching among all of them is not possible (GCS only allows filtering by prefix).

See the limitations on that page for more details: https://cloud.google.com/dataprep/docs/html/Import-Data-Page_57344837
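To make the "GCS only allows filtering by prefix" point concrete, here is a hedged sketch (not part of the original answer) of the raw JSON API call behind that filtering: `objects.list` accepts only a literal name prefix plus a page token, so arbitrary searches across millions of objects are not possible. `my-bucket` and the `daily/2019-11-` prefix are placeholder values:

```python
# Sketch of the JSON API objects.list call with prefix filtering and paging.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/devstorage.read_only"]
)
session = AuthorizedSession(credentials)

url = "https://storage.googleapis.com/storage/v1/b/my-bucket/o"  # placeholder bucket
params = {"prefix": "daily/2019-11-", "maxResults": 1000}        # placeholder prefix

# Follow nextPageToken until the listing is exhausted; each page holds
# at most 1,000 items regardless of maxResults.
while True:
    resp = session.get(url, params=params).json()
    for item in resp.get("items", []):
        print(item["name"])
    token = resp.get("nextPageToken")
    if not token:
        break
    params["pageToken"] = token
```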

Hugues