Here is another bug I found in Google Dataprep:

When the input is a sparse dataset (every other row is empty), Google Dataprep is not able to process any recipes on it.

The transformer page shows all the data in the initial sample, and all recipe transformations are shown as usual. However, running a job returns an empty set.

Taking a new random dataset sample instead of the initial sample also returns an empty dataset.

If anyone knows more about this issue, I would be much obliged!

Cheers, Bram

B Delfos
  • I would like to reproduce your issue. Could you provide some more details about your dataset and all the recipe transformations? Could you also share the corresponding Dataflow job ID? Some screenshots in the question would also be appreciated. Thanks! – Xiaoxia Lin Apr 10 '18 at 16:03
  • Hi Xiaoxia! The dataset had around 200000 rows with data, with one empty row every other row, which means that the final row was around row number 400000. No transformations were applied. However, when taking a new sample out of the data, the preview turned out empty. Even when running the initial sample, the output CSV file only had a header. I will look up the job ID. – B Delfos Apr 13 '18 at 15:08

1 Answer

I have tried to reproduce the issue without success, but I would still like to share my step-by-step test. Hopefully someone will find it useful.

  1. Writing a script to create a CSV file ('sparse_names.csv') with one empty row every other row.

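    import csv
    
    # newline='' lets the csv module control line endings (Python 3);
    # without it, some platforms write an extra blank line after each row.
    with open('sparse_names.csv', 'w', newline='') as csvfile:
        fieldnames = ['id', 'first_name', 'last_name', 'other']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
        writer.writeheader()
        for i in range(10000000):
            if i % 2 == 0:
                # Even rows carry data
                writer.writerow({'id': i, 'first_name': 'Baked', 'last_name': 'Beans', 'other': 'lululu'})
            else:
                # Odd rows are empty: every field is blank
                writer.writerow({'id': '', 'first_name': '', 'last_name': '', 'other': ''})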
    
  2. Uploading the file to GCS, and adding it to Dataprep from GCS.

  3. In the Initial Sample, I can see the first 658,831 rows.


  4. Selecting New Sample and using a quick scan to get a random sample.
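
For anyone comparing the source file against the CSV that a job writes, a minimal sketch like the one below counts data rows and empty rows. The function name `count_rows` is chosen here for illustration and is not part of Dataprep. It assumes the file has a header row and treats a row as empty when every field is blank.

```python
import csv


def count_rows(path):
    """Return (data_rows, empty_rows) for a CSV file, excluding the header."""
    data_rows = 0
    empty_rows = 0
    with open(path, newline='') as csvfile:
        reader = csv.reader(csvfile)
        next(reader, None)  # skip the header row, if there is one
        for row in reader:
            # A row is empty when it has no fields or every field is blank
            if any(field.strip() for field in row):
                data_rows += 1
            else:
                empty_rows += 1
    return data_rows, empty_rows
```

Calling `count_rows('sparse_names.csv')` on the file generated in step 1 should report equal numbers of data rows and empty rows. Running it on a downloaded job output shows whether that file really contains only a header.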

Xiaoxia Lin
  • Thanks for trying to reproduce the issue! The fact that you are not getting the same result makes me wonder if I might accidentally be using an older version of the Dataflow SDK... In my initial sample, Dataprep doesn't even show the empty rows and doesn't detect any empty rows. After creating a new sample, or running the set, the output is empty... – B Delfos Apr 20 '18 at 07:52
  • You can check the version of the Dataflow SDK on the Dataflow page. I was using "Google Cloud Dataflow SDK for Java 2.2.0". If you can provide a project ID and a job ID, I'll have a look. – Xiaoxia Lin Apr 23 '18 at 13:10