
I'm trying to automate some data cleaning tasks by uploading the files to Cloud Storage, running them through a pipeline, and downloading the results.
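
For reference, the upload and download steps themselves are simple; here is a minimal sketch using the google-cloud-storage client (the folder names match the paths in the job body below):

from google.cloud import storage

def uploadFile(bucket, fileName):
    #upload the raw CSV into the folder the pipeline reads from
    client = storage.Client()
    blob = client.bucket(bucket).blob('me@myemail.com/RawUpload/' + fileName)
    blob.upload_from_filename(fileName)

def downloadResult(bucket, fileName):
    #download the cleaned output once the job has finished
    client = storage.Client()
    blob = client.bucket(bucket).blob('me@myemail.com/CleanData/' + fileName)
    blob.download_to_filename(fileName)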

I have created the template for my pipeline to execute using the GUI in Dataprep, and am attempting to automate the upload and execution of the template using the Google Client Libraries, specifically in Python.

However, I have found that when I run the job with the Python script, the full template is not executed: sometimes some of the steps aren't completed, and sometimes the output file, which should be megabytes in size, is less than 500 bytes. The behavior depends on the template I use; each template has its own issue.

I've tried breaking the large template into smaller templates to apply consecutively so I could see where the issue is, and that is how I discovered that each template has its own issue. I have also tried creating the job from the Dataflow Monitoring Interface, and anything created that way runs perfectly, which means there must be some issue with the script I've created.

from oauth2client.client import GoogleCredentials
from googleapiclient.discovery import build

def runJob(bucket, template, fileName):
    #open a connection with application-default credentials
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)

    #name job after file being processed
    jobName = fileName.replace('.csv', '')
    projectId = 'my-project'

    #find the template to run on the dataset
    templatePath = "gs://{bucket}/me@myemail.com/temp/{template}".format(bucket = bucket, template=template)
    #construct the job JSON
    body = {
        "jobName": jobName,
        "parameters": {
            "inputLocations": "{\"location1\":\"gs://" + bucket + "/me@myemail.com/RawUpload/" + fileName + "\"}",
            "outputLocations": "{\"location1\":\"gs://" + bucket + "/me@myemail.com/CleanData/" + fileName.replace('.csv', '_auto_delete_2') + "\"}"
        },
        "environment" : {
            "tempLocation":"gs://{bucket}/me@myemail.com/temp".format(bucket = bucket),
            "zone":"us-central1-f"
        }
    }
    #create and execute the launch request
    request = service.projects().templates().launch(projectId=projectId, gcsPath=templatePath, body=body)
    response = request.execute()
    #notify user
    print(response)
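
One way to check whether the launched job actually reaches a terminal state is to poll it with the same service; a rough sketch against the v1b3 API (the jobId comes from response['job']['id'], and waitForJob is a hypothetical helper):

import time

def waitForJob(projectId, jobId):
    #poll the Dataflow job until it reaches a terminal state
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    terminal = ('JOB_STATE_DONE', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED')
    while True:
        job = service.projects().jobs().get(projectId=projectId, jobId=jobId).execute()
        if job.get('currentState') in terminal:
            return job['currentState']
        time.sleep(30)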

Using the JSON format, the parameters I pass are the same as the ones I enter in the Monitoring Interface. This tells me that either something is happening in the background of the Monitoring Interface that I am unaware of, and thus am not including, or there is an issue with the code I have created.
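
One thing that might expose what the Monitoring Interface does behind the scenes is the template's own metadata, which lists the parameters the template declares; a sketch using the v1b3 templates().get call with the METADATA_ONLY view (getTemplateMetadata is a hypothetical helper):

def getTemplateMetadata(projectId, templatePath):
    #read the parameter metadata stored alongside the template
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    request = service.projects().templates().get(
        projectId=projectId, gcsPath=templatePath, view='METADATA_ONLY')
    return request.execute().get('metadata', {})

Comparing the declared parameters against the "parameters" dict above would show whether the GUI is filling in anything extra.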

As I said above, the issue varies depending on the template I try to run, but the most common symptom is the extremely small output file. The output will be orders of magnitude smaller than it should be, because it contains only the CSV headers and a few scattered values from the first row of the data, and it isn't even formatted correctly as a CSV in the first place.

Does anyone know what I'm missing or recognize what I'm doing wrong?
