
I am trying to create a Dataflow job from a custom template and I am getting the error "Runnable workflow has no steps specified." The log has no information beyond this. Am I missing any steps?

I have created a virtual environment and executed the code below.

I feel I'm close. Any help with the error is appreciated.

The code is:

import argparse
import datetime
import io
import urllib.request
from zipfile import ZipFile, is_zipfile

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from google.cloud import storage


date = datetime.datetime.now()
#monthname = date.strftime("%B")
monthname = 'September'
#monthno = date.strftime("%m")
monthno = '9'
yearname = date.strftime("%Y")

print(monthname)
print(monthno)
print(yearname)

url = 'https://www.abc.gov/files/zip/statecontract-'+monthname+'-'+yearname+'-employee.zip'

destination_zip_name = 'upload.zip'

def upload_blob(bucket_name, url, destination_zip_name,argv=None, save_main_session=True):

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        help='Input file to process.')
    parser.add_argument(
        '--output',
        dest='output',
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as p:
        storage_client = storage.Client()
        source_bucket = storage_client.get_bucket(bucket_name)
        print('source bucket - ',source_bucket)
        destination_bucket_name = storage_client.get_bucket(bucket_name)  # the same bucket is used as the destination
        print('destination bucket - ',destination_bucket_name)


        my_file = urllib.request.urlopen(url)
        blob1 = source_bucket.blob(destination_zip_name)
        blob1.upload_from_string(my_file.read(), content_type='application/zip')

        destination_blob_pathname = destination_zip_name
        print('destination_blob_pathname - ',destination_blob_pathname)

        blob = source_bucket.blob(destination_blob_pathname)
        zipbytes = io.BytesIO(blob.download_as_string())
        

        if is_zipfile(zipbytes):
            with ZipFile(zipbytes, 'r') as myzip:
                for contentfilename in myzip.namelist():
                    contentfile = myzip.read(contentfilename)
                    #print('contentfile - ',contentfile)
                    
                    # extract only the CSV members; leave this check out if you don't need it
                    if '.csv' in contentfilename.casefold():
                        output_file = f'/tmp/{contentfilename.split("/")[-1]}'
                        print('output_file - ',output_file)
                        outfile = open(output_file, 'wb')
                        outfile.write(contentfile)
                        outfile.close()

                        # slice off the ".zip" suffix (rstrip removes characters, not a suffix)
                        blob = source_bucket.blob(
                            f'{destination_zip_name[:-len(".zip")]}/{contentfilename}'
                        )
                        with open(output_file, "rb") as my_csv:
                            blob.upload_from_file(my_csv)

                        

        blob1.delete()                
        print('done running function')


if __name__ == '__main__':
    upload_blob('testbucket', url, destination_zip_name)
  • Why do you need Apache Beam? I don't really understand what you are trying to do, but your Pipeline `p` is not used and you don't have any PCollections initialized. If you want to read files from Google Cloud Storage with Apache Beam, you should use the IO provided, for example `apache_beam.io.gcp.gcsio.GcsIO.open("gcs://yourbucket/folder")` (see the sketch after these comments). – Dev Yns Oct 13 '22 at 01:43
  • Thank you for your response. Actually, I'm new to GCP Dataflow and built the script from the Google docs. If I go and make changes as per your suggestion, it will take time and may mess up the code. I would appreciate it if you could make the changes in the above code and share them. – Abhishek Boga Oct 13 '22 at 04:11
  • What the code does is: it downloads a zip file from the provided website into a GCS bucket, unzips it and extracts only the CSV file, and then copies it into the GCS bucket. After copying, it deletes the zip file. That's it. – Abhishek Boga Oct 13 '22 at 04:28
  • Is your job taking a long time to respond? – Mazlum Tosun Oct 13 '22 at 06:38
  • @MazlumTosun Didn't get your question. Basically there is no issue with the job response time. – Abhishek Boga Oct 13 '22 at 14:22
  • Can someone please update me on the issue? Thank you in advance. – Abhishek Boga Oct 13 '22 at 14:57
  • Agree with other answers here. If you actually need to run a Beam pipeline to process your data, you need to read your data into a PCollection and process using Beam PTransforms. Please see here for some Beam Python example pipelines - https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples – chamikara Oct 13 '22 at 20:46
  • Yes, try to use a service other than `Beam` in this case. Can you use a service other than `Beam`? – Mazlum Tosun Oct 14 '22 at 10:19
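
As the comments point out, Dataflow reports "Runnable workflow has no steps specified." when the submitted job graph contains no transforms, which is what happens when nothing is applied to the pipeline object `p`. Below is a minimal sketch of a pipeline that does have steps; it assumes the CSV has already been landed in GCS, and the bucket paths, transform labels, and file names are placeholders rather than anything from the original post:

import argparse

import apache_beam as beam
from apache_beam.io import ReadFromText, WriteToText
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', default='gs://yourbucket/upload/employee.csv')  # placeholder path
    parser.add_argument('--output', default='gs://yourbucket/output/employee')     # placeholder prefix
    known_args, pipeline_args = parser.parse_known_args(argv)

    with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
        # each labelled transform below becomes a "step" in the Dataflow job graph
        (p
         | 'ReadCsv' >> ReadFromText(known_args.input, skip_header_lines=1)
         | 'DropEmptyLines' >> beam.Filter(lambda line: line.strip())
         | 'WriteOut' >> WriteToText(known_args.output, file_name_suffix='.csv'))


if __name__ == '__main__':
    run()

Whether this structure fits the actual requirement is a separate question; the point is only that transforms must be applied to `p` for the job to have steps.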

1 Answer


I think you don't need Apache Beam for this use case. You never even use the pipeline variable `p`, and you don't have any PCollection initialized.

Instead, you could put your code in a Cloud Function and run it there; a minimal sketch of that approach follows. It should work.
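
The sketch below assumes an HTTP-triggered Python Cloud Function; the bucket name, object prefix, and URL are placeholders, and this version streams the zip through memory instead of first copying it into the bucket:

# main.py for an HTTP-triggered Cloud Function (sketch; names are placeholders)
import io
import urllib.request
from zipfile import ZipFile, is_zipfile

from google.cloud import storage

BUCKET_NAME = 'testbucket'  # placeholder bucket
ZIP_URL = 'https://www.abc.gov/files/zip/statecontract-September-2022-employee.zip'  # placeholder URL


def unzip_to_gcs(request):
    """Download the zip and copy only the CSV members into the bucket."""
    bucket = storage.Client().get_bucket(BUCKET_NAME)

    # pull the archive straight into memory
    zipbytes = io.BytesIO(urllib.request.urlopen(ZIP_URL).read())
    if not is_zipfile(zipbytes):
        return 'not a zip file', 400

    with ZipFile(zipbytes, 'r') as archive:
        for name in archive.namelist():
            if name.casefold().endswith('.csv'):
                bucket.blob(f'upload/{name}').upload_from_string(
                    archive.read(name), content_type='text/csv')
    return 'done'

The entry point would be set to unzip_to_gcs when deploying, and the function could just as well be triggered by Cloud Scheduler if the download has to run on a schedule.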

Dev Yns
  • How do I make use of the pipeline variable p and a PCollection in the code to make it work? I agree I can put it in a Cloud Function and run it, but that's not within the scope of the requirement. I think if I actually use p and a PCollection it should work. – Abhishek Boga Oct 14 '22 at 14:33
  • You said that your code "downloads a zip file from the provided website into a gcs bucket. It unzips and extracts only the CSV file and then copies it into the gcs bucket. After copying, it deletes the zip file.", so what would be the data in your PCollection? You are trying to manage files in GCS: unzip an archive, copy only the CSVs back to GCS, and finally delete the zip. That is outside the scope of what Apache Beam is meant for. – Dev Yns Oct 14 '22 at 14:54
  • Actually, my goal was to create a Dataflow job with a Python script. I wrote the logic as mentioned above and it ran properly in Dataflow under a virtual environment. Later I faced an issue while creating the Dataflow job: the error was "Unable to parse template file 'gs://ai-datascience/script/gcs_orc_snowflake.py'". I googled for a resolution and came across many posts that mentioned Apache Beam code. – Abhishek Boga Oct 14 '22 at 20:53
  • I even raised an issue with Google, and they also told me to refer to the WordCount example and follow the overall structure of that code. That is why I added the Apache Beam code and tried it as a resolution, but no luck. So basically, what should I code so that it creates a Dataflow job? – Abhishek Boga Oct 14 '22 at 20:56
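
For what it's worth, the "Unable to parse template file" error in the last comments usually points at the template path being the raw .py source. A classic Dataflow template is the JSON job graph that the SDK writes to template_location when the pipeline is run with the DataflowRunner; the "create job from custom template" step then has to point at that staged JSON file, not at the Python script. A sketch of setting those options programmatically, with placeholder project and bucket values:

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions,
    PipelineOptions,
    StandardOptions,
)

options = PipelineOptions()
gcp_options = options.view_as(GoogleCloudOptions)
gcp_options.project = 'your-project-id'                   # placeholder
gcp_options.region = 'us-central1'                        # placeholder
gcp_options.staging_location = 'gs://yourbucket/staging'  # placeholder
gcp_options.temp_location = 'gs://yourbucket/temp'        # placeholder
# the SDK writes the template (a JSON job graph) to this path instead of launching a job
gcp_options.template_location = 'gs://yourbucket/templates/unzip_template'
options.view_as(StandardOptions).runner = 'DataflowRunner'

with beam.Pipeline(options=options) as p:
    # at least one transform has to be applied, otherwise the staged template
    # has no steps and Dataflow rejects it with the error from the question
    _ = p | 'Create' >> beam.Create(['placeholder']) | 'Log' >> beam.Map(print)

Creating the job in the console would then reference gs://yourbucket/templates/unzip_template (the staged JSON), not the .py file.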