
I've been stuck for a few days. My problem is: I built a data pipeline using Apache Beam with the Dataflow runner. I use a global variable (a dictionary) in the script so it can be accessed by several functions. When I run the pipeline with roughly 200,000 rows of data, it succeeds both locally and on Dataflow. But when I run it on Dataflow with a dataset containing 6,000,000 rows, the dictionary ends up empty. Here is my code:

The function:

import json

# Module-level lookup shared by the functions below:
# {(trans_number, seq): [dorder_ordertxt, dorder_upref], ...}
pre_compute = {}

def compute_all_upref_and_ordertxt(data):
    '''
    Compute all dorder_txt and dorder_upref
    '''
    trans_number = data.get("transaction_number")
    seq = data.get("stdetail_seq")

    # get and remove ordertxt and upref from data
    ordertxt = data.pop("dorder_ordertxt","")
    upref = data.pop("dorder_upref","")

    
    global pre_compute
    if (trans_number, seq) not in pre_compute:
        pre_compute[(trans_number, seq)] = [ordertxt, upref]

    else:
        if ordertxt:
            pre_compute[(trans_number, seq)][0] = ordertxt
        if upref:
            pre_compute[(trans_number, seq)][1] = upref

    return data # -> data with no upref and ordertxt

def evaluate_and_inject_upref_ordertxt(data):
    # json.loads() is roughly 4-6x faster than eval()
    data = data.strip("\n")
    data = data.replace("'", '"')        # Python-style quotes -> JSON double quotes
    data = data.replace("None", "null")  # Python None -> JSON null
    data = json.loads(data)              # str -> dict

    trans_number = data.get('transaction_number')
    seq = data.get('stdetail_seq')

    global pre_compute
    ordertxt, upref = pre_compute[(trans_number, seq)]
    data['dorder_ordertxt'] = ordertxt
    data['dorder_upref'] = upref
    return data

The pipeline code:

left_join_std_dtdo = (
    join_stddtdo_dict
    | 'Left Join STD DTable DOrder' >> Join(
        left_pcol_name=stdbsap_dimm_data, left_pcol=left_join_std_bsap,
        right_pcol_name=dtdo_data, right_pcol=left_join_dtdo,
        join_type='left', join_keys=join_keys)
    | 'UPDATE PRICE FOR SCCRM01' >> beam.ParDo(update_price_sccrm01())
    | 'REMOVE PRICE from DICTIONARY' >> beam.ParDo(remove_dtdo_price())
    | 'PreCompute All Upref and ordertxt based on trans_number and seq' >> beam.Map(compute_all_upref_and_ordertxt)
)

rm_left_std_dtdo = (
    left_join_std_dtdo
    | 'CHANGE JOINED STD DTDO INTO STR' >> beam.Map(str)
    | 'DISTINCT STD DTDO' >> beam.Distinct()
    | 'EVALUATE AND INJECT AS DICT STD DTDO' >> beam.Map(evaluate_and_inject_upref_ordertxt)
    | 'Adjust STD_NET_PRICE WITH DODT_PRICE' >> beam.ParDo(replaceprice())
)

It runs perfectly both locally and on Dataflow with 200,000 rows of data. But when I tried 6,000,000 rows on Dataflow, the line

ordertxt, upref = pre_compute[(trans_number, seq)]

always raises a KeyError, as if the dictionary were empty. Any solutions?

  • You can't have global variables in Beam code. Check [this](https://stackoverflow.com/questions/44432556/is-there-anyway-to-share-stateful-variables-in-dataflow-pipeline) answer. – Praneeth Peiris Aug 18 '21 at 09:18
  • Thanks, I have read that link before. What confuses me is: if I use the 200,000-row dataset with the global variable, it runs perfectly. Why is that? – arroganthooman Aug 18 '21 at 09:51
  • It's probably because it's using only a single worker for a smaller number of records. I don't see a straightforward way to have a global data structure. You probably have to re-think your pipeline (as suggested in the answer). – Praneeth Peiris Aug 18 '21 at 15:34

2 Answers


You can try using the Beam state API. Note, though, that the state API is not designed to store large amounts of data.
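For illustration, here is a minimal sketch of that idea using a stateful DoFn: it keeps one [ordertxt, upref] pair in per-key state instead of a process-global dictionary. It assumes the elements have first been keyed by (transaction_number, stdetail_seq); the class name, state name, and "last non-empty value wins" merge are illustrative, not taken from the question.

import apache_beam as beam
from apache_beam.coders.coders import PickleCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class InjectUprefOrdertxt(beam.DoFn):
    # Per-key state replaces the process-global pre_compute dictionary.
    PRECOMPUTE = ReadModifyWriteStateSpec('precompute', PickleCoder())

    def process(self, element, precompute=beam.DoFn.StateParam(PRECOMPUTE)):
        key, data = element  # key = (transaction_number, stdetail_seq)
        current = precompute.read() or ['', '']
        if data.get('dorder_ordertxt'):
            current[0] = data['dorder_ordertxt']
        if data.get('dorder_upref'):
            current[1] = data['dorder_upref']
        precompute.write(current)
        data['dorder_ordertxt'], data['dorder_upref'] = current
        yield key, data

It would be applied to a keyed PCollection, e.g. beam.Map(lambda d: ((d['transaction_number'], d['stdetail_seq']), d)) | beam.ParDo(InjectUprefOrdertxt()). Note that, unlike the two-pass compute/inject in the question, this injects whatever values have been seen so far for the key, so it is only a starting point.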

Another option might be storing your data in an external storage system (for example, GCS) so that all workers have access to that data.
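As a sketch of that option (the bucket path, file name, and key layout below are hypothetical): the pre-computed lookup could be written to GCS as one JSON object and loaded once per worker in DoFn.setup(), so every worker reads the same copy instead of relying on a process-local global.

import json
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class InjectFromGcs(beam.DoFn):
    # Hypothetical file layout: one JSON object mapping
    # "trans_number|seq" -> [ordertxt, upref].
    def __init__(self, lookup_path='gs://your-bucket/pre_compute.json'):
        self.lookup_path = lookup_path
        self.pre_compute = None

    def setup(self):
        # Runs once per worker, so each worker loads its own full copy.
        with FileSystems.open(self.lookup_path) as f:
            self.pre_compute = json.loads(f.read())

    def process(self, data):
        key = '%s|%s' % (data['transaction_number'], data['stdetail_seq'])
        ordertxt, upref = self.pre_compute.get(key, ['', ''])
        data['dorder_ordertxt'] = ordertxt
        data['dorder_upref'] = upref
        yield data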

Note that either solution could limit the parallelization (and hence performance) of your pipeline if you try to store large amounts of data. In such a case it might be better to redesign your pipeline to be truly parallelizable.
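One way such a redesign could look (a sketch only, and it assumes the PreCompute and inject steps from the question are replaced by a single keyed grouping): group all rows that share (transaction_number, stdetail_seq), then merge and inject within each group, so no cross-worker state is needed at all.

import apache_beam as beam

def merge_and_inject(rows):
    # For one (transaction_number, stdetail_seq) key: keep the last non-empty
    # ordertxt/upref seen, then write both values back into every row.
    rows = list(rows)
    ordertxt, upref = '', ''
    for row in rows:
        ordertxt = row.get('dorder_ordertxt') or ordertxt
        upref = row.get('dorder_upref') or upref
    for row in rows:
        row['dorder_ordertxt'] = ordertxt
        row['dorder_upref'] = upref
        yield row

injected = (
    left_join_std_dtdo
    | 'Key by trans_number and seq' >> beam.Map(
        lambda d: ((d['transaction_number'], d['stdetail_seq']), d))
    | 'Group per key' >> beam.GroupByKey()
    | 'Merge and inject per key' >> beam.FlatMap(lambda kv: merge_and_inject(kv[1]))
)

Since GroupByKey brings all rows for a key together before the merge runs, every row for the key ends up with the same injected values. Beam does not guarantee the ordering of grouped values, though, so "last non-empty" is only deterministic if at most one row per key carries each field.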

chamikara

Apache Beam is built on the assumption that it runs on distributed infrastructure: nodes run independently, and any state would have to be shared between workers explicitly. Therefore, global variables are not available. If you really need to exchange information across workers, you will probably have to implement that yourself. However, I would rather recommend rethinking the pipeline.

Andreas Neumeier