I've been stuck for a few days. My problem is: I built a data pipeline using Apache Beam with the Dataflow runner. The script uses a global variable (a dictionary) that is accessed by a couple of functions. When I run the pipeline on roughly 200,000 rows of data, it succeeds both locally and on Dataflow. But when I run it on Dataflow against a dataset of 6,000,000 rows, the dictionary comes up empty. Here is my code:
The functions:
import json

# Shared lookup keyed by (transaction_number, stdetail_seq):
# {(transnumber, seq): [dorder_ordertxt, dorder_upref], ...}
pre_compute = {}

def compute_all_upref_and_ordertxt(data):
    '''
    Collect dorder_ordertxt and dorder_upref for every
    (transaction_number, stdetail_seq) pair.
    '''
    trans_number = data.get("transaction_number")
    seq = data.get("stdetail_seq")
    # Get and remove ordertxt and upref from data
    ordertxt = data.pop("dorder_ordertxt", "")
    upref = data.pop("dorder_upref", "")
    global pre_compute
    if pre_compute.get((trans_number, seq)) is None:
        pre_compute[(trans_number, seq)] = [ordertxt, upref]
    else:
        if ordertxt:
            pre_compute[(trans_number, seq)][0] = ordertxt
        if upref:
            pre_compute[(trans_number, seq)][1] = upref
    return data  # -> data with no upref and ordertxt

def evaluate_and_inject_upref_ordertxt(data):
    # json.loads() is roughly 4-6x faster than eval()
    data = data.strip("\n")
    data = data.replace("'", '"')
    data = data.replace("None", "null")
    data = json.loads(data)  # str to dict
    trans_number = data.get('transaction_number')
    seq = data.get('stdetail_seq')
    global pre_compute
    ordertxt, upref = pre_compute[(trans_number, seq)]
    data['dorder_ordertxt'] = ordertxt
    data['dorder_upref'] = upref
    return data
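To show what I expect pre_compute to hold, here is a tiny single-process example with made-up values (the real elements come from the joins in the pipeline below). Two partial rows for the same key merge into one entry:

# Made-up rows: the same (transaction_number, stdetail_seq) pair arrives
# twice, each time carrying only one of the two fields.
compute_all_upref_and_ordertxt({"transaction_number": "TRX-1", "stdetail_seq": 1,
                                "dorder_ordertxt": "some order text", "dorder_upref": ""})
compute_all_upref_and_ordertxt({"transaction_number": "TRX-1", "stdetail_seq": 1,
                                "dorder_ordertxt": "", "dorder_upref": "UP-42"})

print(pre_compute)
# {('TRX-1', 1): ['some order text', 'UP-42']}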
The pipeline code:
left_join_std_dtdo = (
    join_stddtdo_dict
    | 'Left Join STD DTable DOrder' >> Join(
        left_pcol_name=stdbsap_dimm_data, left_pcol=left_join_std_bsap,
        right_pcol_name=dtdo_data, right_pcol=left_join_dtdo,
        join_type='left', join_keys=join_keys)
    | 'UPDATE PRICE FOR SCCRM01' >> beam.ParDo(update_price_sccrm01())
    | 'REMOVE PRICE from DICTIONARY' >> beam.ParDo(remove_dtdo_price())
    | 'PreCompute All Upref and ordertxt based on trans_number and seq' >> beam.Map(compute_all_upref_and_ordertxt)
)

rm_left_std_dtdo = (
    left_join_std_dtdo
    | 'CHANGE JOINED STD DTDO INTO STR' >> beam.Map(lambda x: str(x))
    | 'DISTINCT STD DTDO' >> beam.Distinct()
    | 'EVALUATE AND INJECT AS DICT STD DTDO' >> beam.Map(evaluate_and_inject_upref_ordertxt)
    | 'Adjust STD_NET_PRICE WITH DODT_PRICE' >> beam.ParDo(replaceprice())
)
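For a runnable local reproduction of the pattern, here is a minimal sketch with the joins and the price transforms replaced by a hard-coded beam.Create of made-up rows, assuming the two functions above are in scope. On the DirectRunner it should print one deduplicated element with both fields re-injected:

import apache_beam as beam

with beam.Pipeline() as p:  # DirectRunner by default
    (p
     | beam.Create([
         {"transaction_number": "TRX-1", "stdetail_seq": 1,
          "dorder_ordertxt": "some order text", "dorder_upref": ""},
         {"transaction_number": "TRX-1", "stdetail_seq": 1,
          "dorder_ordertxt": "", "dorder_upref": "UP-42"},
     ])
     | 'PreCompute' >> beam.Map(compute_all_upref_and_ordertxt)
     | 'ToStr' >> beam.Map(lambda x: str(x))
     | 'Distinct' >> beam.Distinct()
     | 'Inject' >> beam.Map(evaluate_and_inject_upref_ordertxt)
     | 'Print' >> beam.Map(print))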
This runs perfectly with 200,000 rows of data, both locally and on Dataflow. But when I tried 6,000,000 rows on Dataflow, the line

    ordertxt, upref = pre_compute[(trans_number, seq)]

always raises a KeyError, as if the dictionary were empty. Any solutions?