I am trying to load data from GCS via Pub/Sub notifications and compute each user's maximum level by user_id. The following code runs fine with DirectRunner, but on Dataflow the job hangs in CombinePerKey(max).
Here is the code:
import json
import time
from datetime import datetime

import apache_beam as beam

class ParseAndFilterFn(beam.DoFn):
    def process(self, element):
        text_line = element.strip()
        data = {}
        try:
            data = json.loads(text_line.decode('utf-8'))
            if 'user_id' in data and data['user_id'] and 'level' in data and data['level']:
                yield {
                    'user': data['user_id'],
                    'level': data['level'],
                    'ts': data['ts']
                }
        except (ValueError, KeyError):
            # Skip lines that are not valid JSON or lack the expected fields.
            pass
def str2timestamp(t, fmt="%Y-%m-%dT%H:%M:%S.%fZ"):
    return time.mktime(datetime.strptime(t, fmt).timetuple())
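# For example (assumed input format), a 'ts' value such as "2021-01-01T00:00:05.000Z"
# becomes a Unix timestamp in seconds (interpreted as local time by time.mktime),
# which is what beam.window.TimestampedValue expects below.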
class FormatFieldValueFn(beam.DoFn):
    def process(self, element):
        yield {
            "field": element[0],
            "value": element[1]
        }
...
raw_event = (
    p
    | "Read Sub Message" >> beam.io.ReadFromPubSub(topic=args.topic)
    | "Convert Message to JSON" >> beam.Map(lambda message: json.loads(message))
    | "Extract File Name" >> beam.ParDo(ExtractFileNameFn())
    | "Read File from GCS" >> beam.io.ReadAllFromText()
)
filtered_events = (
    raw_event
    | "ParseAndFilterFn" >> beam.ParDo(ParseAndFilterFn())
)
raw_events = (
    filtered_events
    | "AddEventTimestamps" >> beam.Map(lambda elem: beam.window.TimestampedValue(elem, str2timestamp(elem['ts'])))
)
window_events = (
    raw_events
    | "UseFixedWindow" >> beam.WindowInto(beam.window.FixedWindows(5 * 60))
)
user_max_level = (
    window_events
    | 'Group By User ID' >> beam.Map(lambda elem: (elem['user'], elem['level']))
    | 'Compute Max Level Per User' >> beam.CombinePerKey(max)
)
(user_max_level
    | "FormatFieldValueFn" >> beam.ParDo(FormatFieldValueFn())
)
p.run().wait_until_finish()
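For reference, the inputs look roughly like this (bucket, file name, and field values below are made up):

A Pub/Sub notification for a new GCS object, from which ExtractFileNameFn builds the path passed to ReadAllFromText:

{"bucket": "my-bucket", "name": "events/level_events-000.zip"}

One JSON line inside the file:

{"user_id": "u123", "level": 7, "ts": "2021-01-01T00:00:05.000Z"}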
Then I upload a new zip file to GCS and the Dataflow pipeline starts running, but it hangs at the Compute Max Level Per User step.
Is there anything I am missing?
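In case it helps, here is a minimal bounded sketch of the same parse/window/combine steps that can be run locally with DirectRunner; the records are made up and Pub/Sub/GCS are replaced with beam.Create:

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        # Two made-up records for the same user in one 5-minute window.
        | "Create" >> beam.Create([
            b'{"user_id": "u123", "level": 3, "ts": "2021-01-01T00:00:05.000Z"}',
            b'{"user_id": "u123", "level": 7, "ts": "2021-01-01T00:01:05.000Z"}',
        ])
        | "Parse" >> beam.ParDo(ParseAndFilterFn())
        | "AddTimestamps" >> beam.Map(
            lambda elem: beam.window.TimestampedValue(elem, str2timestamp(elem['ts'])))
        | "Window" >> beam.WindowInto(beam.window.FixedWindows(5 * 60))
        | "ToKV" >> beam.Map(lambda elem: (elem['user'], elem['level']))
        | "MaxPerUser" >> beam.CombinePerKey(max)
        | "Print" >> beam.Map(print)  # expect ('u123', 7)
    )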