
I am trying to create my first pipeline in Dataflow. I have the same code running when I execute it with the interactive Beam runner, but on Dataflow I get all sorts of errors which are not making much sense to me. Here is a sample of the messages my pipeline reads from Pub/Sub:

{"timestamp":1589992571906,"lastPageVisited":"https://kickassdataprojects.com/simple-and-complete-tutorial-on-simple-linear-regression/","pageUrl":"https://kickassdataprojects.com/","pageTitle":"Helping%20companies%20and%20developers%20create%20awesome%20data%20projects%20%7C%20Data%20Engineering/%20Data%20Science%20Blog","eventType":"Pageview","landingPage":0,"referrer":"direct","uiud":"31af5f22-4cc4-48e0-9478-49787dd5a19f","sessionId":322371}

Here is my code:

from __future__ import absolute_import
import apache_beam as beam
#from apache_beam.runners.interactive import interactive_runner
#import apache_beam.runners.interactive.interactive_beam as ib
import google.auth
from datetime import timedelta
import json
from datetime import datetime
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AfterProcessingTime, AccumulationMode, AfterCount
from apache_beam.options.pipeline_options import GoogleCloudOptions
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions
from apache_beam.options.pipeline_options import StandardOptions
import argparse
import logging
from time import mktime

def setTimestamp(elem):
  from apache_beam import window
  # Attach the event's own timestamp so windowing uses event time.
  return window.TimestampedValue(elem, elem['timestamp'])

def createTuples(elem):
  # Key each event by its sessionId for the GroupByKey.
  return (elem["sessionId"], elem)

def checkOutput(elem):
  # Debugging helper: print the element and pass it through unchanged.
  print(elem)
  return elem


class WriteToBigQuery(beam.PTransform):
  """Generate, format, and write BigQuery table row information."""
  def __init__(self, table_name, dataset, schema, project):
    """Initializes the transform.
    Args:
      table_name: Name of the BigQuery table to use.
      dataset: Name of the dataset to use.
      schema: Dictionary in the format {'column_name': 'bigquery_type'}
      project: Name of the Cloud project containing BigQuery table.
    """
    # TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
    #super(WriteToBigQuery, self).__init__()
    beam.PTransform.__init__(self)
    self.table_name = table_name
    self.dataset = dataset
    self.schema = schema
    self.project = project

  def get_schema(self):
    """Build the output table schema."""
    return ', '.join('%s:%s' % (col, self.schema[col]) for col in self.schema)

  def expand(self, pcoll):
    return (
        pcoll
        | 'ConvertToRow' >>
        beam.Map(lambda elem: {col: elem[col]
                               for col in self.schema})
        | beam.io.WriteToBigQuery(
            self.table_name, self.dataset, self.project, self.get_schema()))


class ParseSessionEventFn(beam.DoFn):
  """Casts the numeric fields of a decoded session event to integers.

  Each element is a dict parsed from a JSON Pub/Sub message like the
  sample shown above.
  """
  def __init__(self):
    # TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
    #super(ParseSessionEventFn, self).__init__()
    beam.DoFn.__init__(self)

  def process(self, elem):
    #timestamp = mktime(datetime.strptime(elem["timestamp"], "%Y-%m-%d %H:%M:%S").utctimetuple())
    elem['sessionId'] = int(elem['sessionId'])
    elem['landingPage'] = int(elem['landingPage'])
    yield elem



class AnalyzeSessions(beam.DoFn):
  def __init__(self):
    # TODO(BEAM-6158): Revert the workaround once we can pickle super() on py3.
    #super(AnalyzeSessions, self).__init__()
    beam.DoFn.__init__(self)

  def process(self, elem, window=beam.DoFn.WindowParam):
    # Note: re-importing apache_beam.window here would shadow the window
    # argument that Beam injects via beam.DoFn.WindowParam, so don't.
    sessionId = elem[0]
    uiud = elem[1][0]["uiud"]
    count_of_events = 0
    pageUrl = []
    referrer = None  # only set if the session contains a landing page
    window_end = window.end.to_utc_datetime()
    window_start = window.start.to_utc_datetime()
    session_duration = window_end - window_start
    for rows in elem[1]:
      if rows["landingPage"] == 1:
        referrer = rows["referrer"]
      pageUrl.append(rows["pageUrl"])
    result = {
        'pageUrl': pageUrl,
        'eventType': "pageview",
        'uiud': uiud,
        'sessionId': sessionId,
        'session_duration': session_duration,
        'window_start': window_start,
    }
    print(result)  # debug output
    yield result

def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument('--topic', type=str, help='Pub/Sub topic to read from')
    parser.add_argument(
          '--subscription', type=str, help='Pub/Sub subscription to read from')
    parser.add_argument(
          '--dataset',
          type=str,
          required=True,
          help='BigQuery Dataset to write tables to. '
          'Must already exist.')
    parser.add_argument(
          '--table_name',
          type=str,
          default='game_stats',
          help='The BigQuery table name. Should not already exist.')
    parser.add_argument(
          '--fixed_window_duration',
          type=int,
          default=60,
          help='Numeric value of fixed window duration for user '
          'analysis, in minutes')
    parser.add_argument(
          '--session_gap',
          type=int,
          default=5,
          help='Numeric value of gap between user sessions, '
          'in minutes')
    parser.add_argument(
          '--user_activity_window_duration',
          type=int,
          default=30,
          help='Numeric value of fixed window for finding mean of '
          'user session duration, in minutes')
    args, pipeline_args = parser.parse_known_args(argv)
    session_gap = args.session_gap * 60
    options = PipelineOptions(pipeline_args)
    # Set the pipeline mode to stream the data from Pub/Sub.
    options.view_as(StandardOptions).streaming = True

    options.view_as(StandardOptions).runner = 'DataflowRunner'
    options.view_as(SetupOptions).save_main_session = save_main_session
    p = beam.Pipeline(options=options)
    lines = (
        p
        | beam.io.ReadFromPubSub(
            subscription="projects/phrasal-bond-274216/subscriptions/rrrr")
        | 'decode' >> beam.Map(lambda x: x.decode('utf-8'))
        | 'parse json' >> beam.Map(json.loads)
        | beam.ParDo(ParseSessionEventFn())
    )

    next = (
        lines
        | 'AddEventTimestamps' >> beam.Map(setTimestamp)
        | 'Create Tuples' >> beam.Map(createTuples)
        | 'Window' >> beam.WindowInto(window.Sessions(session_gap))
        | 'group by key' >> beam.GroupByKey()
        | 'analyze sessions' >> beam.ParDo(AnalyzeSessions())
        # checkOutput prints and returns the element; beam.Map(print) would
        # replace every element with None before the BigQuery write.
        | 'check output' >> beam.Map(checkOutput)
        | 'WriteTeamScoreSums' >> WriteToBigQuery(
            args.table_name,
            args.dataset,
            {
                "uiud": 'STRING',
                "session_duration": 'INTEGER',
                "window_start": 'TIMESTAMP',
            },
            options.view_as(GoogleCloudOptions).project)
    )



    result = p.run()
#    result.wait_until_finish()

if __name__ == '__main__':
  logging.getLogger().setLevel(logging.INFO)
  run()

The problem I am facing is with the AnalyzeSessions ParDo: it doesn't produce any output or any error. I have tried this code with the interactive Beam runner and it worked.

Any ideas why the pipeline is not working?

All the other steps, including the analyze sessions step, have an input.

[screenshot: Dataflow job graph showing the input counts for each step]

Here is the output section.

[screenshot: output section of the analyze sessions step]

The steps after this do not work, and I don't see anything in the logs either; not even the print statements add anything there.

EDIT: If it helps, here is how my data is going into the WindowInto and GroupByKey steps.

[screenshot: sample tuples entering the WindowInto step]

Any ideas on what I should try?

pavneet tiwana
  • I'm new to Beam, but I tried to find anything in the code that would cause this no-output. I'd recommend starting from a single PTransform, making it work, and adding another transform until you find out what the problem is. – Jacek Laskowski Jun 06 '20 at 14:07
  • Already did that; the steps before that are working, but I will give it another shot. – pavneet tiwana Jun 06 '20 at 17:57

2 Answers


I've only used Beam with the Java SDK, but there the process function generally does not return its results; it calls a Beam-supplied callback to output them, so that one input can produce any number of outputs.

In your example, AnalyzeSessions#process has a return statement. Looking at the Beam examples, I see a yield statement in the process function of a DoFn. Try yield? Is that Python's version of the output callback? https://beam.apache.org/get-started/wordcount-example/#specifying-explicit-dofns
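
For reference, here is a minimal sketch of a Python DoFn emitting output via yield (the DoFn name is hypothetical; the pageUrl field is from the question's data):

import apache_beam as beam

class ExplodePageUrls(beam.DoFn):
  # One input element may yield any number of output elements; each
  # yield is the Python analogue of calling the Java output callback.
  def process(self, elem):
    for url in elem['pageUrl']:
      yield url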

Alec

It looks like the problem here is before your AnalyzeSessions ParDo; it's not receiving any input according to your screenshots.

From your pipeline code, the most likely step to be delayed is the GroupByKey, which is also performing the sessionization logic. There are two things that could be causing delay there. First, windows (e.g. your sessions) will only close when the input watermark passes the end of the window. If the pipeline cannot keep up with its input, the watermark will not advance. Does your pipeline have high data freshness and/or a large backlog in Pub/Sub? If so, try using more workers, or run on a subset of your data.

Second, if your sessions never finish, you won't get any output. This can happen if for every session there is a constant stream of elements such that the gap duration is never reached.
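
To make that concrete, here is a minimal sketch of the sessionization involved (keyed_events stands in for the question's keyed PCollection; the 5-minute gap matches the question's default): a session window only closes, and GroupByKey only emits, once the watermark passes the last element's timestamp plus the gap.

import apache_beam as beam
from apache_beam import window

sessions = (
    keyed_events  # (sessionId, event) pairs carrying event timestamps
    | beam.WindowInto(window.Sessions(5 * 60))  # gap duration in seconds
    | beam.GroupByKey())  # fires only after watermark > last event + gap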

EDIT: Looks like the problem is with how you are setting timestamps. In general when using PubsubIO, if you want custom event timestamps, you should put the timestamp in an attribute on the messages, and then set the timestamp_attribute parameter of ReadFromPubSub to point at that attribute. This allows Dataflow to understand the timestamps in order to produce correct watermarks for the pipeline.
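
A minimal sketch of that setup, assuming a hypothetical attribute named ts and placeholder topic/subscription paths (Beam accepts the attribute value as milliseconds since the Unix epoch, or as an RFC 3339 string):

# Publisher side: attach the event time as a message attribute.
from google.cloud import pubsub_v1
import json

publisher = pubsub_v1.PublisherClient()
publisher.publish(
    'projects/my-project/topics/my-topic',   # placeholder topic path
    data=json.dumps(event).encode('utf-8'),  # the event payload
    ts=str(event['timestamp']))              # event time in ms, as a string

# Pipeline side: tell Beam which attribute carries the event time.
lines = p | beam.io.ReadFromPubSub(
    subscription='projects/my-project/subscriptions/my-sub',  # placeholder
    timestamp_attribute='ts')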

With what you were doing, Dataflow's view of the event timestamps was the default, which is the publish time of the message. You then moved the event timestamps backwards, which caused the elements to be "late". This caused them to be dropped in the GroupByKey.
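
If a little lateness is unavoidable, the windowing can also be configured to tolerate it instead of silently dropping late elements. A sketch with an arbitrary 10-minute allowance, assuming the Python SDK's allowed_lateness support (correct timestamps remain the real fix):

import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import AfterWatermark, AccumulationMode

windowed = (
    keyed_events  # (sessionId, event) pairs, as above
    | beam.WindowInto(
        window.Sessions(5 * 60),
        trigger=AfterWatermark(),
        accumulation_mode=AccumulationMode.DISCARDING,
        allowed_lateness=10 * 60))  # seconds of lateness to accept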

danielm
  • I tried to follow your advice: created a new subscription and topic, and made sure sessions were created; I even changed to a fixed window, but any step I put after the GroupByKey doesn't work. Any other ideas? Do you have some example code I can look at? – pavneet tiwana Jun 09 '20 at 16:30
  • Do you see a high data freshness for your pipeline? If so, that indicates the watermark is not advancing because your pipeline can't keep up with the input – danielm Jun 09 '20 at 17:11
  • What do you mean by data freshness? I am sending the following data: the timestamp is different, but the other data I am just sending in a loop. I provided the example in my code. What do I need to do to make the timestamp advance? – pavneet tiwana Jun 09 '20 at 17:30
  • I see now my Pub/Sub messages are not getting acknowledged; can that be part of the problem? – pavneet tiwana Jun 09 '20 at 17:34
  • Yeah, messages have to be acknowledged back to Pubsub for the watermark to advance; that is the cause of your problem. Can you tell why they aren't? – danielm Jun 09 '20 at 17:57
  • The messages are being acknowledged now. I made sure fresh data is coming in, purged all old messages, and went to the fixed window. The GroupBy step has output, but the next step I put, which is a simple print step (a map function that prints the element and then returns it), still has no output. – pavneet tiwana Jun 09 '20 at 20:19
  • I added the image of the tuples that are being fed into the WindowInto and GroupByKey functions; no output on the other side. As you can see, all messages have different timestamps. – pavneet tiwana Jun 09 '20 at 21:56
  • So I think I finally figured it out: the issue was with my own event timestamps; for some reason they were not working. I switched to time.time() and it is working. Do you know what is wrong with the logic I was using initially? – pavneet tiwana Jun 09 '20 at 22:29
  • Edited my answer to explain what's going on – danielm Jun 10 '20 at 01:39