I am currently sending Stackdriver log files for my app to a BigQuery table. I would like to strip down the dataset and place it into a new BigQuery table to be queried later, and render those results in a view in my app. I will be using Python as my main language, as I do not know Java, and creating a cron job to run this script every 15 minutes to populate the new log dataset from Stackdriver.

Stripping down the dataset involves two steps:

1. Only write some of the columns from the original BigQuery table to the new one.
2. Extract substrings from certain columns and write them into new columns in the new BigQuery table.

For example:

A row in the original BigQuery table will contain the string

Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3

I would like to strip out iPad and place it into a devices column, AppleWebKit into a browsers column, and so on, in the new BigQuery table.
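To illustrate, a small regex helper could pull those tokens out of the user-agent string. This is just a sketch: the parse_user_agent name and the token lists are made up and would need tuning against real log data.

```python
import re

# Hypothetical token lists; extend them to match whatever shows up in real logs.
DEVICE_PATTERN = re.compile(r'\((iPad|iPhone|Android|Windows NT|Macintosh)')
BROWSER_PATTERN = re.compile(r'(AppleWebKit|Gecko|Trident)')

def parse_user_agent(ua):
    """Return the device and browser tokens found in a user-agent string."""
    device = DEVICE_PATTERN.search(ua)
    browser = BROWSER_PATTERN.search(ua)
    return {
        'device': device.group(1) if device else None,
        'browser': browser.group(1) if browser else None,
    }

ua = ('Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 '
      '(KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3')
print(parse_user_agent(ua))  # {'device': 'iPad', 'browser': 'AppleWebKit'}
```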

I know I can load the BigQuery libraries into Python to query the original BigQuery table, but how do I strip out what I want and write it to a new table? Would this be a good use case for pandas? Is there an easier way to accomplish this task than my current idea?
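For reference, here is the kind of thing I had in mind with the google-cloud-bigquery client library: do the stripping in SQL with REGEXP_EXTRACT and write the result straight to a destination table. This is only a sketch; the project, dataset, table, and column names are placeholders, since I haven't verified the actual Stackdriver export schema.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and column names -- adjust to the real export schema.
query = r"""
    SELECT
      timestamp,
      REGEXP_EXTRACT(user_agent, r'\((iPad|iPhone|Android|Windows NT|Macintosh)') AS device,
      REGEXP_EXTRACT(user_agent, r'(AppleWebKit|Gecko|Trident)') AS browser
    FROM `my-project.logs.stackdriver_raw`
    WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 15 MINUTE)
"""

job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string('my-project.logs.stripped')
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

client.query(query, job_config=job_config).result()  # blocks until the job finishes
```

The WHERE clause limits each run to the last 15 minutes of logs, so the cron job would not rescan the whole table on every pass.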

Prof. Falken
  • One possible downside to this approach (I suppose) is that every 15 mins you'd have to query over the whole table again. If it's a small table then it's probably not a big deal, but if it has hundreds of gigabytes then your expenses might increase too much. Have you also considered exporting the logs to Pub/Sub and using Dataflow to process them from there? Maybe this last approach will end up being cheaper (not necessarily easier, though, and I'm not sure if this approach is a good one as well, just throwing in my 2 cents). – Willian Fuks Nov 14 '17 at 15:28
  • @WillianFuks Pub/Sub is an option. To be honest, I am open to any "better" ways. Expenses may not be that much of a problem if they do not increase too much, which is something that can be monitored during testing. Are there any good use case documents or examples you could send? – Prof. Falken Nov 14 '17 at 15:56
  • @WillianFuks I just found this article in the gcloud docs https://cloud.google.com/appengine/docs/standard/python/logs/#reading_logs_in_the_console I was thinking I could just read the log files into pandas, transform the data there, and then use pandas.DataFrame.to_gbq to write to BigQuery (I've added a sketch of this approach below the comments). This means I should be able to retrieve just the raw logs every 15 mins and reduce the cost. Does this sound reasonable? – Prof. Falken Nov 15 '17 at 14:09
  • I'm not sure using pandas is a good idea for processing data; while the tool is very good at analyzing small data, it won't scale properly for big data. I'd use Dataflow instead, as the number of workers allocated to the job grows with the processing requirements (on top of it being 100% serverless, which I find quite valuable and worth aiming for). – Willian Fuks Nov 15 '17 at 14:22
  • @WillianFuks I am new to gcloud and have seen the Dataflow documentation, albeit I'm a bit confused about how the process works. The main objective I would like to achieve is getting the log data generated by the app into BigQuery so I can begin writing a reports section of my app that queries that cleaned log data. Would this be an easy setup with Dataflow? – Prof. Falken Nov 15 '17 at 14:32
  • I'm not sure; I do know that learning Dataflow would require learning several new concepts. Maybe one thing you could do is schedule this process to run first in BQ and see how the expenses go. If it's not that much then it'd be a good solution already (and if it turns out to be prohibitively expensive then you might consider exporting the Stackdriver data to Pub/Sub and then using Dataflow to filter the data you'll be sending to BQ). – Willian Fuks Nov 15 '17 at 14:42
  • @WillianFuks I will look into that solution. Thanks for your comments! – Prof. Falken Nov 15 '17 at 14:47
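Edit: following up on the comments, here is a sketch of the pandas.DataFrame.to_gbq approach. It assumes the raw log entries have already been fetched into a list of dicts, reuses the hypothetical parse_user_agent helper from above, and uses placeholder project and table names; it also requires the pandas-gbq package.

```python
import pandas as pd

# `records` stands in for log entries already fetched from the logs API.
records = [
    {'timestamp': '2017-11-15T14:00:00Z',
     'user_agent': 'Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) '
                   'AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 '
                   'Mobile/9B206 Safari/7534.48.3'},
]

df = pd.DataFrame(records)

# Expand the parsed user-agent dicts into device/browser columns.
parsed = df['user_agent'].apply(parse_user_agent).apply(pd.Series)
df = pd.concat([df.drop(columns=['user_agent']), parsed], axis=1)

df.to_gbq('logs.stripped', project_id='my-project', if_exists='append')
```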
