
Here is a line from a Tomcat access log:

127.0.0.1 - - [24/May/2016:17:53:05 -0700] "POST /users HTTP/1.1" 200 10676

Is there a way to parse out the various fields here, like client IP, HTTP request method, request path, response code, etc., and load them into separate columns in a BigQuery table?

The table at the bottom of this page links to the fluentd catch-all config, but I believe different logs should be parsed and loaded differently, in a configurable manner, for ease of querying?

Or am I missing something fundamental here?

arun
  • This is less of an answer than just converting one problem into another, but tomcat logs showing up as textPayload instead of being structured by fluentd seems like the root problem here. I don't know if that means you're not set up with fluentd correctly or if the fluentd tomcat config is just suboptimal, but that's a thread you might pull on. Fixing that seems likely to solve your problem. – Ben Rister Jun 03 '16 at 13:02

2 Answers


It might not be what you mean - just a guess:

How about loading the log into a BigQuery table so that each log line becomes a row, and then parsing it into another table with something like the query below (the code is not meant to be optimal - just to show the idea):

SELECT
  -- each REGEXP_EXTRACT skips N space-delimited fields and captures the next one
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){0} (.+?) ') AS client_ip,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){1} (.+?) ') AS ident,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){2} (.+?) ') AS auth_user,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){3} (.+?) ') AS request_time,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){4} (.+?) ') AS tz_offset,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){5} (.+?) ') AS method,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){6} (.+?) ') AS request_path,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){7} (.+?) ') AS protocol,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){8} (.+?) ') AS status,
  REGEXP_EXTRACT(log_line, r'(?: (?:.+?)){9} (.+?) ') AS bytes_sent
FROM (
  -- strip brackets and quotes, then pad with spaces so every field is space-delimited
  SELECT ' ' + REGEXP_REPLACE(log_line, r'[\[\]\"]', '') + ' ' AS log_line
  FROM
    (SELECT '127.0.0.1 - - [24/May/2016:17:53:05 -0700] "POST /users HTTP/1.1" 200 10676' AS log_line)
)
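
If you query with Standard SQL instead of Legacy SQL, a similar split can be written with SPLIT() and array offsets - a minimal sketch along the same lines, assuming the raw lines sit in a column named log_line:

-- Standard SQL sketch: split on spaces and pick fields by position
SELECT
  fields[OFFSET(0)] AS client_ip,
  fields[OFFSET(5)] AS method,
  fields[OFFSET(6)] AS request_path,
  fields[OFFSET(8)] AS status
FROM (
  SELECT SPLIT(REGEXP_REPLACE(log_line, r'[\[\]"]', ''), ' ') AS fields
  FROM (SELECT '127.0.0.1 - - [24/May/2016:17:53:05 -0700] "POST /users HTTP/1.1" 200 10676' AS log_line)
)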
Mikhail Berlyant
  • Yes, this seems to be the approach, unless there is a way to configure the regex in fluentd. I will wait a couple more days to see if a better answer comes along before accepting yours. – arun Jun 02 '16 at 22:21
  • Sure. I am not a user of fluentd so I don't know if it is :o) – Mikhail Berlyant Jun 02 '16 at 22:22

Can you just import it as a "CSV"? BigQuery allows you to specify custom delimiters and quote characters.

127.0.0.1 - - [24/May/2016:17:53:05 -0700] "POST /users HTTP/1.1" 200 10676

It looks like you could provide a single space as the delimiter and a double-quote as the (optional) quote character. I would expect the example line above to parse into:

"127.0.0.1", "-", "-", "[24/May/2016:17:53:05", "-0700]", "POST /users HTTP/1.1", 200, 10676

The timestamp + timezone gets mangled a bit and you'd need to initially import some of the fields as strings, but you could post-process with a query (as in Mikhail's answer) to fix it up and avoid having yet-another-tool to configure and manage.
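
For example, the mangled timestamp pieces could be stitched back together in that post-processing query - a rough sketch in Standard SQL, where ts_part and tz_part are hypothetical names for the two string columns imported above:

-- Rebuild a proper TIMESTAMP from the two imported string pieces
-- (ts_part = '[24/May/2016:17:53:05', tz_part = '-0700]'); column names are hypothetical
SELECT
  PARSE_TIMESTAMP(
    '%d/%b/%Y:%H:%M:%S %z',
    CONCAT(REPLACE(ts_part, '[', ''), ' ', REPLACE(tz_part, ']', ''))
  ) AS request_ts
FROM (SELECT '[24/May/2016:17:53:05' AS ts_part, '-0700]' AS tz_part)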

Adam Lydick