I've been trying to use Azure Data Lake Analytics to run some analysis over a large set of IIS log files. So far I can get this to work for a single, well-formed file using something like this:
@results =
    EXTRACT s_date DateTime,
            s_time string,
            s_ip string,
            cs_method string,
            cs_uristem string,
            cs_uriquery string,
            s_port int,
            cs_username string,
            c_ip string,
            cs_useragent string,
            sc_status int,
            sc_substatus int,
            sc_win32status int,
            s_timetaken int
    FROM @"/input/u_ex151115.log"
    USING Extractors.Text(delimiter: ' ', skipFirstNRows: 4);

@statuscount =
    SELECT COUNT(*) AS TheCount,
           sc_status
    FROM @results
    GROUP BY sc_status;

OUTPUT @statuscount
TO @"/output/statuscount_results.tsv"
USING Outputters.Tsv();
As you can see, in the EXTRACT statement I'm skipping over the four-line IIS log file header using the skipFirstNRows parameter. The problem I'm running into is that many of the log files I have as input contain additional headers in the middle of the file, presumably because the IIS app pool restarted at some point during the day. When I try to include these files in my query, I get the following error:
Unexpected number of columns in input record at line 14. Expected 14 columns, processed 6 columns out of 6.
The line number in the error corresponds to a place in the file where the extractor encountered the repeated header text.
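For reference, the block that reappears mid-file is the standard four-line IIS header, something like this (the exact #Software and #Date values vary by server and file; the #Fields line matches the fourteen columns in my EXTRACT):

#Software: Microsoft Internet Information Services 8.5
#Version: 1.0
#Date: 2015-11-15 00:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken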
My question is: using the Text extractor, is there a way to direct it to skip a line based on the line's starting character, or something similar? Or will I need to write a custom extractor to accomplish this?
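In case it helps to clarify what I mean by a custom extractor, below is a rough sketch of what I imagine writing against the U-SQL IExtractor interface if the built-in one can't do this. The class name is mine, and I've made the simplifying assumption of emitting every column as string, so the casts to int/DateTime would happen afterwards in the U-SQL script:

using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;
using System.IO;

// Sketch: skip any line beginning with '#' (the IIS header marker),
// wherever it occurs in the file, and emit the remaining
// space-delimited tokens as string columns.
// AtomicFileProcessing = true processes each file as a whole, so I
// don't have to worry about a header being split across extents.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class IisLogExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Header lines (#Software, #Version, #Date, #Fields)
                // all start with '#'; skip them wherever they appear.
                if (line.Length == 0 || line[0] == '#')
                    continue;

                string[] parts = line.Split(' ');

                // Assumes the EXTRACT schema declares string columns;
                // stop at whichever runs out first to avoid index errors.
                for (int i = 0; i < parts.Length && i < output.Schema.Count; i++)
                {
                    output.Set<string>(i, parts[i]);
                }

                yield return output.AsReadOnly();
            }
        }
    }
}

If I understand the model correctly, I'd then register the assembly and replace the USING clause with USING new IisLogExtractor(); but I'd rather avoid the extra moving parts if the built-in extractor can already handle this.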