I've been trying to use Azure Data Lake Analytics to run some analysis over a large set of IIS log files. So far I can get this to work for a single, well-formed file using something like this:
@results =
    EXTRACT s_date DateTime,
            s_time string,
            s_ip string,
            cs_method string,
            cs_uristem string,
            cs_uriquery string,
            s_port int,
            cs_username string,
            c_ip string,
            cs_useragent string,
            sc_status int,
            sc_substatus int,
            sc_win32status int,
            s_timetaken int
    FROM @"/input/u_ex151115.log"
    USING Extractors.Text(delimiter: ' ', skipFirstNRows: 4);

@statuscount =
    SELECT COUNT(*) AS TheCount,
           sc_status
    FROM @results
    GROUP BY sc_status;

OUTPUT @statuscount
TO @"/output/statuscount_results.tsv"
USING Outputters.Tsv();
As you can see, in the EXTRACT statement I'm skipping over the four-line IIS log file header using the skipFirstNRows parameter. The problem I'm running into is that many of the log files I have as input contain additional headers in the middle of the file, presumably because the IIS app pool restarted at some point during the day. When I try to include these files in my query, I get the following error:
Unexpected number of columns in input record at line 14. Expected 14 columns, processed 6 columns out of 6.
The line number in the error corresponds to a place in the file where the extractor encountered the repeated header text.
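For reference, the block that reappears mid-file is the standard four-line IIS header, something like this (the exact #Software and #Date values vary by server and file; the #Fields line matches the fourteen columns in my EXTRACT):

#Software: Microsoft Internet Information Services 8.5
#Version: 1.0
#Date: 2015-11-15 00:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken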
My question is: using the Text extractor, is there a way to direct it to skip a line based on the line's starting character, or something similar? Or will I need to write a custom extractor to accomplish this?
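In case it helps to clarify what I mean by a custom extractor, below is a rough sketch of what I imagine writing against the U-SQL IExtractor interface if the built-in one can't do this. The class name is mine, and I've made the simplifying assumption of emitting every column as string, so the casts to int/DateTime would happen afterwards in the U-SQL script:

using Microsoft.Analytics.Interfaces;
using System.Collections.Generic;
using System.IO;

// Sketch: skip any line beginning with '#' (the IIS header marker),
// wherever it occurs in the file, and emit the remaining
// space-delimited tokens as string columns.
// AtomicFileProcessing = true processes each file as a whole, so I
// don't have to worry about a header being split across extents.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]
public class IisLogExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var reader = new StreamReader(input.BaseStream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Header lines (#Software, #Version, #Date, #Fields)
                // all start with '#'; skip them wherever they appear.
                if (line.Length == 0 || line[0] == '#')
                    continue;

                string[] parts = line.Split(' ');

                // Assumes the EXTRACT schema declares string columns;
                // stop at whichever runs out first to avoid index errors.
                for (int i = 0; i < parts.Length && i < output.Schema.Count; i++)
                {
                    output.Set<string>(i, parts[i]);
                }

                yield return output.AsReadOnly();
            }
        }
    }
}

If I understand the model correctly, I'd then register the assembly and replace the USING clause with USING new IisLogExtractor(); but I'd rather avoid the extra moving parts if the built-in extractor can already handle this.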