I have a .txt file that looks like this:

2017-06-22 23:19:05,758 use database stocks
2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,

stock_symbol string, date TIMESTAMP,

The regex I wrote is: ^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2}),(\\d{3})\\s((?i)(create|select|use).*)$

But my output is:

2017-06-22 23:19:05,758 use database stocks
2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,

It does not pick up the continuation line of the input, namely `stock_symbol string, date TIMESTAMP,`. I need to capture this line as well.

halfer
  • Do you mean you have a two line string? Try adding `(?si)` at the start of the pattern (then, you may remove the `(?i)` from your pattern). – Wiktor Stribiżew Jun 27 '17 at 06:42
  • Yes Wiktor, I have a two-line string, and this is what I tried: "^?si(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s((create|select|use).*)$", which didn't work. I get nothing in the output when I run the Pig script. Thanks – Ashwini Kumar Jun 27 '17 at 08:32
  • No, I wrote "*add `(?si)`*" - you can try `(?si)^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2}),(\\d{3})\\s((create|select|use).*)$` – Wiktor Stribiżew Jun 27 '17 at 08:33
  • Wiktor, I tried it and I get the same output as before. Are you testing with Pig? – Ashwini Kumar Jun 27 '17 at 08:38
  • No, I just know it uses Java regex flavor. See [here](https://regex101.com/r/yjFxFR/1) what it matches. If it is not what you need, specify what you need to get in the end. – Wiktor Stribiżew Jun 27 '17 at 08:40
  • Can you post your Pig script? – Taha Naqvi Jun 27 '17 at 09:27
  • OK, what you are suggesting is cool, but try adding a random line to your input: your regex will still match, because of `?s` I think. Please check this link: http://regexr.com/3g879. You will see that the third line is not captured, which is what I want to capture. – Ashwini Kumar Jun 27 '17 at 09:43
  • Hi TKHN, right now I only want to parse the log and save it as .csv, so here is the script: `data = LOAD '/home/cloudera/dataset/test.log' USING org.apache.pig.piggybank.storage.MyRegExLoader('^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2}),(\\d{3})\\s(.*)((?i)(create|select|use).*)$') AS (dt: chararray, time1: chararray, port: chararray, random1: chararray, query: chararray); STORE data INTO '/home/cloudera/ash1111' USING PigStorage(',');` – Ashwini Kumar Jun 27 '17 at 09:45
  • Sorry for posting it like this. I don't know the best way to format it. – Ashwini Kumar Jun 27 '17 at 09:46

2 Answers

Try using the following pattern:

^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s((?i)(create|select|use)[\s\S]*)$

I replaced the .* at the end with [\s\S]*, because by default . does not match line terminators, whereas [\s\S] matches any character, newlines included.
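To make the difference concrete, here is a minimal Java sketch (Pig uses the Java regex flavor, as noted in the comments). It assumes the whole two-line record reaches the regex as a single string; whether MyRegExLoader actually hands over more than one line at a time is not confirmed in this thread. The class name `MultilineLogMatch` and the helper `captureQuery` are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultilineLogMatch {
    // Shared prefix: date, time, milliseconds, then the opening of the query group.
    // Backslashes are doubled because this is a Java string literal.
    static final String PREFIX =
        "^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2}),(\\d{3})\\s((?i)(create|select|use)";

    // Returns capture group 4 (the query text) if the whole input matches, else null.
    static String captureQuery(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        // Assumption: the record arrives as one string with an embedded newline.
        String record =
            "2017-06-22 23:21:27,056 CREATE TABLE stocksdata ( stock_exchange string,\n"
            + "stock_symbol string, date TIMESTAMP,";

        String dotStar = PREFIX + ".*)$";            // original tail: stops at the newline
        String anyChar = PREFIX + "[\\s\\S]*)$";     // this answer: crosses newlines
        String dotAll  = "(?s)" + PREFIX + ".*)$";   // comment suggestion: DOTALL flag

        System.out.println("dot-star matches: " + (captureQuery(dotStar, record) != null));
        System.out.println("[\\s\\S]* captures: " + captureQuery(anyChar, record));
        System.out.println("(?s) captures: " + captureQuery(dotAll, record));
    }
}
```

With this input, the `.*` variant fails to match the two-line string, while the `[\s\S]*` and `(?s)` variants both capture the query text including the second line. If the loader feeds the regex one line at a time, none of the three can see the continuation line, which would be consistent with the results reported in the comments.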

Uri Y

Finally, this expression worked:

(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2}),(\d{3})\s(\w{4})\s(.)(()(create\s|select\s|use\s).(.\s\S?\D.\s\D)*)

Thank you for the replies.