
My input:

2,india,"i join today,and
please
guide me,thank you",+91547854221


My requirement is to find every CR or LF (i.e. a line break typed with Enter) between the double quotes "....." in a single pass.

Required output:

2,india,"i join today,and please guide me,thank you",+91547854221


I have a regex for this, but it finds only one CR or LF at a time. I want to find all CR/LF characters in a single pass, not over multiple passes.

My regex:

(\")(?!\,)([^"]?)(\n|\r)([^"]?\") ----> ($3 is the CR or LF, which I replace with a space)

Replace with: $1$2 $4

What I am getting:

2,india,"i join today,and please
guide me,thank you",+91547854221

kiran
  • You have to do this in a couple of steps. The first step is a one-time validation that the text contains an even number of double quotes before you start replacing, i.e. it must match `^[^"]*(?:"[^"]*"[^"]*)*$`. If it passes, find all quoted entries and blindly replace every CR/LF inside them. You can use a replace callback, or search and build a new string. For a global replace with a callback, match `("[^"]*")`, then in the callback blindly replace `[\r\n]+` with nothing and return the result. It's a little more involved if double quotes can be escaped inside double quotes. –  Aug 14 '16 at 22:32
  • It's hard for me to understand. Can you please explain each step clearly? – kiran Aug 15 '16 at 05:18
  • I actually did explain it clearly in my comment, but I'll try again. **Step 1:** Validate that an even number of quotes exists in the file with a simple search: if `^[^"]*(?:"[^"]*"[^"]*)*$` matches, go to the next step. **Step 2:** Use a _nested_ replace. The outer replace matches `("[^"]*")`; on each match, remove _all_ CRs and LFs from $1 (the inner replace), then return that string to the outer replace. **This can also be accomplished by rewriting the csv string from scratch:** In a loop, globally find `([^"]*)("[^"]*"|$)`. Append $1 to a new string. Blindly remove all CRs and LFs from $2 and append that to the new string. –  Aug 15 '16 at 15:31
  • If you can do it this way in your archaic language, it will be almost as fast as assigning one string to another, so a 2 MB string would take less than a second. –  Aug 15 '16 at 15:40
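The two steps described in these comments could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `flatten_quoted_fields` and its `replacement` parameter are hypothetical, the default replacement is a space (as the question asks) rather than nothing, and escaped quotes inside quoted fields are not handled.

```python
import re

# Step 1 pattern: the whole text must contain an even number of double quotes.
BALANCED = re.compile(r'[^"]*(?:"[^"]*"[^"]*)*')
# Step 2 (outer) pattern: one quoted field, without support for escaped quotes.
QUOTED = re.compile(r'"[^"]*"')
# Step 2 (inner) pattern: a run of CR/LF characters.
LINE_BREAKS = re.compile(r'[\r\n]+')


def flatten_quoted_fields(text, replacement=" "):
    """Replace line breaks inside double-quoted fields (hypothetical helper)."""
    # Step 1: refuse to touch text whose quotes are not balanced.
    if BALANCED.fullmatch(text) is None:
        raise ValueError("unbalanced double quotes")
    # Step 2: the outer replace finds each quoted field; the callback
    # runs the inner replace on that field only and returns the result.
    return QUOTED.sub(
        lambda m: LINE_BREAKS.sub(replacement, m.group(0)),
        text,
    )


sample = '2,india,"i join today,and\nplease\nguide me,thank you",+91547854221\n'
print(flatten_quoted_fields(sample), end="")
```

Because each quoted field is matched once and scanned once, the work grows roughly in proportion to the size of the text, which is the point the comments make about speed.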

1 Answer


You can try `[\n\r](?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)` (replace with nothing or a space). This matches a \n or \r only if it is followed by an odd number of double quotes " up to the end of the input, which means the line break sits inside a quoted field.
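As a quick check, the pattern can be tried on the sample from the question. The snippet below is a minimal sketch in Python (the thread itself does not name a language); the constant name `INSIDE_QUOTES` is made up for illustration.

```python
import re

# A CR or LF that is followed by an odd number of double quotes
# before the end of the input, i.e. a line break inside a quoted field.
INSIDE_QUOTES = re.compile(r'[\n\r](?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)')

sample = '2,india,"i join today,and\nplease\nguide me,thank you",+91547854221'
result = INSIDE_QUOTES.sub(" ", sample)
print(result)
```

Note that the character class matches one character at a time, so a CRLF pair inside quotes becomes two spaces unless the class is written as `[\n\r]+`.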

Aran-Fey
  • What do you mean by {2}? The number of \n or \r characters is not fixed: sometimes there may be 3, another time 10, so I can't say that a given number of \n or \r will always appear. I want a regex that finds any number of \r or \n. Thank you for your reply. – kiran Aug 14 '16 at 09:08
  • I tried your regex. It works on a small number of lines, say 3 to 10, but when the number of lines increases it fails to replace. Can you simplify the regex so that it can be used on millions of records in a csv file? Thank you. – kiran Aug 15 '16 at 05:21
  • @kiran - Rawing's regex finds a line break, then checks whether there is an _odd_ number of quotes after it. That means the line break is between a pair of quotes (in a properly balanced csv file). This works for a small number of lines, but the performance hit is severe: every line break triggers a lookahead that rescans the rest of the file, so the cost grows much faster than the file size. Anything above a few hundred lines and you wait a long time. –  Aug 15 '16 at 15:16
  • I tried this in Notepad++ and Apache Pig, and both failed on a csv file that contains 263 lines. It simply does not work for larger numbers of lines. Can you please give me an alternative regex? It would be very helpful for my scenario. Thank you for your response. – kiran Aug 16 '16 at 04:51
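For large files, one alternative that avoids the lookahead altogether is to read the file line by line and keep track of whether a quoted field is still open. This technique is not proposed anywhere in the thread; the sketch below is in Python, and the generator name `join_quoted_lines` and its `replacement` parameter are assumptions made for illustration. It relies on quote parity, so it assumes the file has balanced double quotes.

```python
def join_quoted_lines(lines, replacement=" "):
    """Yield records, merging physical lines while a quoted field is open.

    `lines` is any iterable of strings that keep their line endings,
    for example an open text file. (Hypothetical helper.)
    """
    pending = ""
    in_quotes = False
    for line in lines:
        # An odd number of quotes on this line flips the open/closed state.
        if line.count('"') % 2 == 1:
            in_quotes = not in_quotes
        if in_quotes:
            # The record continues on the next physical line:
            # swap this line's ending for the replacement text.
            pending += line.rstrip("\r\n") + replacement
        else:
            yield pending + line
            pending = ""
    if pending:
        # Input ended inside an open quote; emit what was collected.
        yield pending


sample = '2,india,"i join today,and\nplease\nguide me,thank you",+91547854221\n'
records = list(join_quoted_lines(sample.splitlines(keepends=True)))
print(records)
```

Each line is examined once, so the running time grows in proportion to the file size, and only the current record is held in memory. Passing a file opened with `open(path, newline="")` would keep the original CRLF endings on the records that are written out.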