-3

I am looking for some type of regular expression (or similar) that I can use in UltraEdit text editor that will find text between a fixed number of commas (ignoring anything in quotes as shown below) and allow me to use do a Find and Replace.

In this example I want to find the text between commas 2 and 3 (for data in Column 3) and then be able to do a Find/Replace. So I am just looking for an expression I can plug in to the editor's Find and Replace function to find where this text is (for all rows) and replace is with what I have in the editor.

enter image description here

249716,"UNITED HEALTHCARE OF NEW YORK, INC",1007,ORGANIZATION,1007,US,,PO BOX 5290
249717,"ERIE FAMILY LIFE INSURANCE, COMPANY",INT018249699,ORGANIZATION,INT018249699

enter image description here

Nick
  • 268
  • 8
  • 33
  • There are plenty of regexp online sandboxes, where you can try things out. [Regex101](https://regex101.com/) is good one. – MyICQ Jul 20 '23 at 19:16
  • What you ask btw is hard for regexp, since you basically try to parse CSV, a fragile format. Imagine column 2 being `ACME CORP` vs `ACME CORP, INC` vs `ACME "SESAME" CORP` vs `ACME "SESAME" CORP, INC`. There's all sorts of things that will break the logic. Get a parser like Python, Perl or similar to extract the data. Much more robust. – MyICQ Jul 20 '23 at 19:25
  • I am not familiar enough with regular expressions, but was hoping there would be an expression that would accomplish this. – Nick Jul 20 '23 at 19:38
  • If your data is absolutely uniform, then yes. But if some lines contain different amounts of commas or quotes, it will break. You'll probably spend more time cleaning that up than you would parsing the CSV in the first place. – MyICQ Jul 20 '23 at 19:46
  • You may be better off opening the file in a spreadsheet program (which can easily import/parse CSV). Select the relevant column, search&replace, then export the result as a new CSV. – jhnc Jul 21 '23 at 04:18
  • @MyICQ The examples for data column 2 as posted by you are not all correct. Correct would be: `ACME CORP` or `"ACME CORP"`, `"ACME CORP, INC"`, `"ACME ""SESAME"" CORP"`, `ACME ""SESAME"" CORP, INC"` Fields with embedded commas or double-quote characters must be quoted. It looks like in this case the second data column contains company names which often contain one or more commas. There is quoted always the second field value for that reason which makes the search expression easier. Try the expression in my answer with `(?:"[^"]*")+` for matching the second field value on being always quoted. – Mofi Jul 22 '23 at 17:16
  • @Mofi that is correct, but as you know, CSV is a very fragile format. I work with clients every day and I have seen all kinds of inconsistent quoting and data exports, which will make regular expressions difficult. That's why I said to original poster that he should guarantee that data is consistent. And your answer probably assumes some consistency as well ? – MyICQ Jul 22 '23 at 18:25
  • 1
    @MyICQ I agree. The CSV file format is so simple, but so many programmers write code which write data into a CSV file using a not valid format. I cannot count anymore how often I needed to fix data in a CSV file in UltraEdit before it could be further processed in the application the data are for just because of the programmer of the data writing tool or user ignored the simple syntax rules. I wonder since 20 years why Microsoft Excel can save data in an Excel file with newline characters in a cell value fine into a CSV file, but fails to import the data correct from own generated CSV file. – Mofi Jul 23 '23 at 09:42

1 Answers1

1

A Perl regular expression for finding and selecting the field value in third data column in the CSV file is:

^[^\r\n,]*,(?:"[^"]*")+,\K[^\r\n,]*

This search expression works only under the following conditions:

  1. The first field value contains never a comma and is not enclosed in " for that reason too.
  2. The second field value must be enclosed in " and can be therefore a string containing a comma as well as newline characters and also " on being correct encoded as described by the Wikipedia article about comma-separated values format.
  3. The third field value must be without a comma and is not enclosed in " for that reason too.

Explanation of the regular expression in Perl syntax:

  1. ^ instructs the Perl regular expression engine to start each search at beginning of a line.

  2. [^\r\n,]* is a negative class definition to find any character NOT being a carriage return or a line-feed or a comma zero or more times. That expression matches the first comma separated value. There could be used also [0-9]+ for the given example.

  3. , must be found next (and not a newline character) being the first separator in the data row.

  4. (?:...) is a non-marking / non-capturing group to define an expression on which the multiplier + is applied for finding strings matching the expression inside the group one or more times.

  5. "[^"]*" is for finding strings beginning with " and ending with " and consisting of zero or more characters NOT being the character " which includes also newline characters like carriage return and line-feed. This expression applied one or more times matches the string value of second field value in each data row.

  6. , must be found next being the second separator in the data row.

  7. \K instructs the Perl regular expression engine to keep back the already matched string which results in getting this part of the found string not selected in UltraEdit and ignored for the replace operation. If the replace should modify the first or second field value in each data row in any way, remove \K from the search expression.

  8. [^\r\n,]* is again a negative class definition to find any character NOT being a carriage return or a line-feed or a comma zero or more times. That last part of the expression matches and selects the third field value in each data row on not being a string value which could contain a comma, carriage return, line-feed or double quote which would require enclosing the field value in " for valid CSV format.


It is also possible with a CSV file not containing newline characters inside a field value clicking in UltraEdit on third ribbon tab Edit in fourth group Column / block on command item CSV convert or in contemporary menu Edit in submenu CSV convert on first menu item Convert to fixed-width or in traditional menu Column on menu item Convert to fixed column for opening the dialog window Convert to Fixed Columns.

In this dialog window should be unchecked the first option Scan first line only (vs. complete file).

The second option Keep delimiter with fixed columns can be checked or unchecked as preferred by the user.

The third option Ignore separator in quotes (') should be unchecked in this case with CSV file using correct " to enclose a value with a comma, a double quote or a newline character in double quotes.

The fourth option Ignore separator in doubles quotes (") must be checked for this CSV file.

The fifth option Separator character (^t for tab): should be configured with , as separator.

A click on button Scan fills the last option Field widths (separated by comma ',') and a click on button Convert makes the conversion from comma-separated to fixed column width format.

Now replacements on field values can be done much easier by using either a Perl regular expression replace with ^.{x} to match x number of characters from the beginning of each line or there are used the advanced replace options In column x to y to do the replace only in the specified character column range on each line in the file. Note: The first character column has the column number 0 in the replace window. In the status bar is the column number shown with value 1 for the first character column.

The file can be converted back to CSV file with the command Convert fixed-width to CSV (ribbon mode and toolbar/menu mode with contemporary menus) respectively Convert to character delimited (toolbar/menu mode with traditional menus).


There can be used also an UltraEdit script for true CSV file parsing and do the modification with the script code.

Examples are:

Mofi
  • 46,139
  • 17
  • 80
  • 143