Matching the last K occurrences of a pattern in a line

Question

Is it possible using sed/awk to match the last k occurrences of a pattern in a line?

For simplicity's sake, say I just want to match the last 3 commas in each line, for example (note that the two lines have a different number of total commas):

10, 5, "Sally went to the store, and then , 299, ABD, F, 10
10, 6, If this is the case, and also this happened, then, 299, A, F, 9

I want to match only the commas starting from 299 until the end of the line in both bases.

Motivation: I'm trying to convert a CSV file with stray commas inside one of the fields to tab-delimited. Since the number of proper columns is fixed, my thinking was to replace the first couple commas with tabs up until the troublesome field (which is straightforward), and then go backwards from the end of the line to replace again. This should convert all proper delimiter commas to tabs, while leaving commas intact in the problematic field.

There's probably a smarter way to do this, but I figured this would be a good sed/awk teaching point anyways.

Instead of putting tabs (that risks to eventually transform the problem to an other), you should add a double quote at the right place when needed. — Casimir et Hippolyte, Oct 13 '16 at 15:41
Yes, this sounds like a classic XY problem. Fix the code that generates invalid CSV. — glenn jackman, Oct 13 '16 at 16:11
Definitely an XY problem. Show us what you are trying to do, not how you are trying to do it otherwise you're going to be shown how to implement a bad idea. — Ed Morton, Oct 13 '16 at 16:14
@glennjackman: I don't have control over how the CSV's are generated. The data sets are public and they just come with this issue. — gogurt, Oct 13 '16 at 16:30
@CasimiretHippolyte: some of the strings (in the problematic field) contain double quotes in addition to commas... wouldn't that leave me with the same issue again? — gogurt, Oct 13 '16 at 16:33
@gogurt: It depends if they are escaped or not (with an other double quote or a backslash). You should add the several problematic lines (real example) to your question, otherwise it's not possible to give you a useful answer. — Casimir et Hippolyte, Oct 13 '16 at 17:26

karakfa · Accepted Answer · 2016-10-13T16:21:20.777

2

another sed alternative. Replace last 3 commas with tabs

$ rev file | sed 's/,/\t/;s/,/\t/;s/,/\t/' | rev

10, 5, "Sally went to the store, and then , 299  ABD     F       10

with GNU sed, you can simply write

$ sed 's/,/\t/g5' file

10, 5, "Sally went to the store, and then , 299  ABD     F       10

replace all starting from 5th.

edited Oct 13 '16 at 16:21

answered Oct 13 '16 at 16:11

karakfa

66,216
7
41
56

score 1 · Answer 2 · edited May 23 '17 at 11:48

1

One regex that matches each of the three last commas separately would require a negative lookahead, which sed does not support. You can use the following sed-regex to match the last three fields and the commas directly before them all at once:

,[^,]*,[^,]*,[^,]*$

$ matches the end of the line.

[^,] matches anything but ,.

Groups allow you to re-use the field values in sed:

sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'

For awk, have a look at How to print last two columns using awk.

There's probably a smarter way to do this

In case all your wanted commas are followed by a space and the unwanted commas are not, how about

sed 's/,[^ ]/./g'

This transforms a, b, 12,3, c into a, b, 12.3, c.

edited May 23 '17 at 11:48

Community

1
1

answered Oct 13 '16 at 15:52

Socowi

25,550
3
32
54

Brackets should be escaped if you don't use `-r` or `-E` flag in your first sed command and `([^,])` might be simplified to `(.*)` in that case. Moreover first of the three commas is removed, not replaced. I suggest you write `sed -r 's/,(.*),(.*),(.*),(.*)$/\t\1\t\2\t\3\t\4/ ' file` – SLePort Oct 13 '16 at 18:59
@Kenavoz You are right regarding the `()`. I fixed it. But `[^,]*` can *not* be replaced by `.*` in this case. The first comma of your regex will always match the first comma of the line. The first `.*` will match as much as possible, including every comma except the last three ones. Try it with `a,b,c,d,e,f,g,h,i`. – Socowi Oct 14 '16 at 07:04
1

You're right. `sed -r 's/(.*),(.*),(.*),(.*)$/\1\t\2\t\3\t\4/' file` should work better. Note that the first of three commas is still not replaced but removed with your sed. – SLePort Oct 14 '16 at 08:10
Thanks for the 2nd reminder. Fixed the problem. Your new version of the command looks right. – Socowi Oct 14 '16 at 08:20

choroba · Answer 3 · 2016-10-13T16:14:46.727

You can use Perl to add the missing double quote into each line:

perl -aF, -ne '$F[-5] .= q("); print join ",", @F' < input > output

or, to turn the commas into tabs:

 perl -aF'/,\s/' -ne 'splice @F, 2, -4, join ", ", @F[ 2 .. $#F - 4 ]; print join "\t", @F' < input > output

-n reads the input line by line.
-a splits the input into the @F array on the pattern specified by -F.
The first solution adds the missing quote to the fifth field from the right; the second one replaces the items from the third to the fifth from right with those elements joined by ", ", and separates the resulting array with tabs.

score 1 · Answer 4 · answered Oct 13 '16 at 16:56

To fix the CSV, I would do this:

echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
  perl -lne '
    @F = split /, /;             # field separator is comma and space
    @start = splice @F, 0, 2;    # first 2 fields
    @end = splice @F, -4, 4;     # last 4 fields
    $string = join ", ", @F;     # the stuff in the middle
    $string =~ s/"/""/g;         # any double quotes get doubled
    print join(",", @start, "\"$string\"", @end);
  '

outputs

10,5,"""Sally went to the store, and then ",299,ABD,F,10

score 0 · Answer 5 · answered Oct 13 '16 at 15:53

0

Hi I guess this is doing the job

echo 'a,b,c,d,e,f' | awk -F',' '{i=3; for (--i;i>=0;i--) {printf "%s\t", $(NF-i) } print ""}'

Returns

d    e    f

But you need to ensure you have more than 3 arguments

answered Oct 13 '16 at 15:53

eric.v

605
5
9

Ed Morton · Answer 6 · 2016-10-13T22:51:41.450

This will do what you're asking for with GNU awk for the 3rd arg to match():

$ cat tst.awk
{
    gsub(/\t/," ")
    match($0,/^(([^,]+,){2})(.*)((,[^,]+){3})$/,a)
    gsub(/,/,"\t",a[1])
    gsub(/,/,"\t",a[4])
    print a[1] a[3] a[4]
}

$ awk -f tst.awk file
10       5       "Sally went to the store, and then , 299        ABD     F       10
10       6       If this is the case, and also this happened, then, 299  A       F       9

but I'm not convinced what you're asking for is a good approach so YMMV.

Anyway, note the first gsub() making sure you have no tabs on the input line - that is crucial if you want to convert some commas to tabs to use tabs as output field separators!

Matching the last K occurrences of a pattern in a line

6 Answers6