Hi all, my first post is about something I thought would be simple...

I haven't been able to find an example of a similar problem/solution.

I have thousands of text files with thousands of lines of content in the form

<word><space><word><space><number>

Example:

    example for 1
    useful when 1
    for. However 1
    ,boy wonder 1
    ,hary-horse wondered 2

In the above example I want to exclude line 3, as it contains internal punctuation.

I'm trying to use GNU grep 2.25, but am not having any luck.

My initial attempt was the following (however, this does not allow the "-" internal to the pattern):

    grep -v [:alnum:]*[:punct:]*[:alnum:]* filename

So I tried this instead:

    grep -v [:alnum:]*[:space:]*[!]*["]*[#]*[$]*[%]*[&]*[']*[(]*[)]*[*]*[+]*[,]*[.]*[/]*[:]*[;]*[<]*[=]*[>]*[?]*[@]*[[]*[\]*[]]*[^]*[_]*[`]*[{]*[|]*[}]*[~]*[.]*[:space:]*[:alnum:]* filename

However, I need to factor in spaces and "-", as these are acceptable internal to the string.

I had been trying the [:punct:] set, but now see that it contains "-", so clearly that will not work.

I do currently have a stored procedure in T-SQL to process these files, but would prefer to preprocess them before loading if possible, as the routine takes some seconds per file.

Has someone been able to achieve something similar?

B Ward
  • When you use the named character classes, you have to write `[[:alnum:]]`, for example. You can also write `[[:alnum:][:punct:]]` if you happen to want an alphanumeric or punctuation character (but not a space or control character), which is why the doubled square brackets actually make sense. – Jonathan Leffler Aug 24 '16 at 05:28
  • The first line of data (`example 1`) doesn't match your 'word space word space number' schema — why should it be selected? What should happen to a line containing `O'Reilly Books 23`? Are the numbers single digits? Are single quotes allowed in words? What should happen with `Coelecanths, Dodos Etc 19`? It contains the 'word space word space number' pattern, but it also has the extra word with punctuation attached; should lines like that be selected? Working with regexes tends to generate a series of approximations to what you want — and the final result is usually subvertible by the perverse. – Jonathan Leffler Aug 24 '16 at 06:07
  • Thanks @JonathanLeffler you are right something went amiss during editing for format. Once I find an alternative grep utility I'll get this running. Appreciate your comment and I hope checking doesn't reveal many unexpected behaviors. Thanks – B Ward Aug 24 '16 at 18:29

2 Answers

Your regex contains a long string of ordered optional elements, but that means it will fail if something happens out of order. For example,

    [!]*[?]*

will match `!?` but not `?!` (and of course, a character class containing a single character that is not a regex metacharacter is just equivalent to that single character, so you might as well say `!*?*`).
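A minimal sketch of the ordering problem, assuming GNU grep and made-up input lines. The `-x` option makes the pattern cover the whole line, which exposes the fixed order:

```shell
# In basic regex syntax '?' is a literal character; '*' is the quantifier.
# -x requires the pattern to match the entire line.
matched=$(printf '%s\n' '!?' '?!' '!!??' | grep -x '!*?*')

# Only the lines with every '!' before every '?' are printed.
printf '%s\n' "$matched"
```

The `?!` line is dropped because the pattern only allows `!` characters before `?` characters.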

You can instead use a single character class which contains all of the symbols you want to catch. As soon as you see one next to an alphanumeric character, you are done, so you don't need for the regex to match the entire input line.

    grep -v '[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?@\^_`{|}~]' filename

Also notice how the expression needs to be in single quotes in order for the shell not to interfere with the many metacharacters here. In order for a single-quoted string to include a literal single quote, I temporarily break out into a double-quoted string; see here for an explanation (I call this "seesaw quoting").

In a character class, if the class needs to include ], it needs to be at the beginning of the enumerated list; for symmetry and idiom, I also moved [ next to it.
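As a rough check, here is the same command run against the five sample lines from the question. This is a sketch: it pipes the lines in with `printf` instead of reading `filename`, and the variable names are only for illustration.

```shell
# ']' comes first inside the brackets; '-' is deliberately left out of the list.
# The '"'"' sequence closes the single quotes, adds a quoted ', and reopens them.
pattern='[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?@\^_`{|}~]'

kept=$(printf '%s\n' \
    'example for 1' \
    'useful when 1' \
    'for. However 1' \
    ',boy wonder 1' \
    ',hary-horse wondered 2' | grep -v -- "$pattern")

# Prints every sample line except 'for. However 1'.
printf '%s\n' "$kept"
```

The leading commas survive because nothing alphanumeric precedes them, and `hary-horse` survives because `-` is not in the list.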

Moreover, as pointed out by Jonathan Leffler, a POSIX character class name needs to be inside a character class; so to match one character belonging to the [:alnum:] named set, you say [[:alnum:]]. (This means you can combine sets, so [-[:alnum:].] covers alphanumerics plus dash and period.)
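A small illustration of a combined set, assuming GNU grep (for `-o`) and an invented input string:

```shell
# -o prints each match on its own line; -E lets '+' mean "one or more".
# '-' is listed first so it is taken literally, then the named class, then '.'.
tokens=$(printf '%s\n' 'a-b.c d_e' | grep -oE '[-[:alnum:].]+')

# Prints: a-b.c, then d, then e (space and underscore are not in the set).
printf '%s\n' "$tokens"
```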

If you need to constrain this to match only on the first field, change the [[:alnum:]] to ^[[:alnum:]]\+.
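A sketch of that first-field variant on a few made-up lines. Note that `\+` is a GNU extension to basic regular expressions, which is assumed to be acceptable here since the question is about GNU grep:

```shell
# Anchored: only punctuation directly after the leading run of alphanumerics counts.
pattern='^[[:alnum:]]\+[][!"#$%&'"'"'()*+,./:;<=>?@\^_`{|}~]'

kept=$(printf '%s\n' \
    'for. However 1' \
    'big deal. 1' \
    'hary-horse wondered 2' | grep -v -- "$pattern")

# 'big deal. 1' is kept now, because its punctuation is in the second field.
printf '%s\n' "$kept"
```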

Not realizing that `a*b*c*` matches anything is a common newbie error. You want to avoid writing an expression where all elements are optional, because it will match every possible string. Focus on what you want to match (the long list of punctuation characters, in your case) and then maybe add optional bits of context around it if you really need to; but the fewer of these you need, the faster it will run, and the easier it will be to see what it does. As a quick rule of thumb, when deciding whether a line matches, `a*bc*` is equivalent to just `b`: leading or trailing optional expressions might as well not be specified, because they do not affect which lines are selected.
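This is easy to confirm by counting matching lines with `grep -c`. The input lines and the helper name `sample_lines` below are invented for the illustration:

```shell
# Four lines, one of them empty.
sample_lines() { printf '%s\n' 'xyz' '' 'abc' 'b'; }

all_optional=$(sample_lines | grep -c 'a*b*c*')  # matches every line, even the empty one
with_context=$(sample_lines | grep -c 'a*bc*')   # selects the same lines as plain 'b'
just_b=$(sample_lines | grep -c 'b')

# Prints: 4 2 2
echo "$all_optional $with_context $just_b"
```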

tripleee
  • Using `[:alnum:]` is the same as using `[almnu:]`, is it not? You need `[[:alnum:]]`. – Jonathan Leffler Aug 24 '16 at 05:30
  • @JonathanLeffler Absolutely! Thanks for noticing. – tripleee Aug 24 '16 at 05:31
  • Thanks for the help just getting errors from a bug - GnuWin32\bin\grep: (standard input): Not enough space. Will reply once I sort this out - thanks – B Ward Aug 24 '16 at 05:38
  • That's certainly unrelated to your regex problem. http://savannah.gnu.org/bugs/?25414 – tripleee Aug 24 '16 at 05:57
  • yes it appears to be a known bug in the GNU grep version - looking for other options – B Ward Aug 24 '16 at 09:52
  • Not using Windows will not be a choice you regret. Barring that, Perl or Awk are nicely upwards-compatible with `grep`, though you'll need some changes both in the regex dialect and in the overall approach. `grep '(a/b)' file` would be `awk '/\(a\/b\)/' file` or `perl -ne 'print if m%\(a/b\)%' file`. Of the two, Perl is more versatile, but harder to learn to write and understand. These days, maybe look at Python too if you are seriously considering Perl (though a Python one-liner for this is harder to come up with). – tripleee Aug 25 '16 at 04:00
  • Thanks for your help on this one. Saving huge amount of time but still suffering from some silly edge cases. Thank you – B Ward Aug 31 '16 at 02:38
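
The grep-to-awk translation mentioned in the comments above can be checked with a small sketch. The input lines and the helper name `sample_input` are made up for the illustration, and only the awk form is shown:

```shell
sample_input() { printf '%s\n' 'x(a/b)y' 'ab' 'a/b'; }

# In grep's basic syntax, parentheses are literal characters.
with_grep=$(sample_input | grep '(a/b)')

# awk uses extended syntax, so the parentheses (and the '/' delimiter) need escaping.
with_awk=$(sample_input | awk '/\(a\/b\)/')

# Both commands select the same single line.
printf '%s\n' "$with_grep" "$with_awk"
```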
On the face of it, you're looking for the 'word space word space number' schema, where 'word' is 'either a single alphanumeric, or a run of alphanumeric and punctuation characters that starts and ends with an alphanumeric', 'space' is 'one or more whitespace characters', and 'number' is 'one or more digits'.

In terms of grep -E (aka egrep):

    grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+'

That contains:

    [[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?

That detects a word with any punctuation surrounded by alphanumerics, and:

    [[:space:]]+
    [[:digit:]]+

which look for one or more whitespace characters and one or more digits, respectively.
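
To see what the 'word' piece accepts on its own, it can be tested against single tokens with `-x` (whole-line match). This is a sketch with hand-picked tokens; the variable name `word` is only for illustration:

```shell
word='[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?'

# Each candidate token sits on its own line; -x makes the pattern cover all of it.
accepted=$(printf '%s\n' 'a' 'hary-horse' "O'Reilly" 'for.' ',boy' | grep -xE -- "$word")

# 'for.' and ',boy' are rejected: they do not both start and end with an alphanumeric.
printf '%s\n' "$accepted"
```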

Using a mildly extended data file, this produces:

    $ cat data
    example for 1
    useful when 1
    for. However 1
    ,boy wonder 1
    ,hary-horse wondered 2
    O'Reilly Books 23
    Coelecanths, Dodos Etc 19
    $ grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+' data
    example for 1
    useful when 1
    ,boy wonder 1
    ,hary-horse wondered 2
    O'Reilly Books 23
    Coelecanths, Dodos Etc 19
    $

It eliminates the for. However 1 line as required.
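
The pattern is not anchored, which is why `,boy wonder 1` and `Coelecanths, Dodos Etc 19` still get through: a later part of the line satisfies the schema. If such lines should be rejected as well (an assumption; the question does not say), one possible variant is to add `-x` so that the whole line must fit. A sketch, with the `word` variable introduced only for readability:

```shell
word='[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?'

# -x: the entire line must be exactly word, whitespace, word, whitespace, digits.
strict=$(printf '%s\n' \
    'example for 1' \
    'useful when 1' \
    'for. However 1' \
    ',boy wonder 1' \
    ',hary-horse wondered 2' \
    "O'Reilly Books 23" \
    'Coelecanths, Dodos Etc 19' |
    grep -xE -- "${word}[[:space:]]+${word}[[:space:]]+[[:digit:]]+")

# Only lines that fit the schema from start to end remain.
printf '%s\n' "$strict"
```

Whether this stricter reading is the right one depends on how lines with leading punctuation are supposed to be treated.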

Jonathan Leffler
  • Thanks for your help on this one - it works as required. Saves a huge amount of processing time in SQL Server however still have a few nasty edge cases so might have to do a double-pronged approach. Appreciate the effort and completeness!! – B Ward Aug 31 '16 at 02:37