
I want to parse words from a text file. Apostrophes should be preserved, but single quotes should be removed. Here is some test data:

john's apostrophe is a 'challenge'

I am experimenting with grep as follows:

grep -o "[a-z'A-Z]*" file.txt

and it produces:

john's
apostrophe
is
a
'challenge'

Need to get rid of those quotes around the word challenge.

The correct/desired output should be:

john's
apostrophe
is
a
challenge

EDIT: As the consensus seems to be that apostrophes are problematic to recognize, I am now seeking a way to strip any kind of apostrophe (leading, trailing, embedded) out of all words. The words are to be added to a vocabulary index. The phrase searching should also strip out apostrophes. This may need another question.

shanethehat
ScrollerBlaster

2 Answers


Do you need to use grep? Here's a sed example just in case:

$ echo "john's apostrophe is a 'challenge'" | sed -re "s/'(\S*)'/\1/g"
john's apostrophe is a challenge

sed is a stream editor; I used it to perform a substitution. The format is s/pattern/subst/, and the trailing g stands for global. The pattern matches an arbitrary number (*) of non-whitespace characters (\S) between two single quotes and replaces the whole match with just those characters, referred to as \1 because I captured them with round brackets (...).
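
For anyone who wants the same substitution in Java, here is a minimal sketch. It assumes Java's `String.replaceAll`, which uses `$1` where sed uses `\1`; the class and method names are hypothetical.

```java
class QuoteStripper {
    // Hypothetical helper: removes a pair of single quotes that wraps a run of
    // non-whitespace characters, keeping only the captured text ($1).
    static String stripQuotes(String line) {
        return line.replaceAll("'(\\S*)'", "$1");
    }

    public static void main(String[] args) {
        System.out.println(stripQuotes("john's apostrophe is a 'challenge'"));
        // prints: john's apostrophe is a challenge
    }
}
```

The apostrophe in john's survives only because no closing quote follows it before the next space, so this shares the limitations of the sed version.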

Edit: All right, here's an ugly Perl-like grep example:

$ echo "john's apostrophe is a 'challenge'" | grep -oP "(?<=')\S*(?=')|\w+'?\w*"
john's
apostrophe
is
a
challenge

I have no idea what I've done, so unexpected behavior is likely :)

With grep I used the -P option (Perl-compatible regexes) and positive lookaround assertions. The pattern matches either a word in single quotes, where the lookbehind (?<=') and lookahead (?=') check for the quotes without making them part of the match, or (|) a word with an optional apostrophe: one or more word characters (\w+), followed by an optional ' and then zero or more word characters (\w*).
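
The same pattern should carry over to Java, since `java.util.regex` supports lookbehind and lookahead. This is only a sketch with hypothetical class and method names. It skips empty matches to mimic `grep -o`, which does not print them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundWords {
    // Same alternation as the grep -oP pattern: a quoted word without its
    // quotes, or a word with one optional embedded/trailing apostrophe.
    private static final Pattern WORD =
            Pattern.compile("(?<=')\\S*(?=')|\\w+'?\\w*");

    // Hypothetical helper: returns every non-empty match, in order.
    static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            if (!m.group().isEmpty()) { // the first alternative can match ""
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String w : words("john's apostrophe is a 'challenge'")) {
            System.out.println(w);
        }
    }
}
```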

More edit: here's a sed command that seems to do the job and copes with @tchrist's example:

$ echo "john's apostrophe is a 'challenge'" | sed -re "s/(\W|^)'(\w*)'(\W|$)/\1\2\3/g"
john's apostrophe is a challenge
$ echo "’Tis especially hard, ’tisn’t it now, to leave it for the dogs’ breakfast, let a lone for the cats'" | sed -re "s/(\W|^)'(\w*)'(\W|$)/\1\2\3/g"
’Tis especially hard, ’tisn’t it now, to leave it for the dogs’ breakfast, let a lone for the cats'
Lev Levitsky
  • Wow. Two examples that work. Now if only somebody could explain it. I don't have to use grep at all. Problem originally arose from the desire to parse words from a text file using Java. Will either of these work in Java? – ScrollerBlaster Mar 21 '12 at 20:51
  • Parse me: `’Tis especially hard, ’tisn’t it now, to leave it for the dogs’ breakfast, let alone for the cats’.` – tchrist Mar 21 '12 at 21:04
  • I added some explanations in the answer, and unfortunately I don't know how it's done in Java. As @tchrist points out, the examples don't work well with apostrophes in the beginning of the words. – Lev Levitsky Mar 21 '12 at 21:09
  • You can use the same pattern with Java as you’ve used there. It supports all that stuff. – tchrist Mar 21 '12 at 21:17
  • I've updated the answer taking dogs' breakfast into account, in case you change your mind :) – Lev Levitsky Mar 21 '12 at 23:41

Here's a simpler grep-only approach:

grep -E -o "[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?" file.txt

which in Java is:

Pattern.compile("[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?")

(Both of those mean "an ASCII letter, optionally followed by a mixture of ASCII letters and/or apostrophes and an ASCII letter". The idea being that the matched substring has to start with a letter and end with a letter, but if it's more than two characters long, then it can contain apostrophes.)
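
A minimal sketch of how that pattern might be applied to a line of text in Java, using `Matcher.find()` to collect each match the way `grep -o` prints them. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ApostropheWords {
    // Starts and ends with an ASCII letter; apostrophes are allowed only inside.
    private static final Pattern WORD =
            Pattern.compile("[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?");

    // Hypothetical helper: returns each matched word, in order of appearance.
    static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String w : words("john's apostrophe is a 'challenge'")) {
            System.out.println(w);
        }
    }
}
```

On the test data from the question this yields john's, apostrophe, is, a, challenge. The surrounding quotes are dropped because a match can neither begin nor end with an apostrophe.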

To accept non-ASCII letters, the Java could be written as:

Pattern.compile("\\p{L}([\\p{L}']*\\p{L})?")

Edit for updated question (stripping out apostrophes): I don't think you can do that with just grep; but expanding our repertoire a bit, you can write:

tr -d "'" < file.txt | grep -E -o "[a-zA-Z]+"

or in Java:

String apostrippedStr = str.replace("'", "");

Pattern.compile("[a-zA-Z]+") // or "\\p{L}+" for non-ASCII support
// ... apply pattern to apostrippedStr
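
Put together, the two Java steps might look like the sketch below. The class and method names are hypothetical, and it strips only the ASCII apostrophe `'`, as above. Handling the typographic ’ as well would be an extra assumption about the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VocabularyNormalizer {
    // One or more Unicode letters; use "[a-zA-Z]+" for ASCII-only matching.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Hypothetical helper: deletes every ASCII apostrophe (leading, trailing,
    // or embedded), then collects the remaining runs of letters.
    static List<String> normalizedWords(String text) {
        String stripped = text.replace("'", "");
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(stripped);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(normalizedWords("john's apostrophe is a 'challenge'"));
        // prints: [johns, apostrophe, is, a, challenge]
    }
}
```

Running a search phrase through the same method before lookup would normalize the query and the vocabulary index in the same way.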
ruakh
  • Er no. A letter would be `\pL`. – tchrist Mar 21 '12 at 21:01
  • @tchrist: Perl has spoiled you; EREs don't have `\p` (though Java does). But I take your point. The OP was using `A-Z` and `a-z`, so I'll edit my answer to specify "ASCII letter". – ruakh Mar 21 '12 at 21:04
  • Yeah, yeah. I never use the system `grep`; I have my own, you know. The Java(-7) subset of Perl’s regexes is the minimal tolerable regex system for modern text processing. At least it finally meets Level 1 compliance for tr18. – tchrist Mar 21 '12 at 21:06
  • Your approach may work for his small sample, but real English words can have apostrophes at their end (think possessive plurals like *these species’ names*) and even at their front (like *’tisn’t*). Your pattern disallows those falling aft or fore alike, and doesn’t allow hyphenated words, either. Well, if the original querent didn’t think of it, I suppose you needn’t be expected to accommodate such things, either. It just won’t work on real-world data, is all. —— BTW, like Perl, Java doesn’t require braces around one-letter properties, so `\pL` suffices… and Huffman triumphs. – tchrist Mar 21 '12 at 21:12
  • @tchrist: You're quite right, but in the general case it's impossible to distinguish single-quotes from apostrophes programmatically; to take an extreme case, is `'n'` the letter *n* in single-quotes, or is it a contraction for *and*? (Regarding `\pL` vs. `\p{L}` -- the documentation uses the latter, so I take it to be preferable. Java has a firm policy of being as verbose as possible. I don't know why it supports regexes at all, but even there it's managed to make them longer and more unwieldy.) – ruakh Mar 21 '12 at 21:17
  • @tchrist querent? I resemble that remark. I have since thought of the trailing and leading apostrophes. It would be nice to accommodate them. Perhaps my question should be edited. I am giving this a +1 at least, because it answered my original limited question perfectly. – ScrollerBlaster Mar 21 '12 at 21:26
  • Closing this off. If you could look at my edit above. As a quick and dirty alternative, is it instead feasible to normalize the words (take out all apostrophes) before adding them to a vocabulary index, for searching of text documents? – ScrollerBlaster Mar 21 '12 at 22:36
  • Much obliged @ruakh. BTW it should read: `tr -d "'" – ScrollerBlaster Mar 21 '12 at 22:56