GREP - finding all occurrences of a string

Question

I am tasked with white labeling an application so that it contains no references to our company, website, etc. The problem I am running into is that I have many different patterns to look for and would like to guarantee that all patterns are removed. Since the application was not developed entirely in-house, we cannot simply look for occurrences in messages.properties and be done. We must go through JSPs, Java code, and XML.

I am using grep to filter results like this:

grep -ir SOME_PATTERN . | grep -v import | grep -v '//' | grep -v '/\*' ...

The patterns are escaped when I'm using them on the command line; however, I don't feel this pattern matching is very robust. There could possibly be occurrences that have import in them (unlikely) or even /* (the beginning of a javadoc comment).

All of the text output to the screen must come from a string declaration somewhere or a constants file. So, I can assume I will find something like:

public static final String SOME_CONSTANT = "SOME_PATTERN is currently unavailable";

I would like to find that occurrence as well as:

public static final String SOME_CONSTANT = "
SOME_PATTERN blah blah blah";
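To illustrate the two cases above, here is a minimal sketch in Python that looks for the pattern only inside double-quoted string literals, including literals that continue onto the next line. The function name `literals_containing` and the sample text are made up for illustration. The regex is deliberately naive: it does not understand comments or character literals such as '"', and it says nothing about text in JSPs or XML that sits outside quotes.

```python
import re

# A double-quoted literal: escaped characters or anything except a quote
# or backslash. Line breaks are allowed, so a literal may span lines.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"', re.DOTALL)


def literals_containing(source, pattern):
    """Return (line_number, literal) for each string literal containing pattern."""
    needle = re.compile(re.escape(pattern), re.IGNORECASE)
    hits = []
    for match in STRING_LITERAL.finditer(source):
        if needle.search(match.group(0)):
            # Line on which the literal starts (1-based).
            line_number = source.count("\n", 0, match.start()) + 1
            hits.append((line_number, match.group(0)))
    return hits


# Hypothetical sample: an import (ignored, no quotes) and two constants.
sample = (
    'import com.SOME_PATTERN.util.Foo;\n'
    'public static final String A = "SOME_PATTERN is currently unavailable";\n'
    'public static final String B = "\n'
    'some_pattern blah blah blah";\n'
)

for line_number, literal in literals_containing(sample, "SOME_PATTERN"):
    print(line_number, repr(literal))
```

Because the search is restricted to literals, the import line is skipped without needing a separate grep -v import step.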

Alternatively, if we had an internal crawler / automated tests, I could simply pull back the xhtml from each page and check the source to ensure it was clean.

What specifically do you mean by "removed" when you talk about these patterns? What if the resulting file is syntactically incorrect as a result, or fails to run properly? Can you be confident that simple deletion of the entire sequence of characters will in each case not break the functionality of the program? (Since you mention the possibility of /* inside the patterns, I don't imagine that's the case. If it is, this is pretty simple. If it's not, I think you're effectively asking for a program that understands the source... effectively AI!) — Peter Hansen, Dec 04 '09 at 15:35

score 1 · Answer 1 · answered Nov 23 '09 at 21:07


To address your concern about missing some occurrences, why not filter progressively:

1. Create a text file with all possible matches as a starting point.
2. Use filter X (grep for '^import', for example) to dump probable false positives into a tmp file.
3. Use filter X again to remove those matches from your working file (a copy of [1]).
4. Do a quick visual pass of the tmp file and add any real matches back in.
5. Repeat [2]-[4] with other filters.
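The steps above can be sketched in Python as follows. This is only an illustration under assumptions: the candidate lines mimic the `path:content` format that a recursive grep prints, "Acme" stands in for a hypothetical company name, and `split_matches` is a made-up helper. Note that with the path prefix in front, a bare '^import' would not match, so the filters skip past the first colon.

```python
import re


def split_matches(lines, false_positive_regex):
    """Split candidate lines into (kept, set_aside) using one filter."""
    probe = re.compile(false_positive_regex)
    kept, set_aside = [], []
    for line in lines:
        (set_aside if probe.search(line) else kept).append(line)
    return kept, set_aside


# Step 1: every candidate match (hypothetical data, as grep -ir would print it).
working = [
    "./src/Foo.java:import com.acme.util.Bar;",
    './src/Foo.java:    String title = "Acme Portal";',
    "./src/Foo.java:    // Acme-specific workaround",
    './src/Baz.java:    String s = "see http://acme.example";  // link',
]

# One regex per filter: lines that are wholly an import or wholly a // comment.
filters = [r"^[^:]*:\s*import\b", r"^[^:]*:\s*//"]

review_pile = []
for regex in filters:
    # Steps 2 and 3: set probable false positives aside, keep the rest.
    working, set_aside = split_matches(working, regex)
    review_pile.extend(set_aside)

# Step 4 stays manual: read review_pile and move real matches back.
print("still to fix:", len(working))
print("to review by hand:", len(review_pile))
```

The last sample line contains // twice but is not set aside, because the comment filter only fires when the line starts with //.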

This might take some time, of course, but it doesn't sound like this is something you want to get wrong...

answered Nov 23 '09 at 21:07

grossvogel


sounds like a possible winner. I was hoping to find a regular expression that was the magic/easy button. – Nov 23 '09 at 21:20
I guess the question is what's more valuable to you: wasting an hour manually looking for possible false positives, or wasting an hour getting ripped a new one by your boss because your über-clever regexp missed some crazy convoluted corner case in the Java Language Specification. – Jörg W Mittag Nov 23 '09 at 23:41
I came from a mechanical engineering background, so I am aware that mistakes will occur ... I am trying to choose the path that will yield fewer mistakes and better results that are reproducible. A computer can do repetitive tasks without problem, humans on the other hand ... That is why computers exist. I can always tweak my regular expression, it only takes a minute to run; however, manually evaluating this can take days or weeks for the amount of content I'd have to go through and after a day or a few hours, I'm sure I might skip an occurrence or two here and there. – Dec 07 '09 at 14:01

psihodelia · Accepted Answer · 2009-11-23T20:55:10.620

0

I would use sed, not grep! sed performs basic text transformations on an input stream. Try its s/regexp/replacement/ command.

You can also try the awk command. Its -F option sets the field separator; use it with ; to split the lines of your files at each ;.

The best solution, however, would be a simple script in Perl or Python.
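A minimal sketch of what such a Python script might look like, assuming Python 3 and UTF-8 source files (Java .properties files are often ISO-8859-1, so adjust the encoding as needed). The rules, the "Acme" names, and the `white_label` function are hypothetical placeholders, not anything from the question. It applies a list of substitutions in order, much like chaining several s/regexp/replacement/ commands, and then reports any line that still mentions the company.

```python
import os
import re
import tempfile

# Hypothetical rules: (regex, replacement) pairs, applied in order.
RULES = [
    (re.compile(r"Acme Corp", re.IGNORECASE), "Example Vendor"),
    (re.compile(r"www\.acme\.example"), "www.vendor.example"),
]
# Anything still matching this after the rewrite needs a human look.
LEFTOVER = re.compile(r"acme", re.IGNORECASE)
EXTENSIONS = (".java", ".jsp", ".xml", ".properties")


def white_label(root):
    """Rewrite files under root in place; return leftover (path, line_no, line) hits."""
    leftovers = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(EXTENSIONS):
                continue
            path = os.path.join(folder, name)
            with open(path, encoding="utf-8") as handle:
                text = handle.read()
            for regex, replacement in RULES:
                text = regex.sub(replacement, text)
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(text)
            for line_no, line in enumerate(text.splitlines(), start=1):
                if LEFTOVER.search(line):
                    leftovers.append((path, line_no, line.strip()))
    return leftovers


# Demo on a throwaway directory so nothing real is modified.
with tempfile.TemporaryDirectory() as root:
    demo = os.path.join(root, "Messages.java")
    with open(demo, "w", encoding="utf-8") as handle:
        handle.write('String a = "Acme Corp is unavailable";\n'
                     'String b = "Contact acme-support";\n')
    remaining = white_label(root)
    with open(demo, encoding="utf-8") as handle:
        rewritten = handle.read()

print(rewritten)
for path, line_no, line in remaining:
    print(line_no, line)
```

Since the script rewrites files in place, it is worth running it on a copy or under version control, and rerunning it after each rule change until the leftover list is empty.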

edited Nov 23 '09 at 20:55

answered Nov 23 '09 at 20:48

psihodelia


sed is what I ended up using. In fact it is very easy to use, and once I figured out which regular expression I needed, everything fell into place. I simply daisy-chained my commands together: sed -e s/regexp/replacement/ -e ... -e ... | grep SOME_PATTERN > occurrences – Dec 28 '09 at 13:38
