3

I am trying to find lines in FileB (which is comma separated) that contain the content of lines in FileA. I originally tried grep, but it chokes on some of the characters in FileA. I assume the CSV formatting does not matter much, at least to grep.

$ grep -f FileA FileB
grep: Unmatched [ or [^

I am open to using any generally available Linux command, Perl or Python. No single expression matches all of the strings, which is why the lines of FileA themselves are used as the patterns. Below are some example lines from FileA that we want to match in FileB.

page=--&id='`([{^~
page=&rows_select=%' and '%'='
l=admin&x=&id=&pagex=http://.../search/cache?ei=utf-&p=change&fr=mailc&u=http://sub.domain.com/cache.aspx?q=change&d=&mkt=en-us&setlang=en-us&w=afe,dbfcd&icp=&.intl=us&sit=dbajdy.alt

The lines in FileB that contain the above strings will also contain additional characters, i.e. the strings in the two files will not be a one-for-one match:

If FileA contains abc and FileB contains 012abc*(), then 012abc*() should be printed.

Astron
  • Can you show your files or an example with which we can "play" and find a proper answer? – fedorqui Jun 13 '13 at 13:37
  • I provide some of the items in FileA that we would like to match in FileB. As updated in the question, FileB is comma separated. – Astron Jun 13 '13 at 14:05

3 Answers

2

A simple python solution would be:

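# Python 3; prints only exact whole-line matches
with open('filea', 'r') as fa:
    with open('fileb', 'r') as fb:
        # keep every pattern line in memory, without its trailing newline
        patterns = set(line.rstrip('\n') for line in fa)
        for line in fb:
            if line.rstrip('\n') in patterns:
                print(line, end='')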

This stores the whole pattern file in memory and compares each line of the other file against it, so only exact whole-line matches are printed.
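The comparison above only finds exact whole-line matches. If the goal is to print FileB lines that merely contain a FileA line, a minimal sketch could look like the following. The function name `lines_containing` and the script name in the usage comment are made up for illustration. Skipping empty patterns and stripping a trailing carriage return are assumptions about the data, not something stated in the question.

```python
def lines_containing(patterns, lines):
    """Return the lines that contain at least one pattern as a plain substring."""
    # Strip line endings (including a stray '\r' from Windows-style files) and
    # drop empty patterns, which would otherwise match every line.
    needles = [p.rstrip("\r\n") for p in patterns]
    needles = [p for p in needles if p]
    # Each matching line is kept once, even if several patterns occur in it.
    return [line for line in lines if any(n in line for n in needles)]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python match.py filea fileb
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as fa, open(sys.argv[2]) as fb:
            sys.stdout.writelines(lines_containing(fa, fb))
```

Because the test is a plain `in` substring check, characters such as `[`, `(` or `^` in the patterns carry no special meaning. The cost is one scan per pattern per line, which may be slow for very large pattern files.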

But why not just use diff? I'd have to check the man page, but I'm fairly sure it can report the lines that two files have in common. A quick search turned up this solution:

diff --unchanged-group-format='@@ %dn,%df 
%<' --old-group-format='' --new-group-format='' \
--changed-group-format='' a.txt b.txt
zmo
  • This seems to provide exact matches. Is there a way for either method to match the contents of the line to the other, i.e. filea contains `abc` and fileb contains `012abc*()`, `012abc*()` would print? – Astron Jun 13 '13 at 21:08
  • well the diff solution only does exact matches, that's true. Otherwise it will do its job... of being a "Diff". – zmo Jun 14 '13 at 00:46
1

Use fgrep (or, equivalently, grep -F), e.g. grep -F -f FileA FileB. That treats each pattern (each line of FileA) as a literal string to search for instead of a regular expression.
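To illustrate the difference, here is a small Python sketch (not grep itself, and Python's regex syntax is not identical to grep's). The first sample line from the question fails to compile as a regular expression because of its unbalanced brackets, while a plain substring test, roughly what a fixed-string search does, finds it. The sample FileB line and the helper name `compiles_as_regex` are invented for the example.

```python
import re

pattern = "page=--&id='`([{^~"                 # first sample line from FileA
line = "10.0.0.1,GET,page=--&id='`([{^~,200"   # made-up FileB line


def compiles_as_regex(text):
    """Report whether text is a syntactically valid Python regular expression."""
    try:
        re.compile(text)
        return True
    except re.error:
        return False


regex_ok = compiles_as_regex(pattern)    # unbalanced '(' and '[' are rejected
literal_hit = pattern in line            # plain substring search
escaped_hit = re.search(re.escape(pattern), line) is not None  # escaped regex

print(regex_ok, literal_hit, escaped_hit)  # prints: False True True
```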

Fred Foo
  • I tried that, and while no errors are printed, neither `fgrep` nor `grep -F` returns any matches. I was able to validate that the query `grep 'date=-- and ='`, which resides in fileA, returns the line from fileB. Any idea why the `-F` toggle is not working as advertised? – Astron Jun 13 '13 at 14:49
  • @Astron: I've no idea. You can try it with a simpler pattern, then step-by-step approximate the actual `FileB` to find out where `fgrep` goes wrong. – Fred Foo Jun 13 '13 at 15:13
  • Thanks for the feedback, I am going to see if I can figure out another method that may better handle the patterns. – Astron Jun 13 '13 at 15:33
1

Untested Solution:

Logic:

  • Store each line from FileA in the lines array
  • For each line in FileB, loop over the lines array;
  • Check if line in array appears as a part of your line in FileB
  • If index(..) returns > 0 then;
  • Print that line from FileB

awk 'NR==FNR{lines[$0]++;next}{for (line in lines) {if (index($0,line)>0) {print $0; break}}}' FILEA FILEB
jaypal singh
  • Logic sounds good but running it returns `^ unexpected newline or end of string`. Is there an extra bracket? – Astron Jun 13 '13 at 15:58
  • @Astron Sorry my bad. I missed one brace. Have updated the solution. – jaypal singh Jun 13 '13 at 16:18
  • The output from this was substantially more than it should have been. FileA and the lines from fileB are not a one-for-one match; does your answer expect them to be? – Astron Jun 13 '13 at 21:56
  • If you post some sample data from both files it might help debug the answer. – jaypal singh Jun 13 '13 at 21:59
  • Actually, I think this might work, though should "Check if line in array appears as a part of your line in FileB" say "Check if line in array appears as a part of your line in FileA"? – Astron Jun 13 '13 at 22:11
  • @Astron Ahh crap! I was comparing line with value of an array instead of index. Try the updated solution please. I think it should fix the issue – jaypal singh Jun 13 '13 at 22:38