0

in our company we pull in inventory files from third parties. These files are in a fixed format, containing the 13-digit EAN (think UPC code) as well as other data. I also have a master list of EANs in our database.

I would like to compare the master file with the new file and remove all lines from the new file, which contain an EAN, which is not in the master.

Example: Master
1234567890123
4567890123456

New file 1234567890123 4567890123456
5678901234567 <- remove this one

The new file contains data other than the EAN. The EAN is in the first column. The data is tab-separated.

I am currently doing this in PHP. The problem is both files have about 4 mn. rows each and my script is consuming a ton of memory. I currently load the whole master list into RAM and do isset()s.

Are there any smart linux tricks/programs which could help me?

jotango
  • 475
  • 1
  • 5
  • 10
  • What are the formats of these files -- Are they *just* lists of EANs, or is there other data (e.g. `EAN xxx yyy \n EAN zzz aaa`)? -- This afects the solution :) – voretaq7 Feb 08 '10 at 21:05
  • @voretaq7: "as well as other data" – Dennis Williamson Feb 08 '10 at 21:22
  • @Dennis Williamson: So... what's the format? :) (is the EAN the first column, the last, somewhere in the middle? Tab/Space/comma/pipe/whitespace/null delimited, etc? :) -- I'll take a stab at an answer but it will probably need modifying – voretaq7 Feb 08 '10 at 21:25

2 Answers2

1

Rephrasing the question in a more grep-friendly way, you want to print all lines which match an EAN from some master list of EANs.

Assuming that something resembling an EAN won't show up anywhere except in the EAN column, try:

  • Extract all the EANs from master
  • Squish that list of EANs into a regex
  • Feed the regex to egrep

Assuming the EAN is the first column of master (and that master contains other columns)

egrep "(`cat master | awk '{print $1}' | tr '\n' '\\|' | sed 's/|$//'`)" newfile

should come close (you can remove the awk if master is just an EAN list; nasty sed at the end to remove trailing | that results from the rest of the pipeline)

The above breaks down if EANs (or EAN-like 13-digit patterns) are present elsewhere in the data & would require a more complex regular expression to restrict the search to a specific column.

voretaq7
  • 79,879
  • 17
  • 130
  • 214
  • (correctness of the above bit of nasty verified in bourne shells - probably broken in C shells b/c of copious ugly :) – voretaq7 Feb 08 '10 at 21:47
  • Hi voretaq7, thanks for your answer! Without having tried, I would be scared to squish 4,5 mn. EANs into one regex... – jotango Feb 08 '10 at 22:11
  • 4.5M EANs is probably a bit much for *one* regex -- You can do head/tail trickery to split up master into chunks, but if you've got THAT many rows you might want to start looking at database-based solutions instead of scripts - an indexed `INNER JOIN` or `SELECT ... WHERE` would be much more space and time efficient :-) – voretaq7 Feb 08 '10 at 22:29
0

Try something like this:

# Put each code in one line, and sort them
sed -e 's/\ /\n/g' new | sort > neweans
sed -e '/s\ /\n/g' master | sort > mastereans

# Diff them by columns, and delete from the list
# the new's that are not in master. Then, print them

diff -y neweans mastereans | grep -v "<" | awk '{print $1}'
sntg
  • 1,450
  • 11
  • 15
  • hey sntg, thanks for your answer! How would I go about keeping the original data in the file? Your answer gives me the matching EANs, but I need the whole line... Unless I didn't understand your answer. – jotango Feb 08 '10 at 22:15
  • You got the EANS that are in neweans and in masterfile, files are untouched. Try to run the diff without the pipes and see the output. – sntg Feb 08 '10 at 22:44