In our company we pull in inventory files from third parties. These files are in a fixed format and contain a 13-digit EAN (similar to a UPC code) along with other data. I also have a master list of EANs in our database.
I would like to compare the master list with the new file and remove every line from the new file whose EAN is not in the master list.
Example:
Master
1234567890123
4567890123456
New file
1234567890123
4567890123456
5678901234567 <- remove this one
The new file contains data other than the EAN. The EAN is in the first column. The data is tab-separated.
I am currently doing this in PHP. The problem is that both files have about 4 million rows each, and my script consumes a lot of memory: it loads the whole master list into RAM and then does an isset() check for each EAN in the new file.
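To make the current approach concrete, here is a minimal sketch of the same logic, written in Python purely for illustration (the real script is PHP and is not shown here). The function name filter_by_master and the sample rows are hypothetical; the only assumptions taken from the description above are that the EAN is the first tab-separated column and that the master list holds one EAN per line.

```python
from typing import Iterable, Iterator


def filter_by_master(master_lines: Iterable[str],
                     new_lines: Iterable[str]) -> Iterator[str]:
    """Yield only those lines of the new file whose EAN is in the master list."""
    # Load every master EAN into a set. This is the memory-hungry step:
    # with about 4 million rows the whole set lives in RAM.
    master = {line.strip() for line in master_lines if line.strip()}

    # Stream the new file one line at a time; only the set is kept in memory.
    for line in new_lines:
        ean = line.split("\t", 1)[0].strip()  # EAN is the first tab-separated column
        if ean in master:                     # plays the role of PHP's isset() lookup
            yield line


if __name__ == "__main__":
    # Hypothetical sample data mirroring the example above.
    master = ["1234567890123\n", "4567890123456\n"]
    new = [
        "1234567890123\tfoo\n",
        "4567890123456\tbar\n",
        "5678901234567\tbaz\n",
    ]
    for kept in filter_by_master(master, new):
        print(kept, end="")
```

With real files, the two arguments would be open file handles, so the new file is never fully loaded. The master set is what dominates memory use.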
Are there any smart Linux tricks or programs that could help me?