File 1: 1356775 lines
File 2: 9516 lines

File 2 contains numbers, one per line; any line in File 1 whose leading digits match one of those numbers should be deleted from File 1. Example:

File 1

34234323432 some useless stuff
23423432342 more useless stuff
98989898329 foo bar blah
65367389473 one two three

File 2

234234323
653673894

New File

34234323432 some useless stuff
98989898329 foo bar blah

My approach right now is to

  1. Read the entire contents of File 2 into an array
  2. Get the first line of File 1 and extract its first 8 digits
  3. Loop through the entire array from step 1 to see if the digits from step 2 match any entry
  4. If the numbers don't match, write the line from step 2 into a new file
  5. If they match, break out of the loop and don't write the line to the new file
  6. Continue until there are no more lines to read from File 1
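The steps above can be sketched roughly as follows (the filenames, the helper name, and the key length derived from File 2 are illustrative, not from the original post; note the example above uses 9-digit keys):

```ruby
# Sketch of the array-based approach described in the steps above.
def filter_with_array(file1, file2, outfile)
  # Step 1: read all of File 2 into an array of keys.
  keys = File.readlines(file2).map(&:strip).reject(&:empty?)
  key_len = keys.first.to_s.length   # 9 digits in the example above

  File.open(outfile, "w") do |out|
    File.foreach(file1) do |line|    # steps 2 and 6
      prefix = line[0, key_len]      # step 2: leading digits
      # Steps 3-5: Array#include? is a linear scan of the whole
      # array for every single line of File 1 -- this is the O(N*M)
      # cost that makes the run so slow.
      out.write(line) unless keys.include?(prefix)
    end
  end
end
```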

However, since the file is so big, this takes an enormous amount of time: for each line in File 1 we loop through the entire array (9,516 elements). Is there a simpler way to do this type of file manipulation without putting the records into a DB table?

Omnipresent

2 Answers


Read File 2 into a Hash with the number as key and `true` as value. Hashes are designed to be fast at lookups - much faster than arrays.
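A minimal sketch of this suggestion, under the same assumptions as above (illustrative filenames and helper name, key length taken from File 2's entries):

```ruby
# Build a Hash keyed by the File 2 numbers, then do one constant-time
# lookup per File 1 line instead of a linear array scan.
def filter_with_hash(file1, file2, outfile)
  keys = {}
  key_len = nil
  File.foreach(file2) do |line|
    k = line.strip
    next if k.empty?
    key_len ||= k.length
    keys[k] = true           # number as key, true as value
  end
  key_len ||= 0              # File 2 empty: keep every line

  File.open(outfile, "w") do |out|
    File.foreach(file1) do |line|
      # Hash#key? is an O(1) lookup, so the whole job is O(N)
      # in the number of File 1 lines.
      out.write(line) unless keys.key?(line[0, key_len])
    end
  end
end
```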

steenslag
    Searching an Array for each line results in `O(N*M)` performance for N lines and M triggers, whereas with a Hash it's pretty much `O(N)` time. – tadman Feb 09 '12 at 15:48
  • As Hashes are implemented as search trees, it's `O(M*log(N))` for them. Still much faster for big N. – jupp0r Feb 09 '12 at 16:21
  • awesome. Great information guys. Just made the changes, let's see when it finishes. I'll update the post with results. – Omnipresent Feb 09 '12 at 16:21

You could also read File 1 in chunks instead of line by line, avoiding a lot of blocking IO.
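One way to batch the reads in Ruby is to slice the line stream (a sketch; the slice size, helper name, and parameters are arbitrary, and `keys`/`key_len` are assumed to come from the Hash built above):

```ruby
# Process File 1 in batches of lines rather than one line at a time.
# `keys` is a Hash of File 2 numbers, `key_len` their digit length.
def filter_in_chunks(file1, keys, key_len, outfile, slice_size = 10_000)
  File.open(outfile, "w") do |out|
    # File.foreach without a block returns an Enumerator, so we can
    # pull the lines through in fixed-size slices.
    File.foreach(file1).each_slice(slice_size) do |chunk|
      kept = chunk.reject { |line| keys.key?(line[0, key_len]) }
      out.write(kept.join)   # one write per chunk, not per line
    end
  end
end
```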

jupp0r