
I have two big files: the first one (10 GB) contains text with occurrences of keys in a specific format, {keyX}, and the second one (3 GB) contains the mapping between keys and their values (45 million entries).

file1:

Lorem ipsum {key1} sit amet, consectetur {key41736928} elit, ...

file2:

{key1} dolor
...
{key41736928} adipiscing
...

Given the size of the second file, I can't load all the key-value pairs into memory, but I also can't scan the entire second file for every occurrence of a key.

How can I substitute all the keys in the first file with the corresponding values from the second file in a reasonable amount of time?

Andrea Bergonzo

2 Answers


You could split the second file into multiple dictionaries and process the first file against each of them in turn. But how many dictionaries? I would run an experiment: process, say, 1 MB of data from the first file against varying amounts of the second (say 10 MB, 100 MB, 200 MB, 500 MB) to determine (a) whether there is a size at which your available resources can no longer cope, and (b) how the run time varies with dictionary size for this pair of files. Then judge whether this is a viable approach and, if so, what chunk size to use.
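As a rough illustration of the chunked approach, here is a minimal sketch. It assumes each mapping line is a key and a value separated by a single space, as in the question's sample; the function names and `chunk_size` parameter are made up for the example. It works on in-memory lists of lines to stay short; with the real files each pass would stream the text from disk into a temporary file instead.

```python
import re
from itertools import islice

# Keys look like {key123} in the question's sample data.
KEY_PATTERN = re.compile(r"\{key\d+\}")


def read_mapping_chunks(mapping_lines, chunk_size):
    """Yield dicts holding at most chunk_size key/value pairs each."""
    it = iter(mapping_lines)
    while True:
        chunk = {}
        for line in islice(it, chunk_size):
            # Assumed format: "<key> <value>" separated by one space.
            key, _, value = line.rstrip("\n").partition(" ")
            chunk[key] = value
        if not chunk:
            return
        yield chunk


def substitute_pass(text_lines, mapping):
    """Replace keys present in mapping; leave the others for a later pass."""
    return [
        KEY_PATTERN.sub(lambda m: mapping.get(m.group(0), m.group(0)), line)
        for line in text_lines
    ]


def substitute_all(text_lines, mapping_lines, chunk_size):
    """Run one substitution pass over the text per dictionary chunk."""
    for mapping in read_mapping_chunks(mapping_lines, chunk_size):
        text_lines = substitute_pass(text_lines, mapping)
    return text_lines
```

The number of passes over the first file equals the number of chunks, so timing a small sample with a few values of `chunk_size` is one way to carry out the experiment described above.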

Bill Bell

Use a binary search on the second file. It is sorted by key, so each lookup needs only O(log n) seeks.

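import os

# Placeholder path (not given in the original): the key/value file, sorted by
# key in plain byte order so that it matches Python's string comparison.
mid_name_file = 'file2.tsv'


def get_row_by_id(searched_row_id):
    """Binary-search the sorted file by byte offset and return the matching row."""
    # Binary mode: Python 3 only allows relative seeks on binary files.
    with open(mid_name_file, 'rb') as f:
        lo, hi = 0, os.path.getsize(mid_name_file)
        while lo < hi:
            f.seek((lo + hi) // 2, 0)  # absolute position
            seek_to(f, b'\n')          # back up to the start of this line
            line_start = f.tell()
            row_id, row = parse_row(f.readline().decode('utf-8'))

            if row_id == searched_row_id:
                return row
            elif searched_row_id < row_id:
                hi = line_start  # continue in the lines before this one
            else:
                lo = f.tell()    # continue in the lines after this one

    raise ValueError(searched_row_id)


def seek_to(f, c):
    # Step backwards until the byte just before the cursor is c,
    # or until the beginning of the file is reached.
    while f.tell() > 0:
        f.seek(-1, 1)
        if f.read(1) == c:
            break
        f.seek(-1, 1)


def parse_row(row):
    # Rows are tab-separated; the key is the first field.
    row = row.rstrip('\r\n')
    return row.split('\t')[0], row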
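To tie this back to the question, the following is a minimal sketch of how a per-key lookup could drive the substitution in the first file. The names `make_substituter` and `cache_size` are made up for the example, and `lookup` stands for any callable that returns the value for a key and raises `ValueError` for an unknown one; `get_row_by_id` returns the whole row, so it would need a thin wrapper that extracts the value. The cache is an addition, on the assumption that some keys repeat often enough to be worth remembering.

```python
import re
from functools import lru_cache

# Keys look like {key123} in the question's sample data.
KEY_PATTERN = re.compile(r"\{key\d+\}")


def make_substituter(lookup, cache_size=100_000):
    """Build a function that replaces every key in a line using lookup."""
    # Remember recent lookups so repeated keys do not hit the disk again.
    cached_lookup = lru_cache(maxsize=cache_size)(lookup)

    def replace(match):
        key = match.group(0)
        try:
            return cached_lookup(key)
        except ValueError:
            return key  # unknown key: leave it in the text unchanged

    def substitute(line):
        return KEY_PATTERN.sub(replace, line)

    return substitute
```

The first file can then be streamed line by line through the returned function and written to a new file, so neither file has to fit in memory.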
Andrea Bergonzo