6

Is it possible to use numpy.memmap to map a large disk-based array of strings into memory?

I know it can be done for floats and suchlike, but this question is specifically about strings.

I am interested in solutions for both fixed-length and variable-length strings.

The solution is free to dictate any reasonable file format.

NPE

2 Answers

6

If all the strings have the same length, as suggested by the term "array", this is easily possible:

import numpy

a = numpy.memmap("data", dtype="S10", mode="r")  # read-only; the default mode is "r+"

would be an example for strings of length 10.
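
As a minimal sketch of the fixed-width case, assuming a hypothetical file name `fixed.dat`, made-up sample strings, and a width of 10 bytes: write an array of byte strings with `tofile`, then map it back. Shorter strings are padded with null bytes to the full width, and NumPy strips the trailing nulls again when an element is read.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data; each item is padded with null bytes to 10 bytes.
words = np.array([b"alpha", b"beta", b"gamma"], dtype="S10")

path = os.path.join(tempfile.mkdtemp(), "fixed.dat")
words.tofile(path)  # raw bytes, no header: 3 items * 10 bytes = 30 bytes

# Map the file read-only; the number of items is inferred from the file size.
mapped = np.memmap(path, dtype="S10", mode="r")

print(len(mapped))  # 3
print(mapped[1])    # b'beta'
```

Strings longer than the chosen width are silently truncated when the array is created, so the width has to be at least the longest encoded string.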

Edit: Since apparently the strings don't have the same length, you need to index the file to allow for O(1) item access. This requires reading the whole file once and storing the start indices of all strings in memory. Unfortunately, I don't think there is a pure NumPy way of indexing without creating an array the same size as the file in memory first. This array can be dropped after extracting the indices, though.
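
One way to build such an index is sketched below, under the assumption that the strings are newline-terminated (the helper names `build_line_index` and `get_string` and the sample file are hypothetical). The comparison `data == 10` produces a temporary boolean array as large as the file, which is the overhead mentioned above; it can be freed once the offsets have been extracted.

```python
import os
import tempfile

import numpy as np


def build_line_index(path):
    """Return the byte offset at which each newline-terminated string starts."""
    data = np.memmap(path, dtype=np.uint8, mode="r")
    # Temporary boolean mask the size of the file (10 is the byte value of "\n").
    newline_pos = np.flatnonzero(data == 10)
    # A string starts at 0 or right after a newline; the last entry marks the end.
    return np.concatenate(([0], newline_pos + 1))


path = os.path.join(tempfile.mkdtemp(), "strings.txt")
with open(path, "wb") as f:
    f.write(b"Yep\nA longer string\nlast\n")

starts = build_line_index(path)
data = np.memmap(path, dtype=np.uint8, mode="r")


def get_string(i):
    # Slice the i-th string (without its trailing newline) out of the mapping.
    return data[starts[i]:starts[i + 1] - 1].tobytes().decode("utf-8")


print(get_string(1))  # A longer string
```

After the one-time scan, each lookup touches only the bytes of the requested string, so access is O(1) in the number of strings.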

Sven Marnach
  • @aix: At least it's a completely different situation then. O(1) access to a single element is only possible after indexing the whole file. How long are the strings on average? Are they separated by new line characters or null characters or something completely different? – Sven Marnach May 05 '11 at 11:20
  • I control the process that produces the file. It can write the data in any format that would make life easy for the reading process. – NPE May 05 '11 at 11:32
  • @aix: What are the lengths of the strings? What's the distribution of lengths? Would it be prohibitively wasteful to save them with a fixed width? – Sven Marnach May 05 '11 at 11:39
  • Fast-forward to 2018 -- is there any better way than fixing the string length? (For example can numpy interpret a start-index memmap as pointers into a flat array of bytes and treat it the same as an array of strings?) – user48956 Jun 14 '18 at 23:24
  • @user48956 In general, the answer is probably to use an actual database. – Sven Marnach Jul 03 '18 at 10:30
  • Thanks for that nugget of wisdom. – user48956 Jul 03 '18 at 14:47
2

The most flexible option would be to switch to a database or some other more complex on-disk file structure.

However, there's probably some good reason that you'd rather keep things as a plain text file...

Because you have control of how the files are created, one option is to simply write out a second file that only contains the starting positions (in bytes) of each string in the other file.

This would require a bit more work, but you could essentially do something like this:

class IndexedText:
    def __init__(self, filename, mode='r'):
        if mode not in ['r', 'w', 'a']:
            raise ValueError('Only read, write, and append are supported')
        # Binary mode keeps tell()/seek() positions as plain byte offsets.
        self._mainfile = open(filename, mode + 'b')
        # 'a+' lets us read the existing index before appending to it.
        self._idxfile = open(filename + 'idx', 'a+' if mode == 'a' else mode)

        if mode != 'w':
            self._idxfile.seek(0)
            self.indices = [int(line.strip()) for line in self._idxfile]
        else:
            self.indices = []

    def __enter__(self):
        return self

    def __exit__(self, type, value, traceback):
        self._mainfile.close()
        self._idxfile.close()

    def __getitem__(self, idx):
        position = self.indices[idx]
        self._mainfile.seek(position)
        # You might want to remove the automatic stripping...
        return self._mainfile.readline().decode('utf-8').rstrip('\n')

    def write(self, line):
        if not line.endswith('\n'):
            line += '\n'
        # Record where this string starts before writing it.
        position = self._mainfile.tell()
        self.indices.append(position)
        self._idxfile.write(str(position) + '\n')
        self._mainfile.write(line.encode('utf-8'))

    def writelines(self, lines):
        for line in lines:
            self.write(line)


def main():
    with IndexedText('test.txt', 'w') as outfile:
        outfile.write('Yep')
        outfile.write('This is a somewhat longer string!')
        outfile.write('But we should be able to index this file easily')
        outfile.write('Without needing to read the entire thing in first')

    with IndexedText('test.txt', 'r') as infile:
        print(infile[2])
        print(infile[0])
        print(infile[3])

if __name__ == '__main__':
    main()
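
The same two-file idea can also be expressed with `numpy.memmap` directly, which is what the question asks about. The following is only a sketch under assumed conventions: the strings are concatenated as UTF-8 bytes in one file, and a second file of `int64` offsets (one more entry than there are strings) records where each string starts and ends. The names `write_strings` and `MappedStrings` and the `.off` suffix are hypothetical.

```python
import os
import tempfile

import numpy as np


def write_strings(path, strings):
    """Write concatenated UTF-8 data to `path` and int64 offsets to `path + '.off'`."""
    encoded = [s.encode("utf-8") for s in strings]
    offsets = np.zeros(len(encoded) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(b) for b in encoded])
    with open(path, "wb") as f:
        f.write(b"".join(encoded))
    offsets.tofile(path + ".off")


class MappedStrings:
    """Read-only sequence of strings backed by two memory-mapped files."""

    def __init__(self, path):
        self._data = np.memmap(path, dtype=np.uint8, mode="r")
        self._offsets = np.memmap(path + ".off", dtype=np.int64, mode="r")

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        # Only the bytes between the two offsets are read from disk.
        start, end = self._offsets[i], self._offsets[i + 1]
        return self._data[start:end].tobytes().decode("utf-8")


path = os.path.join(tempfile.mkdtemp(), "strings.bin")
write_strings(path, ["Yep", "This is a somewhat longer string!", "naïve"])

strings = MappedStrings(path)
print(len(strings))  # 3
print(strings[2])    # naïve
```

Because the offsets are themselves memory-mapped, nothing has to be parsed at open time. The sketch does not handle negative indices, slices, or an empty data file.
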
Joe Kington