
Situation: I have a CVD (ClamAV Virus Database) file loaded into RAM using mmap. Every line in the CVD file has the same format as a line in a CSV file, except that it is ':' delimited. Below is a snippet of the code:

def mapping():
    with open("main.cvd", 'rt') as f:
        global mapper
        mapper = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
        csv.register_dialect('delimit', delimiter=':', quoting=csv.QUOTE_NONE)

def compare(hashed):
    for row in csv.reader(mapper, dialect='delimit'):
        if row[1] == hashed:
            print('Found!')

Problem: When run, it returns the error _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

Question: How do I read a CSV file as text once it has been loaded into memory?

Additional information 1: I have tried using StringIO, but it throws the error TypeError: initial_value must be str or None, not mmap.mmap

Additional information 2: I need the file to be in RAM for faster access, and I cannot afford the time to read it line by line using functions such as readline()

martineau
Timothy Wong
  • have you tried (( with open("main.cvd", 'b') as f: )) instead? – Amrit Jun 08 '17 at 04:39
  • Doesn't work either. – Timothy Wong Jun 08 '17 at 04:42
  • What are you trying to do, exactly? A `csv.reader` *returns an iterator that will function much like reading it line by line using `readline`* – juanpa.arrivillaga Jun 08 '17 at 04:54
  • @juanpa.arrivillaga Well I am comparing the elements in csv.reader. Is readline() able to read from mmap? – Timothy Wong Jun 08 '17 at 04:59
  • But *why mmap*? How big is your file? AFAIK, mmap lets you treat data *on disk* as **if** it all were in main memory. But you might as well iterate over the file line-by-line using a file-object, if you are going to use `csv.reader`. That's *exactly what csv.reader does*. Am I missing something? – juanpa.arrivillaga Jun 08 '17 at 05:08
  • @juanpa.arrivillaga nope not really. I just thought mmap loads it directly to memory and it would be faster to access the file (about 200 comparisons that need to be done as fast as possible) – Timothy Wong Jun 08 '17 at 05:29
  • No, typically the use case of mmap is a file that *doesn't fit into memory*. How much memory are you working with? How big is your file? – juanpa.arrivillaga Jun 08 '17 at 05:31
  • @juanpa.arrivillaga I have 2GB memory and working with a 260MB file – Timothy Wong Jun 08 '17 at 05:33
  • @juanpa.arrivillaga but does it make a significant enough difference in speed and performance to matter if I load it to memory and reading it from file? – Timothy Wong Jun 08 '17 at 05:34
  • Yes, but that isn't necessarily what you're doing with `mmap`. You *could* presumably just read the file into memory. Anyway, I know `mmap` shouldn't really give you a performance benefit for sequential access, which is what it seems like you are trying to do. This entire thing reeks of the [XY-Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Your file is some sort of delimited file format, presumably, but what exactly are the contents? You could get really fast if it were representable as an array. – juanpa.arrivillaga Jun 08 '17 at 05:49
  • Honestly, just reading the entire thing into memory as a `bytes` object and using `.find` should be about as fast as you could go for what it looks like you are doing. You should provide as much detail as possible though, or wait until someone with CVD (ClamAV Virus Database) expertise happens by. – juanpa.arrivillaga Jun 08 '17 at 05:49
  • It is probably *slower* to read the file *then* process with CSV vs just directly reading the file **with** CSV. You are still subject to disk speed (with the mmap read) and you are still subject to the same processing speed of CSV. You are adding an additional constraint of `mmap`. Don't overcomplicate... *Premature optimization is the root of all evil* -- Donald Knuth – dawg Jun 08 '17 at 12:55
  • Using a different delimiter is usually done by simply specifying it when creating the `csv.reader` object using its `delimiter=` keyword rather than using `csv.register_dialect()` as you're doing. That said, you probably don't need to use the `csv` module, which ought to avoid the `TypeError` issue. Furthermore, using `mmap` is not the same as reading the entire file into RAM and it's very likely to not be any faster. The contents of a memory-mapped file still need to be read from the file; it just doesn't happen until the memory is actually accessed for the first time. – martineau Jun 08 '17 at 14:28
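
A minimal sketch of two suggestions from the comments above: passing `delimiter=` straight to `csv.reader` on an ordinary file object, and reading the whole file into a `bytes` object and searching it with `.find`. The function names are hypothetical, and the `bytes` version assumes every row has at least three fields, so that the hash in the second field is always followed by another ':'.

```python
import csv


def contains_hash_csv(path, hashed):
    """Scan the file row by row with csv.reader, stopping at the first match."""
    # newline='' is what the csv documentation recommends for files
    # that are handed to csv.reader.
    with open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter=":", quoting=csv.QUOTE_NONE)
        return any(len(row) > 1 and row[1] == hashed for row in reader)


def contains_hash_bytes(data, hashed):
    """Search an in-memory bytes object for `hashed` in the second field.

    Assumption: the hash is followed by another ':' (three or more fields).
    """
    needle = b":" + hashed.encode("ascii") + b":"
    pos = data.find(needle)
    while pos != -1:
        # Start of the line that contains this hit (0 if it is the first line).
        line_start = data.rfind(b"\n", 0, pos) + 1
        # The hit is in the second field only if no ':' precedes it on the line.
        if b":" not in data[line_start:pos]:
            return True
        pos = data.find(needle, pos + 1)
    return False
```

For the `bytes` version, the file would be read once with `open(path, "rb").read()` and the resulting object reused for every lookup. Whether either variant beats the `mmap` approach depends on the file and the access pattern, so it is worth timing both.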

1 Answer

The csvfile argument to the csv.reader constructor "can be any object which supports the iterator protocol and returns a string each time its __next__() method is called".

This means the "object" can be a generator function or a generator expression. In the code below I've implemented a generator function called mmap_file_reader() which converts the bytes in the memory map into character strings and yields each line it finds.

I made the mmap.mmap constructor call conditional so it would work on Windows, too. This shouldn't be necessary if you use the access= keyword instead of the prot= keyword, but I couldn't test that, so I did it as shown.

import csv
import mmap
import sys

def mapping():
    with open("main.cvd", 'rt') as f:
        global mapper
        if sys.platform.startswith('win32'):
            mmf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # windows
        else:
            mmf = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)  # unix
        mapper = mmap_file_reader(mmf)
        csv.register_dialect('delimit', delimiter=':', quoting=csv.QUOTE_NONE)

def mmap_file_reader(mmf):
    '''Yield successive lines of the given memory-mapped file as strings.

    Generator function which reads and converts the bytes of the given mmapped file
    to strings and yields them one line at a time.
    '''
    while True:
        line = mmf.readline()
        if not line:  # EOF?
            return
        yield str(line, encoding='utf-8')  # convert the bytes of the line read into a string

def compare(hashed):
    for row in csv.reader(mapper, dialect='delimit'):
        if row[1] == hashed:
            print('Found!')
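
One thing to watch in the code above: `mapper` is a generator, so it is exhausted after the first full pass, and a second call to `compare()` would see no rows. Below is a hedged variation, not the code as originally posted, that rewinds the map on every call and uses a generator expression in place of the generator function. It also passes `access=mmap.ACCESS_READ` on every platform, which the `mmap` documentation lists for both Windows and Unix. The name `open_mapping` and the changed `compare()` signature (it takes the map as an argument and returns a bool) are assumptions made for this sketch.

```python
import csv
import mmap


def open_mapping(path):
    """Memory-map the whole file read-only."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file. The map keeps its own handle,
        # so it remains usable after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def compare(mmf, hashed):
    """Return True if any row has `hashed` in its second field."""
    mmf.seek(0)  # rewind so that every call scans from the first line
    # iter(callable, sentinel) keeps calling readline() until it returns b''.
    lines = (raw.decode("utf-8") for raw in iter(mmf.readline, b""))
    for row in csv.reader(lines, delimiter=":", quoting=csv.QUOTE_NONE):
        if len(row) > 1 and row[1] == hashed:
            return True
    return False
```

Usage would look like `mmf = open_mapping("main.cvd")` once, followed by as many `compare(mmf, some_hash)` calls as needed.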
martineau
  • But I must ask, is this *really* going to be faster than simply reading the file into memory and using `splitlines`? – juanpa.arrivillaga Jun 08 '17 at 07:20
  • @juanpa.arrivillaga: It would probably not be faster than that; it depends on the size of the file, pattern of access, speed of file I/O, and the OS. I usually use memory-mapping for files when they are huge and I don't want to read the whole thing into memory all at once and/or want fast random access later, which would preclude doing what you suggest. For sequentially reading a file, it's probably not a good choice. The OP seems to think that a memory-mapped file is the same thing as a RAM-based file, which, of course, is not the case. – martineau Jun 08 '17 at 13:23
  • @martineau So from what I understand so far, I mistook `mmap` for loading a file into memory. But I have run some tests and found that the usual `open` and then `for ... in ...` without the use of mmap is faster - so I will stick to that. But your method works and answers my question so I'll mark it as an answer. Thank you so much – Timothy Wong Jun 09 '17 at 01:02
  • @Timothy: Thanks, if nothing else it can be a useful trick to know and is useful for other things. I once used it to handle csv files with multi-character delimiters that the `csv` module normally couldn't handle. For what you're doing, it might be worthwhile to build an index that maps hashed values to offsets in the .cvd file where the associated row of information starts. That way you could look up the offset, then `seek()` to it in the `mmap`ped file and read in just the row of data needed. The index itself would be much smaller and, if saved in a file, could be read in very quickly. – martineau Jun 09 '17 at 01:35
  • @martineau Apologies. I ran the test 20 times on 4000+ files to find the average performance and found that `mmap`-ing the files is consistently faster than `open` (the `open` code slows down every time it is run, from 15s in the first round to 30s in the last round, while the `mmap` version consistently runs at about 15s) – Timothy Wong Jun 09 '17 at 01:53
  • Timothy: Memory-mapped files use the OS's demand paging mechanism, so it's doing I/O at a different layer of the system, which might explain the more consistent behavior. Regardless, the slowing down of the `open` method doesn't make sense. If anything, execution should get faster. I don't know how much or often you call the `compare()` function, but it could be made many times faster if it's used more than once. – martineau Jun 09 '17 at 12:31
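
The index idea from the comments above could be sketched roughly as follows: one pass over the memory-mapped file records the byte offset at which each row starts, keyed by the value in its second field, and later lookups `seek()` straight to that offset. The names `build_index` and `lookup` are hypothetical, and the sketch assumes the hashes are ASCII and that only the first row for a given hash matters.

```python
import mmap


def build_index(mmf):
    """Map each second-field value (as bytes) to the offset where its row starts."""
    index = {}
    mmf.seek(0)
    offset = mmf.tell()
    for raw in iter(mmf.readline, b""):
        fields = raw.rstrip(b"\r\n").split(b":")
        if len(fields) > 1:
            # setdefault keeps the offset of the first row seen for each hash.
            index.setdefault(fields[1], offset)
        offset = mmf.tell()  # start of the next row
    return index


def lookup(mmf, index, hashed):
    """Return the matching row as a list of strings, or None if absent."""
    offset = index.get(hashed.encode("ascii"))
    if offset is None:
        return None
    mmf.seek(offset)
    return mmf.readline().decode("utf-8").rstrip("\r\n").split(":")
```

With the index built once, each of the roughly 200 comparisons becomes a dictionary lookup plus a single `readline()` instead of a full scan. If only a yes/no answer is needed, a plain `set` of the second-field values would do the same job without touching the map again.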