
So I hope this question hasn't already been answered, but I can't seem to figure out the right search term.

First, some background: I have tabular text data files that can easily climb into the tens of GBs. The computer processing them is already heavily loaded by the hours-long data collection (at up to 30-50 MB/s), since it is also doing device processing and control. Therefore, disk space and access are at a premium. We haven't moved from spinning disks to SSDs due to space constraints.

However, we are looking to do something with the just-collected data that doesn't need every data point. We were hoping to decimate the data and keep only every 1000th point. But loading these files (gigabytes each) puts a huge load on the disk, which is unacceptable because it could interrupt the live collection system.

I was wondering if it is possible to use a low-level method to access every nth byte (or some other method) in the file, the way a database does, because the file format is very well defined (two 64-bit doubles in each row). I understand that very low-level access might not work because the drive might be fragmented, but what would the best approach/method be? I'd prefer a solution in Python or Ruby because that's what the processing will be done in, but in theory R, C, or Fortran could also work.

Finally, upgrading the computer or hardware isn't an option; setting up the system took hundreds of man-hours, so only software changes can be made. It would be a longer-term project, but if a text file isn't the best way to handle this data, I'm open to other solutions too.

EDIT: We generate (depending on usage) anywhere from 50,000 lines (records)/sec to 5 million lines/sec; databases aren't feasible at this rate regardless.

lswim
  • why not just collect your data directly into a database? – MattDMo Jun 18 '14 at 20:24
  • Unfortunately, the instrument control and collection software has a horrible database interface that is incredibly slow, trying to create a record takes 50 ms and we are generating 5 million records a second. We could put the big data files in a blob after they are done, but that doesn't resolve our issues as the data is being collected live for many hours. – lswim Jun 18 '14 at 20:28
  • What OS are you on? This is a very interesting question. – Patrick Collins Jun 18 '14 at 20:30
  • Once again, another unfortunate issue with the collection software. It is only written for Windows. – lswim Jun 18 '14 at 20:31
  • I also forgot to mention: the data rate does vary more than I initially specified; sometimes the transfer rates are as low as 800 KB/s at about 50,000 lines/sec (50 kHz sampling). – lswim Jun 18 '14 at 20:42
  • ``numpy.memmap`` might be interesting for you. It is used for accessing small segments of large files on disk, without reading the entire file into memory (http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html). – Dietrich Jun 18 '14 at 20:56
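
A minimal sketch of the ``numpy.memmap`` idea from the comment above, assuming the data were written as raw binary with two float64 values per record (the filename is hypothetical):

import numpy as np

# Map the file without reading it all into memory; the dtype and layout are
# assumptions based on the question (two 64-bit floats per record, raw binary).
data = np.memmap("collected_data.bin", dtype=np.float64, mode="r")
records = data.reshape(-1, 2)            # view as (n_records, 2), no copy made

decimated = np.array(records[::1000])    # materialize only every 1000th record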

3 Answers


This should be doable using seek and read methods on a file object. Doing this will prevent the entire file from being loaded into memory, as you would only be working with file streams.

Also, since the files are well defined and predictable, you won't have any trouble seeking ahead N bytes to the next record in the file.

Below is an example; you can demo the code at http://dbgr.cc/o

with open("pretend_im_large.bin", "rb") as f:
    read_bytes = []

    # seek to the end of the file to find its size
    f.seek(0, 2)
    file_size = f.tell()

    # seek back to the beginning of the stream
    f.seek(0, 0)

    while f.tell() < file_size:
        # read the first byte of each 10-byte record...
        read_bytes.append(f.read(1))
        # ...then skip the remaining 9 bytes to the next record
        f.seek(9, 1)

print(read_bytes)

The code above assumes pretend_im_large.bin is a file with the contents:

A00000000
B00000000
C00000000
D00000000
E00000000
F00000000

The output of the code above is:

['A', 'B', 'C', 'D', 'E', 'F']
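
Adapting the same seek/read pattern to the layout described in the question (two 64-bit doubles per record, 16 bytes each), and assuming the data is stored as raw binary rather than text, a sketch of pulling out every 1000th record might look like this (the filename and little-endian byte order are assumptions):

import struct

RECORD_SIZE = 16      # two 64-bit doubles per record (layout from the question)
STEP = 1000           # keep every 1000th record

decimated = []
with open("collected_data.bin", "rb") as f:   # hypothetical filename
    f.seek(0, 2)                              # find the file size
    n_records = f.tell() // RECORD_SIZE

    for i in range(0, n_records, STEP):
        f.seek(i * RECORD_SIZE)               # jump straight to record i
        a, b = struct.unpack("<dd", f.read(RECORD_SIZE))   # little-endian assumed
        decimated.append((a, b))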
nOOb cODEr
  • Is `f.seek()` guaranteed to not read any of the intervening bytes? I imagine that's a very platform-dependent thing. I would be somewhat surprised if `f.seek()` knew how to advance by 9 bytes without actually reading 9 bytes. – Patrick Collins Jun 18 '14 at 20:52
  • I can't think of a platform that does not support file seeking as used here. If you find one, let me know? Regardless of how it is implemented behind the scenes, the point is that the entire file is not read into memory at the same time (nor are entire lines). For reference: the `lseek` call on Linux (http://linux.die.net/man/2/lseek), the `SetFilePointer` function on Windows (http://msdn.microsoft.com/en-us/library/windows/desktop/aa365541(v=vs.85).aspx), and the `lseek` call on Mac (https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/lseek.2.html) – nOOb cODEr Jun 18 '14 at 20:59
  • Just because the underlying OS supports it doesn't mean that `f.seek` has been implemented in such a way that takes advantage of it. I don't think Python provides any guarantee that it's going to make use of those OS features, so I would be hesitant to rely on it (without either examining the underlying C implementation or doing extensive testing) – Patrick Collins Jun 18 '14 at 21:01

I don't think Python gives you a strong guarantee that it won't actually read the entire file when you use `f.seek`. I think this is too platform- and implementation-specific to rely on Python. You should use Windows-specific tools that give you a guarantee of random access rather than sequential access.

Here's a snippet of Visual Basic that you can modify to suit your needs. You can define your own record type that's two 64-bit integers long. Alternatively, you can use a C# FileStream object and use its seek method to get what you want.

If this is performance-critical software, I think you need to make sure you're getting access to the OS primitives that do what you want. I can't find any references that indicate that Python's seek is going to do what you want. If you go that route, you need to test it to make sure it does what it seems like it should.
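
If staying in Python, one way to get closer to those OS primitives (not something this answer suggests, so treat it as a hedged alternative) is the `os` module, whose `os.lseek` and `os.read` are thin wrappers over the platform calls mentioned in the comments above. A sketch, with a hypothetical filename and the 16-byte record size from the question:

import os

RECORD_SIZE = 16    # two 64-bit doubles per record (layout from the question)
STEP = 1000         # keep every 1000th record

# O_BINARY only exists on Windows; default to 0 elsewhere.
fd = os.open("collected_data.bin", os.O_RDONLY | getattr(os, "O_BINARY", 0))
try:
    file_size = os.lseek(fd, 0, os.SEEK_END)        # absolute seek to the end
    for i in range(0, file_size // RECORD_SIZE, STEP):
        os.lseek(fd, i * RECORD_SIZE, os.SEEK_SET)  # jump straight to record i
        record = os.read(fd, RECORD_SIZE)           # raw 16-byte record
        # ... unpack/process the record here ...
finally:
    os.close(fd)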

Patrick Collins

Is the file human-readable text or in the native format of the computer (sometimes called binary)? If the files are text, you could reduce the processing load and file size by switching to the native format. Converting from the internal representation of floating-point numbers to human-readable numbers is CPU-intensive.

If the files are in native format, then it should be easy to skip around in the file, since each record will be 16 bytes. In Fortran, open the file with an open statement that includes form="unformatted", access="direct", recl=16. Then you can read an arbitrary record X, without reading the intervening records, via rec=X in the read statement.

If the file is text, you can also read it with direct I/O, but each pair of numbers might not always use the same number of characters (bytes); you can examine your files and answer that question. If the records are always the same length, then you can use the same technique, just with form="formatted". If the records vary in length, then you could read a large chunk and locate your numbers within the chunk.
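
If the text records do turn out to be a fixed length, the same direct-access idea can be sketched in Python as well (the record length, filename, and whitespace-separated layout are assumptions; measure the real line length, newline included, first):

RECORD_LEN = 42    # bytes per text line, including the newline (assumed fixed)
STEP = 1000        # keep every 1000th record

decimated = []
with open("collected_data.txt", "rb") as f:    # hypothetical filename
    f.seek(0, 2)                               # find the file size
    n_records = f.tell() // RECORD_LEN

    for i in range(0, n_records, STEP):
        f.seek(i * RECORD_LEN)                 # jump straight to line i
        line = f.read(RECORD_LEN).decode("ascii")
        x, y = (float(v) for v in line.split())   # two whitespace-separated numbers
        decimated.append((x, y))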

M. S. B.