
How can I read a file where each record is a line of 10 comma-separated values (numbers and text), followed by 4096 bytes of binary data?

Something like this:

  117,47966,55,115,223,224,94,0,28,OK: 
  \00\00\00\F6\FF\EF\FFF\00\FA\FF\00\CA\FF\009\00Z\00\D9\FFF\00\E3\FF?\00\F0\FF\00\B1\FF\9D\FF\00:\00b\00\E9\FF*\00:\00\00)\00\D3\FF,\00\C6\FF\D6\FF2\00\00!\00\00\00\FE\FF\BA\FF[\00\E8\FF.\00\F7\FF\F9\FF\E6\FF\00\D3\FF\F8\FF\00&\00\

In the past, I've been using ConstBitStream to read pure binary files. I was wondering how I can read line by line and, every time I find 'OK:', use ConstBitStream to read the following 4096 bytes.

with open(filename, encoding="latin-1") as f:
    lines = f.readlines()
    for i in range(1, len(lines)):
        elements = lines[i].strip().split(',')
        if len(elements) == 10:
            readNext4096bytes()
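A minimal sketch of that idea (not from the original post): open the file in binary mode, read header lines with `readline()`, and read exactly 4096 bytes after each line ending in 'OK:'. Here `f.read(4096)` stands in for the hypothetical `readNext4096bytes()`:

```python
from io import BytesIO

def read_records(f):
    """Return (header_fields, payload) pairs from a mixed text/binary stream."""
    records = []
    while True:
        line = f.readline()              # reads up to and including b'\n'
        if not line:
            break                        # end of file
        fields = line.decode('latin-1').strip().split(',')
        if len(fields) == 10 and fields[-1] == 'OK:':
            payload = f.read(4096)       # the fixed-size binary block
            records.append((fields, payload))
    return records

# Tiny in-memory example standing in for the real file
sample = b'117,47966,55,115,223,224,94,0,28,OK:\n' + b'\xff' * 4096 + b'\n'
recs = read_records(BytesIO(sample))
print(len(recs), len(recs[0][1]))
```

Because the payload is consumed with a fixed-size `read(4096)`, any newline bytes inside the binary block never confuse the line-based parsing.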
rebrid

2 Answers


Let me know if this works:

import pandas as pd
from bitstring import ConstBitStream

# Read the csv with pandas, skipping malformed lines
df = pd.read_csv(filename, error_bad_lines=False, encoding='latin1')

# Take the last column (index 9) and cast every value to ConstBitStream
streams = df.iloc[:, 9].apply(ConstBitStream)
Laurens Koppenol
  • I get the following error: `ParserError: Error tokenizing data. C error: Expected 1 fields in line 31, saw 2` I have to say that in the file there are many lines that are not correct. In my attempt indeed I was filtering out only lines with the exact number of fields and also checking that the last one was in fact 'OK:' – rebrid Aug 20 '19 at 09:00
  • It is skipping many lines now, which could be correct but then it stops with `ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.` – rebrid Aug 20 '19 at 09:33
  • @brid what happens if you add `nrows=10` as kwarg to `pd.read_csv()` – Laurens Koppenol Aug 20 '19 at 11:28
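Following up on the tokenizing errors discussed in the comments, one option (not part of the original answer) is to pre-filter the lines in plain Python before handing them to pandas, keeping only the 10-field headers that end in 'OK:'. A sketch with made-up sample lines:

```python
import io
import pandas as pd

# Hypothetical pre-filter: keep only 10-field lines ending in 'OK:' before
# passing the text to pandas, which sidesteps the tokenizing errors
raw_lines = [
    '117,47966,55,115,223,224,94,0,28,OK:',
    'some malformed line',
    '1,2,3,4,5,6,7,8,9,OK:',
]
good = [l for l in raw_lines if len(l.split(',')) == 10 and l.endswith('OK:')]
df = pd.read_csv(io.StringIO('\n'.join(good)), header=None)
print(df.shape)
```

In the real file, `raw_lines` would come from reading with `encoding='latin-1'`, so the binary blocks decode without errors and are simply dropped by the filter.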

Say your file is like this:

    1,2,3,OK:
    4096 bytes
    5,6,7,OK:
    4096 bytes
    ...
  1. Read the file in binary mode: file = open(file_name, 'rb').read()
  2. Split on the marker: data = file.split(b',OK:\n')
    data is a list: [b'1,2,3', b'4096bytes\n4,5,6', b'4096bytes\n7,8,9', ..., b'4096bytes']
  3. For a typical element: bitarray, record = element[:4096], element[4096+1:] (the +1 skips the newline after the binary block)
    you have of course to special-case the first and the last element...
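The steps above can be sketched end-to-end on an in-memory example (the payload contents and record values are made up):

```python
# In-memory stand-in for the file: two csv records, each followed by 4096 bytes
payload1 = bytes(range(256)) * 16            # 4096 bytes of sample binary data
payload2 = b'\x00' * 4096
raw = b'1,2,3,OK:\n' + payload1 + b'\n' + b'5,6,7,OK:\n' + payload2

data = raw.split(b',OK:\n')                  # step 2: split on the marker
records = [data[0].decode('latin-1')]        # first element is the first record
payloads = []
for element in data[1:]:
    payloads.append(element[:4096])          # step 3: the binary block
    rest = element[4096 + 1:]                # skip the newline after the block
    if rest:                                 # the last element has no next record
        records.append(rest.decode('latin-1'))

print(records)
print(len(payloads), len(payloads[0]))
```

Note how the first element of `data` is special-cased (pure record, no payload) and the last one has no trailing record, exactly as the answer warns.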

PS if your file consists of ONE record and ONE bitarray, then data is simply
[b'1,2,3', b'4096bytes']

PPS If your binary data happens to contain b',OK:\n', the method above fails. There are 256**5 possible 5-byte sequences, and a 4096-byte block contains 4096 + 1 - 5 = 4092 of them, so the probability of this unfortunate collision *in a single binary record* is about 4092/256**5 ≈ 3.7216523196548223e-09. If you have a few records, that is probably OK; if you have a few million records, you not only need a lot of memory, the probability of an error is also no longer negligible.
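The arithmetic can be checked directly:

```python
# Probability that a random 4096-byte block contains the 5-byte marker
# b',OK:\n' at some position (union bound over the possible start offsets)
n_positions = 4096 + 1 - 5       # 4092 possible 5-byte windows
p = n_positions / 256**5         # each window matches with probability 256**-5
print(p)
```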

gboffi
  • It is indeed multiple records. I get this error though `ValueError: Must have exactly one of create/read/write/append mode and at most one plus` on the open function – rebrid Aug 20 '19 at 09:29
  • Oops, it's `open(..., 'rb')` . I have edited the answer. – gboffi Aug 20 '19 at 09:36
  • OK, now it reads it properly. What's the fastest way to store each record in a different array? Now they are in one list and it returns `IOPub data rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable `--NotebookApp.iopub_data_rate_limit`. Current values: NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec) NotebookApp.rate_limit_window=3.0 (secs)` – rebrid Aug 20 '19 at 09:48
  • I don't have your files or your hardware+software combination, so I don't think I can help you further. – gboffi Aug 20 '19 at 09:52