
I have a TSV file with headers and some values. I want to read it, decode its base64-encoded fields, and later save the result in NPZ format. My code is as follows:

import base64
import numpy as np
import csv
import sys
import zlib
import time
import mmap

csv.field_size_limit(sys.maxsize)
   
FIELDNAMES = ['image_id', 'image_w','image_h','num_boxes', 'boxes', 'features']
infile = '/content/drive/MyDrive/test2015_resnet101_faster_rcnn_genome.tsv.0'


if __name__ == '__main__':

    in_data = {}
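    # csv needs a text-mode file object in Python 3
    with open(infile, "r", newline="") as tsv_in_file:
        reader = csv.DictReader(tsv_in_file, delimiter='\t', fieldnames=FIELDNAMES)
        for item in reader:
            for field in ['features']:
                # base64.decodestring was removed in Python 3.9; b64decode accepts the str field directly
                item[field] = np.frombuffer(base64.b64decode(item[field]),
                      dtype=np.float32).reshape((int(item['num_boxes']),-1))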
            in_data[item['image_id']] = item['features']
    print (in_data)
    np.savez('data.npz', in_data)

Running the code, I get this error:

Error
Traceback (most recent call last)
in ()
29 #print(item['image_id'])
30 for field in ['boxes', 'features']:
---> 31 item[field] = np.frombuffer(base64.b64decode(item[field]), dtype=np.float32).reshape((int(item['num_boxes']),-1))
32 # print(item[field][0:1])
33 in_data[item['image_id']] = item

/usr/lib/python3.7/base64.py in b64decode(s, altchars, validate)
85 if validate and not re.fullmatch(b'[A-Za-z0-9+/]*={0,2}', s):
86 raise binascii.Error('Non-base64 digit found')
---> 87 return binascii.a2b_base64(s)
88
89

Error: Invalid base64-encoded string: number of data characters (360285) cannot be 1 more than a multiple of 4

I assume this conflicts with FIELDNAMES.

Any idea how to resolve this?

The TSV file that I have:

TSV_FILE

Column A has image_id followed by other field names.

The result that I am expecting:

NPZ_FILE

where the red boundary marks the image_id and the rest of the values are 'features'.
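To make the expected layout concrete, here is a minimal sketch of an NPZ file keyed by image_id. The ids, shapes, and values are made up for illustration and are not taken from the real file. It assumes the goal is one named array per image: passing the dict positionally, as in `np.savez('data.npz', in_data)`, stores the whole dict as a single object array named `arr_0`, while unpacking it with `**` stores each entry under its own key.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the decoded data: image_id -> (num_boxes, feature_dim) array.
in_data = {
    "1001": np.zeros((2, 4), dtype=np.float32),
    "1002": np.ones((3, 4), dtype=np.float32),
}

# Write to a temporary directory so the sketch does not overwrite a real data.npz.
out_path = os.path.join(tempfile.mkdtemp(), "data.npz")

# Unpacking the dict makes each image_id its own named array in the archive.
np.savez(out_path, **in_data)

# Read the archive back and look the features up by image_id.
with np.load(out_path) as archive:
    keys = sorted(archive.files)
    shape_1002 = archive["1002"].shape

print(keys, shape_1002)
```

The keys have to be strings for the `**` unpacking to work, which is already the case for values read through `csv.DictReader`.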

Timur Shtatland
RA FI
  • Can you edit your question to include the full traceback? – DrBwts Aug 08 '21 at 09:44
  • @DrBwts I just did – RA FI Aug 08 '21 at 10:00
  • `base64` encodings use 4 characters to encode 3 bytes & will pad out the data so you always get exact multiples of 4 characters. It looks like the original encoding of your `TSV` file isn't `base64` as at least one of the strings you are reading in has a length that isn't divisible by 4. As a result trying to decode as `base64` is failing. – DrBwts Aug 08 '21 at 10:44
  • @DrBwts Can I remove or skip the entry that is causing the issue? – RA FI Aug 08 '21 at 13:04
  • I could be wrong, but looking at your `TSV` pic, some of the entries are `int` and some are possibly `base64`-encoded strings. You need to be clear which is which when reading in your data. Not knowing how your data was generated or saved initially, I can't say more than that. Is it possible to supply some typical sample data? – DrBwts Aug 08 '21 at 13:26
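
The length rule from the comments can be checked directly. Below is a minimal sketch, not a confirmed fix for this file: `decode_features` is a hypothetical helper that returns `None` when a field is not valid base64 or does not reshape cleanly, so that a caller can skip the row. The sample strings are synthetic. Whether skipping is acceptable depends on why the field is malformed (for example, a truncated row), which cannot be determined from the question.

```python
import base64
import binascii

import numpy as np


def decode_features(field, num_boxes):
    """Return a (num_boxes, -1) float32 array, or None if the field cannot be decoded."""
    try:
        raw = base64.b64decode(field, validate=True)
        return np.frombuffer(raw, dtype=np.float32).reshape((num_boxes, -1))
    except (binascii.Error, ValueError):
        # binascii.Error: not valid base64 (bad length, padding, or characters).
        # ValueError: byte count does not fit float32 or the requested shape.
        return None


# Synthetic example: 6 float32 values = 24 bytes, which encode to 32 characters.
good = base64.b64encode(np.arange(6, dtype=np.float32).tobytes()).decode("ascii")

# Dropping 3 characters leaves 29, which is 1 more than a multiple of 4.
truncated = good[:-3]

good_result = decode_features(good, 2)
bad_result = decode_features(truncated, 2)

print(len(good) % 4, len(truncated) % 4)
print(good_result.shape, bad_result)
```

In the reading loop, a row for which the helper returns `None` could be logged by its `image_id` and skipped, which would also show how many rows are affected.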

0 Answers