I have a TSV file with headers and some values. I want to read it, decode its base64-encoded fields, and later save the result in NPZ format. My code is as follows:
import base64
import numpy as np
import csv
import sys
import zlib
import time
import mmap

csv.field_size_limit(sys.maxsize)
FIELDNAMES = ['image_id', 'image_w','image_h','num_boxes', 'boxes', 'features']
infile = '/content/drive/MyDrive/test2015_resnet101_faster_rcnn_genome.tsv.0'

if __name__ == '__main__':
    in_data = {}
    with open(infile, "r+b") as tsv_in_file:
        reader = csv.DictReader(tsv_in_file, delimiter='\t', fieldnames = FIELDNAMES)
        for item in reader:
            for field in ['features']:
                item[field] = np.frombuffer(base64.decodestring(item[field]),
                                            dtype=np.float32).reshape((int(item['num_boxes']),-1))
            in_data[item['image_id']] = item['features']
    print (in_data)
    np.savez('data.npz', in_data)
Running the code, I get this error:
Error                                     Traceback (most recent call last)
in ()
     29 #print(item['image_id'])
     30 for field in ['boxes', 'features']:
---> 31 item[field] = np.frombuffer(base64.b64decode(item[field]), dtype=np.float32).reshape((int(item['num_boxes']),-1))
     32 # print(item[field][0:1])
     33 in_data[item['image_id']] = item

/usr/lib/python3.7/base64.py in b64decode(s, altchars, validate)
     85 if validate and not re.fullmatch(b'[A-Za-z0-9+/]*={0,2}', s):
     86 raise binascii.Error('Non-base64 digit found')
---> 87 return binascii.a2b_base64(s)
     88
     89

Error: Invalid base64-encoded string: number of data characters (360285) cannot be 1 more than a multiple of 4
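The message itself can be checked with a little arithmetic: base64 encodes every 3 bytes as 4 characters, so the number of data characters in a valid string can never be 1 more than a multiple of 4. A minimal sketch, assuming nothing about the file beyond the reported count; the helper name `base64_length_remainder` is made up for illustration:

```python
import base64
import binascii


def base64_length_remainder(n_chars: int) -> int:
    """Return n_chars modulo 4 (hypothetical helper, for illustration only)."""
    return n_chars % 4


# The character count reported in the traceback above.
remainder = base64_length_remainder(360285)

# A 5-character string has the same remainder and is rejected the same way.
try:
    base64.b64decode(b"QUJDR")
    raised = False
except binascii.Error:
    raised = True
```

A remainder of 1 may mean the field was truncated or altered somewhere before decoding, but that is an assumption and not something the traceback proves.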
I assume it conflicts with FIELDNAMES. Any idea how to resolve this?
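One detail that may bear on the FIELDNAMES suspicion: when `fieldnames` is passed to `csv.DictReader`, the first line of the file is returned as an ordinary data row instead of being consumed as a header. A small sketch with an in-memory TSV; the shortened column list and the sample row are invented for illustration:

```python
import csv
import io

# Shortened, hypothetical column list; the real file has more columns.
FIELDNAMES = ['image_id', 'num_boxes', 'features']

# Invented two-line TSV: a header row followed by one data row.
sample = "image_id\tnum_boxes\tfeatures\n42\t1\tAAAAAA==\n"

# With explicit fieldnames, the header line comes back as the first row.
rows = list(csv.DictReader(io.StringIO(sample), delimiter='\t', fieldnames=FIELDNAMES))
first_row_is_header = rows[0]['image_id'] == 'image_id'

# Skipping the header line explicitly leaves only the data row.
stream = io.StringIO(sample)
next(stream)
data_rows = list(csv.DictReader(stream, delimiter='\t', fieldnames=FIELDNAMES))
```

If the real file does start with a header line, the loop would try to base64-decode the header text as well, so skipping it (or dropping `fieldnames` and letting the reader use the header) might be worth trying.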
The TSV file that I have: column A has image_id, followed by the other field names.
The result that I am expecting: the red boundary represents the image_id, and the rest of the values are 'features'.
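A sketch of that expected result under stated assumptions: Python 3 (so the TSV is read as text and `base64.b64decode` stands in for `base64.decodestring`, which was removed in Python 3.9), and one array per image_id in the archive. The sample row is invented, and an in-memory buffer stands in for both the real file and 'data.npz':

```python
import base64
import csv
import io

import numpy as np

FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

# Invented sample: one image with 2 boxes and 3 feature values per box.
features = np.arange(6, dtype=np.float32).reshape(2, 3)
boxes = np.zeros((2, 4), dtype=np.float32)
row = ['7', '640', '480', '2',
       base64.b64encode(boxes.tobytes()).decode('ascii'),
       base64.b64encode(features.tobytes()).decode('ascii')]
tsv_text = '\t'.join(row) + '\n'

in_data = {}
reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t', fieldnames=FIELDNAMES)
for item in reader:
    # Decode the base64 text into raw bytes, then view them as float32 values.
    decoded = np.frombuffer(base64.b64decode(item['features']), dtype=np.float32)
    in_data[item['image_id']] = decoded.reshape(int(item['num_boxes']), -1)

# Passing the dict as keyword arguments stores each image_id as its own array.
buffer = io.BytesIO()
np.savez(buffer, **in_data)
buffer.seek(0)
restored = np.load(buffer)['7']
```

With a real file, `open(infile, 'r', newline='')` would replace the `io.StringIO` object and a filename would replace the buffer. Passing the dict positionally, as in `np.savez('data.npz', in_data)`, stores it as a single pickled object under the key `arr_0` rather than one array per image_id.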