
Goal: upload a Parquet file to MinIO, which requires converting the file to bytes.

I've been able to do this for .csv, .json and .txt:

bytes = data.to_csv().encode('utf-8')
bytes = json.dumps(self.data, indent=4, separators=(',', ': ')).encode('utf-8')
bytes = data.encode('utf-8')

MinioConn:

from minio import Minio


class MinioConn:
    def __init__(self,
                 host='foo.com:9000',
                 access_key='CENSORED', secret_key='CENSORED',
                 secure=False):
        self.host = host
        self.access_key = access_key
        self.secret_key = secret_key
        self.secure = secure

    def client(self):
        return Minio(self.host, self.access_key, self.secret_key,
                     secure=self.secure)

My Upload Code:

import pandas as pd
import io
from fastparquet import write

import MinioConn

filename = 'myfile.parquet'
# ---
df = pd.DataFrame(data=[['tom', 10], ['nick', 15], ['juli', 14]],
                  columns=['Name', 'Age'])
df.to_parquet(filename)
# ---

data = pd.read_parquet(filename)

bytes = data.encode('utf-8')
buffer = io.BytesIO(bytes)

bucket = 'synthetic-data-gen'

client = MinioConn().client()
client.put_object(bucket,
                f'foo/bar/{filename}',
                data=buffer,
                length=len(bytes),
                content_type='application/{}'.format(filename.split('.', 1)[1]))

Traceback:

Traceback (most recent call last):
  File "test.py", line 16, in <module>
    bytes = data.encode('utf-8')
  File "/home/me/miniconda3/envs/sdg/lib/python3.8/site-packages/pandas/core/generic.py", line 5139, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'encode'


Mark Rotteveel
DanielBell99

1 Answer


If you don't pass a path to DataFrame.to_parquet, it returns the Parquet file contents as bytes, so there is nothing to encode.

bytes_data = df.to_parquet()
buffer = io.BytesIO(bytes_data)

For older versions of pandas, write into an in-memory buffer instead:

buffer = io.BytesIO()
df.to_parquet(buffer)  # writes into the buffer; returns None when a target is given
buffer.seek(0)
0x26res