
I have a big text file (millions of records) stored in bz2 format in a Minio bucket.

Currently I am processing it with the procedure below:

  1. Download the file from the Minio bucket;

  2. Partition the file per day based on the 'timestamp' column;

  3. Remove the empty/blank partitions using 'cull_empty_partitions()';

  4. Save the partitioned files to a local directory as .csv;

  5. Upload them back to the Minio bucket;

  6. Remove the files from the local workspace.

In this procedure I have to store the files in the local workspace, which I want to avoid.

All I want is to read the .txt or .bz2 files directly from my bucket, without going through the local workspace.

Then, name each partition based on the first date in its 'timestamp' column in the Dask dataframe and store the partitions directly back into the Minio bucket using Dask.
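What I have in mind is roughly the sketch below. The bucket and prefix names are placeholders, and the s3fs-style storage_options keys as well as the name_function usage are my assumptions; I have not been able to make the S3 part work yet:

import dask.dataframe as dd

col_names = [
    "id", "pro_id", "tr_id", "bo_id", "se", "lo", "timestamp", "ch"
]

# Assumed s3fs-style options pointing Dask at the Minio endpoint
minio_storage_options = {
    "key": "abcd",
    "secret": "sfsdfdfdcdfdfedfsdfsdf",
    "client_kwargs": {"endpoint_url": "https://blabalbla.com"},
}

# Read the bz2 file straight from the bucket; bz2 is not splittable,
# so blocksize=None and each input file becomes one partition
data = dd.read_csv(
    "s3://data-bucket/abc1/2001-01-01-logging_*.txt.bz2",  # placeholder path
    sep="\t", names=col_names, parse_dates=["timestamp"],
    compression="bz2", blocksize=None,
    storage_options=minio_storage_options,
)

ddf = data.set_index("timestamp").repartition(freq="1d").dropna()

# After repartition(freq='1d'), ddf.divisions holds the day boundaries,
# so each output file can be named after the first date in its partition
def name_by_first_date(i):
    return str(ddf.divisions[i].date())

ddf.to_csv(
    "s3://data-bucket/store/log_user_*.csv",   # placeholder prefix
    name_function=name_by_first_date,
    index=False,
    storage_options=minio_storage_options,
)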

Here is my current code:

import os
import dask.dataframe as dd
from datetime import date, timedelta

# 'config' and 'minioClient' are set up elsewhere (app settings and a
# configured Minio client)
path = '/a/desktop/workspace/project/log/'
bucket = config['data_bucket']['abc']
folder_prefix = config["folder_prefix"]["root"]
folder_store = config["folder_prefix"]["store"]

col_names = [
    "id", "pro_id", "tr_id", "bo_id", "se", "lo", "timestamp", "ch"
]

data = dd.read_csv(
    folder_prefix + 'abc1/2001-01-01-logging_*.txt',
    sep = '\t', names = col_names, parse_dates = ['timestamp'],
    low_memory = False
)

data['timestamp'] = dd.to_datetime(
    data['timestamp'], format='%Y-%m-%d %H:%M:%S',
    errors='ignore'
)

ddf = data.set_index('timestamp').repartition(freq='1d').dropna()

# Remove the partitions that came out empty after the daily split
ddf = cull_empty_partitions(ddf)

# Store the partitioned dask dataframe in the local workspace as csv files
o = ddf.to_csv("out_csv/log_user_*.csv", index=False)

# Upload the csv files to the Minio bucket
for each in o:
    if len(each) > 0:
        print(each.split("/")[-1])
        minioClient.fput_object(bucket, folder_store + each.split("/")[-1], each)
        # Remove the partitioned csv file from the local workspace
        os.remove(each)
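For reference, cull_empty_partitions() is not a built-in Dask function; it is a small helper roughly along these lines (a sketch, my exact version may differ):

def cull_empty_partitions(ddf):
    # Compute each partition's length, then rebuild the dataframe
    # from only the non-empty partitions.
    lengths = ddf.map_partitions(len).compute()
    non_empty = [
        part for part, n in zip(ddf.to_delayed(), lengths) if n > 0
    ]
    return dd.from_delayed(non_empty, meta=ddf._meta)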

I can use the code below to connect to the S3 endpoint and list the buckets:

import os
import boto3
from botocore.client import Config

# The signature-version flag must be set before the resource is created
os.environ['S3_USE_SIGV4'] = 'True'

s3 = boto3.resource('s3',
    endpoint_url='https://blabalbla.com',
    aws_access_key_id="abcd",
    aws_secret_access_key="sfsdfdfdcdfdfedfsdfsdf",
    config=Config(signature_version='s3v4'),
    region_name='us-east-1')

for bucket in s3.buckets.all():
    print(bucket.name)

However, when I try to read objects from the bucket with the code below, it does not respond.

df = dd.read_csv('s3://bucket/myfiles.*.csv')
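My guess (not verified) is that s3fs/Dask never learns about the Minio endpoint here, because only boto3 was configured above, so the endpoint and keys would need to be passed explicitly, e.g.:

df = dd.read_csv(
    's3://bucket/myfiles.*.csv',
    # assumed s3fs options, mirroring the boto3 settings above
    storage_options={
        "key": "abcd",
        "secret": "sfsdfdfdcdfdfedfsdfsdf",
        "client_kwargs": {"endpoint_url": "https://blabalbla.com"},
    },
)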

Any update in this regard will be highly appreciated. Thank you in advance!

  • Everything worked *except* for the last line? "it does not respond" - what happens, exactly? – mdurant Oct 04 '21 at 14:25
  • If you look at the code, it is reading a 400 MB .txt file that I already downloaded from the Minio bucket to my local workspace, using: data = dd.read_csv(folder_prefix + 'abc1/2001-01-01-logging_*.txt', sep='\t', names=col_names, parse_dates=['timestamp'], low_memory=False). But when the file is a 24 GB bz2, my system crashes. So I need a way to read the text file (24 GB, bz2 format) directly from Minio (not download it locally), split it, and store the pieces as separate files in Minio again, using as little local memory as possible. – MALAM Oct 04 '21 at 15:20
  • Please edit your question to say this! All the stuff about parsing times, saving files, uploading... are irrelevant. This is not even a CSV problem. – mdurant Oct 04 '21 at 15:48

0 Answers