I have a big text file (with millions of records) in bz2 format in a Minio bucket.
Now I am processing it with the procedure below:

1. Read the file from the Minio bucket.
2. Partition the data per day based on the 'timestamp' column.
3. Remove some of the empty/blank partitions using cull_empty_partitions().
4. Save the partitioned files to a local directory as .csv.
5. Upload them back to the Minio bucket.
6. Remove the files from the local workspace.
In this procedure I have to store the files in the local workspace, which I don't want. What I want is to read the .txt or .bz2 files straight from my bucket, without using the local workspace at all, then name each partition after the first date in its 'timestamp' column in the Dask dataframe, and store the partitions back directly into the Minio bucket using the Dask framework.
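Conceptually, I imagine the direct approach would look something like the sketch below (the bucket name, endpoint and credentials are placeholders, and I am assuming that Dask's storage_options is the right way to point its s3fs backend at the Minio endpoint for both reading and writing):

import dask.dataframe as dd

col_names = [
    "id", "pro_id", "tr_id", "bo_id", "se", "lo", "timestamp", "ch"
]

# Placeholder credentials/endpoint for the Minio (S3-compatible) service
storage_options = {
    "key": "abcd",
    "secret": "sfsdfdfdcdfdfedfsdfsdf",
    "client_kwargs": {"endpoint_url": "https://blabalbla.com"},
}

# Read the bz2-compressed, tab-separated logs straight from the bucket;
# blocksize=None because compressed files cannot be split into blocks
ddf = dd.read_csv(
    "s3://data-bucket/abc1/2001-01-01-logging_*.txt.bz2",
    sep="\t", names=col_names, compression="bz2", blocksize=None,
    storage_options=storage_options,
)

# ... partition per day exactly as in my current code ...

# Write the daily partitions straight back into the bucket, no local copy
ddf.to_csv(
    "s3://data-bucket/store/log_user_*.csv",
    index=False, storage_options=storage_options,
)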
Here is my code:
import os
import dask.dataframe as dd
from datetime import date, timedelta

# config is the project configuration loaded elsewhere;
# minioClient is an already-configured minio.Minio client.
path='/a/desktop/workspace/project/log/'
bucket= config['data_bucket']['abc']
folder_prefix = config["folder_prefix"]["root"]
folder_store = config["folder_prefix"]["store"]
col_names = [
"id", "pro_id", "tr_id", "bo_id", "se", "lo", "timestamp", "ch"
]
data = dd.read_csv(
    folder_prefix + 'abc1/2001-01-01-logging_*.txt',
    sep='\t', names=col_names, parse_dates=['timestamp'],
    low_memory=False
)
data['timestamp'] = dd.to_datetime(
    data['timestamp'], format='%Y-%m-%d %H:%M:%S',
    errors='ignore'
)
ddf = data.set_index('timestamp').repartition(freq='1d').dropna()
# Remove the partitions that come out empty after the split
ddf = cull_empty_partitions(ddf)
# Storing the partitioned dask files to local workspace as a csv
o = ddf.to_csv("out_csv/log_user_*.csv", index=False)
# Storing the file in minio bucket
for each in o:
    if len(each) > 0:
        print(each.split("/")[-1])
        minioClient.fput_object(bucket, folder_store + each.split("/")[-1], each)
        # Removing partitioned csv files from local workspace
        os.remove(each)
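For reference, cull_empty_partitions() is not a built-in Dask function; it is a small helper based on the commonly shared recipe sketched below (included only so the snippet above is reproducible):

import dask.dataframe as dd

def cull_empty_partitions(ddf):
    # Compute the row count of every partition, then rebuild the
    # dataframe from the non-empty partitions only.
    lengths = ddf.map_partitions(len).compute()
    parts = ddf.to_delayed()
    non_empty = [part for part, n in zip(parts, lengths) if n > 0]
    return dd.from_delayed(non_empty, meta=ddf._meta)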
I can use the code below to connect to the S3-compatible endpoint and list the buckets:
import boto3, botocore, os
from botocore.client import Config
from botocore.session import Session

s3 = boto3.resource(
    's3',
    endpoint_url='https://blabalbla.com',
    aws_access_key_id="abcd",
    aws_secret_access_key="sfsdfdfdcdfdfedfsdfsdf",
    config=Config(signature_version='s3v4'),
    region_name='us-east-1'
)
os.environ['S3_USE_SIGV4'] = 'True'

for bucket in s3.buckets.all():
    print(bucket.name)
However, when I try to read objects from the bucket with the code below, it does not respond.
df = dd.read_csv('s3://bucket/myfiles.*.csv')
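My understanding is that, without extra options, this call points s3fs at AWS rather than at my Minio endpoint. I assume it needs to be given the same endpoint and credentials through storage_options, roughly like the sketch below (same placeholder values as in the boto3 snippet), but I am not sure this is the correct approach for Minio:

df = dd.read_csv(
    's3://bucket/myfiles.*.csv',
    storage_options={
        'key': 'abcd',
        'secret': 'sfsdfdfdcdfdfedfsdfsdf',
        'client_kwargs': {'endpoint_url': 'https://blabalbla.com'},
    },
)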
Any help in this regard will be highly appreciated. Thank you in advance!