A Python interface to the Parquet file format.
Questions tagged [fastparquet]
141 questions
0 votes, 0 answers
asynchronous processing of data but sequential file save in multiprocessing
I'm processing a really large log file (e.g. 300 GB). I have a script that reads the file in chunks and processes the data asynchronously (I need to extract some key:values from it) in a pool of processes, then saves it to a parquet file.
def process_line(line:…

sarkafa
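A minimal sketch of one way to structure this, assuming a line-oriented log with key:value tokens; parse_chunk and read_chunks are hypothetical helpers, not the asker's code. pool.imap yields results in input order, so the parquet file is appended strictly sequentially even though chunks are parsed in parallel:

import multiprocessing as mp
import os
import pandas as pd
import fastparquet

def parse_chunk(lines):
    # hypothetical parser: pull key:value tokens out of each log line
    rows = [dict(kv.split(":", 1) for kv in line.split() if ":" in kv) for line in lines]
    return pd.DataFrame(rows)

def read_chunks(path, size=100_000):
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) >= size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

if __name__ == "__main__":
    out = "parsed.parquet"
    with mp.Pool() as pool:
        # imap preserves input order, so the writes below stay sequential
        for df in pool.imap(parse_chunk, read_chunks("big.log")):
            fastparquet.write(out, df, append=os.path.exists(out))

This assumes every chunk parses to the same columns; fastparquet's append will reject a schema mismatch.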
0 votes, 1 answer
Error converting column to bytes using encoding UTF8
I got the error below when writing a Dask dataframe to S3. I couldn't figure out why. Does anybody know how to fix it?
dd.from_pandas(pred, npartitions=npart).to_parquet(out_path)
The error is
error.. Error converting column "team_nm" to bytes using encoding…

Justin Shan
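The usual cause of this fastparquet error is an object column that mixes strings with non-strings (e.g. NaN floats). A hedged sketch, reusing pred, npart, and out_path from the question and assuming the fastparquet engine:

import dask.dataframe as dd

# force the offending column to plain strings before writing (NaN becomes the string "nan")
pred["team_nm"] = pred["team_nm"].astype(str)
dd.from_pandas(pred, npartitions=npart).to_parquet(
    out_path, engine="fastparquet", object_encoding="utf8"
)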
0 votes, 1 answer
Unable to write parquet with DATE as logical type for a column from pandas
I am trying to write a parquet file containing one date column whose logical type in parquet is DATE and physical type is INT32. I am writing the parquet file using pandas with fastparquet as the engine, since I need to stream the data from…

Behroz Sikander
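Whether fastparquet can emit a DATE/INT32 column is exactly what the question asks; as a point of comparison, pyarrow produces that pairing directly via its date32 type. A sketch with an illustrative column name:

import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# date32 columns are stored with physical type INT32 and logical type DATE
dates = pa.array([datetime.date(2023, 1, 15), datetime.date(2023, 2, 1)], type=pa.date32())
table = pa.table({"event_date": dates})
pq.write_table(table, "dates.parquet")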
0 votes, 1 answer
What is the best way to train a binary classification model with 1000 parquet files?
I'm training a binary classification model on a huge dataset in parquet format. However, there is so much data that I cannot fit it all into memory. Currently I am doing the following, but I'm running into an out-of-memory problem.
files =…

dgks0n
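One common out-of-core pattern is to stream the files one at a time through an estimator that supports partial_fit, so at most one file is ever in memory. A sketch assuming each file carries the same numeric feature columns plus a binary "label" column (illustrative names):

import glob
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
for i, path in enumerate(sorted(glob.glob("data/*.parquet"))):
    df = pd.read_parquet(path)
    X, y = df.drop(columns=["label"]), df["label"]
    # the class labels must be declared on the first call only
    model.partial_fit(X, y, classes=[0, 1] if i == 0 else None)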
0 votes, 1 answer
Got a length error when using the read_parquet function
When I use the read_parquet method to read a parquet file, it raises the error "Column 8 named hostIp expected length 548 but got length 549"; hostIp is one of the columns in REQUIRED_COLUMNS.
import pandas as pd
REQUIRED_COLUMNS = [...]
path = ...
data =…
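This error usually points at one inconsistent row group, where a column chunk's length disagrees with the row-group row count. One way to narrow it down is to read row group by row group; a sketch reusing path and REQUIRED_COLUMNS from the question:

from fastparquet import ParquetFile

pf = ParquetFile(path)
it = pf.iter_row_groups(columns=REQUIRED_COLUMNS)
for i in range(len(pf.row_groups)):
    try:
        df = next(it)
        print(f"row group {i}: {len(df)} rows OK")
    except Exception as exc:
        print(f"row group {i} failed: {exc}")
        break  # the generator is exhausted once it raises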
0 votes, 0 answers
Parquet file created by fastparquet engine not understood by hive query
I am creating parquet files in AWS Lambda using fastparquet (a smaller library than pyarrow and easy to work with in Lambdas). My parquet file has int32, string, and timestamp columns. I am getting a strange error: the date and integer fields are driving me mad. Text…

Ashish Kumar Mondal
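A frequent Hive-compatibility culprit is the timestamp encoding: older Hive readers expect the legacy INT96 layout, while fastparquet writes INT64 timestamps by default. A hedged sketch with an illustrative dataframe and path (this is one possible mismatch, not a confirmed diagnosis):

import pandas as pd
import fastparquet

df = pd.DataFrame({
    "id": pd.array([1, 2], dtype="int32"),
    "name": ["a", "b"],
    "ts": pd.to_datetime(["2023-01-01", "2023-01-02"]),
})
# times="int96" selects the legacy timestamp layout that older Hive/Impala readers expect
fastparquet.write("part-0.parquet", df, times="int96")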
0 votes, 1 answer
Error installing tsflex on Mac: "Failed building wheel for fastparquet"
I've come across an issue while attempting to install the tsflex package on my Mac using pip3. After running pip3 install tsflex, I received the following error message:
Collecting tsflex
Using cached tsflex-0.1.1.9-py3-none-any.whl (50…

Sira
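"Failed building wheel for fastparquet" usually means pip could not find a prebuilt wheel for this Python version and architecture and fell back to compiling from source. A generic sequence worth trying first, mirroring the question's own pip3 usage (not a guaranteed fix):

pip3 install --upgrade pip setuptools wheel   # a newer pip resolves more prebuilt wheels
pip3 install fastparquet                      # install the heavy dependency on its own first
pip3 install tsflex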
0 votes, 1 answer
Parquet timestamp overflow with fastparquet/pyarrow
I have a parquet file I am reading from S3 using fastparquet/pandas. The parquet file has a column with the date 2022-10-06 00:00:00, but I see it come back as 1970-01-20 06:30:14.400. Please see the code, the error, and a screenshot of the parquet file…

Bill
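The two values differ by exactly a factor of 1000, which suggests a units mix-up rather than corruption: 2022-10-06 00:00:00 is 1,665,014,400 seconds after the Unix epoch, and reading that same integer as milliseconds lands about 19.27 days after the epoch, i.e. 1970-01-20 06:30:14.400. A sketch of the arithmetic:

import pandas as pd

raw = 1_665_014_400
print(pd.to_datetime(raw, unit="s"))   # 2022-10-06 00:00:00  (correct interpretation)
print(pd.to_datetime(raw, unit="ms"))  # 1970-01-20 06:30:14.400000  (the reported symptom)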
0 votes, 1 answer
Dask ignores knowledge about divisions for parquet dataset
I've got a parquet dataset located in the directory "dataset_path" with an index column date.
The metadata was created by dask and the relevant schema data looks as follows:
date: timestamp[us]
-- schema metadata --
pandas: '{"index_columns":…

Dask Apprentice
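In recent Dask versions, divisions are only reconstructed from the parquet min/max statistics when explicitly requested. A sketch assuming dask >= 2022.x, where calculate_divisions replaced the older gather_statistics flag:

import dask.dataframe as dd

ddf = dd.read_parquet("dataset_path", index="date", calculate_divisions=True)
print(ddf.known_divisions)  # True if the statistics yielded sorted divisions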
0 votes, 1 answer
How can I query parquet files with the Polars Python API?
I have a .parquet file, and would like to use Python to quickly and efficiently query that file by a column.
For example, I might have a column name in that .parquet file and want to get back the first (or all) of the rows with a chosen name.
How…

SamTheProgrammer
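With Polars, scan_parquet builds a lazy query, so only the columns and row groups the filter needs are actually read. A sketch with illustrative file and column names:

import polars as pl

lf = pl.scan_parquet("people.parquet")                     # lazy: nothing is read yet
matches = lf.filter(pl.col("name") == "Alice").collect()   # all rows with the chosen name
first = lf.filter(pl.col("name") == "Alice").head(1).collect()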
0 votes, 0 answers
Parquet file too wide to work with in PySpark
I have a large Parquet file with 25k columns that is about 10GB. I'm trying to view it, and convert some rows to CSV.
All the tools I've tried have blown up (parquet-tools, fastparquet, pandas) so I'm using PySpark now but am running into Java out…

Vishaal
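Because parquet is columnar, even a 25k-column file can be sampled cheaply by reading only a column subset and a bounded number of rows, with no cluster involved. A sketch using pyarrow with illustrative names:

import pyarrow.parquet as pq

pf = pq.ParquetFile("wide.parquet")
cols = pf.schema_arrow.names[:50]                           # first 50 of the 25k columns
batch = next(pf.iter_batches(batch_size=100, columns=cols)) # 100 rows, 50 columns
batch.to_pandas().to_csv("sample.csv", index=False)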
0 votes, 1 answer
How can I append dataframe data to an existing file in ADLS2 using fastparquet?
I have a file in ADLS2. Using the statement below, I am unable to append data to the existing file.
filepath =…
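fastparquet's write accepts append=True together with an fsspec open function, which is one route to ADLS Gen2. A hedged sketch assuming adlfs is installed, "myaccount" is a hypothetical storage account with credentials supplied via the environment, df is the dataframe to append, and the existing file was itself written by fastparquet:

import fsspec
import fastparquet

fs = fsspec.filesystem("abfs", account_name="myaccount")   # hypothetical account
fastparquet.write(
    "container/folder/data.parquet",
    df,
    append=True,              # the existing footer is read and new row groups are added
    open_with=fs.open,
)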
0 votes, 0 answers
How to compress a large csv "on the fly"?
I recently downloaded a CSV that turned out larger than I anticipated (the size wasn't available until the download finished). The file is >100 GB and my drive only has around 25 GB free at this point.
Since CSV is not very space efficient, I'm…

Khashir
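Streaming compression never holds more than one buffer in memory, so free disk only needs to cover the compressed output. A minimal sketch with gzip (illustrative filenames):

import gzip
import shutil

with open("big.csv", "rb") as src, gzip.open("big.csv.gz", "wb", compresslevel=6) as dst:
    shutil.copyfileobj(src, dst, length=16 * 1024 * 1024)  # stream in 16 MB chunks

Converting chunk by chunk to parquet is the analogous columnar route: pd.read_csv with chunksize, writing each chunk with fastparquet.write(..., append=True).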
0 votes, 1 answer
"ArrowInvalid: Can't unify schema with duplicate field names" read parquet files from s3 using dask;
Using the query below, I dump data from Redshift to parquet:
UNLOAD
('
SELECT
delivered_at
, flow_name
, variant_name
, user_id
') TO
's3://data/raw/redshift/all_campaigns'
IAM_ROLE 'arn:aws:iam::XYZ:role/redshift'
…

Areza
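Since the error comes from unifying the per-file schemas, a first diagnostic step is to print each part file's schema and spot the field that is duplicated or differs across files. A sketch with s3fs and pyarrow, using the bucket prefix from the question:

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()
for path in fs.glob("data/raw/redshift/all_campaigns/*"):
    with fs.open(path, "rb") as f:
        print(path, pq.ParquetFile(f).schema_arrow)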
0 votes, 1 answer
Error installing truefoundry experiment tracking library (pip install mlfoundry)
I tried to install mlfoundry on my Mac M1 laptop. It fails with the following error:
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for fastparquet
Running setup.py clean for…