Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
0 votes · 0 answers

asynchronous processing of data but sequential file save in multiprocessing

I'm processing a really large log file, e.g. 300 GB. I have a script that reads the file in chunks and asynchronously processes the data (I need to extract some key:value pairs from it) in a pool of processes, then saves it to a parquet file. def process_line(line:…
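
A minimal sketch (not the asker's script), assuming newline-delimited logs and a hypothetical parse_chunk() helper: workers parse chunks in parallel, while imap yields results in submission order, so the parent process appends row groups to the parquet file sequentially.

```python
from multiprocessing import Pool

import pandas as pd
from fastparquet import write

def parse_chunk(lines):
    # Hypothetical parser: pull key:value pairs out of each log line.
    rows = [dict(kv.split(":", 1) for kv in line.split() if ":" in kv)
            for line in lines]
    return pd.DataFrame(rows)

def read_chunks(path, size=100_000):
    # Yield the file in fixed-size batches of lines.
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

if __name__ == "__main__":
    with Pool() as pool:
        for i, df in enumerate(pool.imap(parse_chunk, read_chunks("big.log"))):
            write("out.parquet", df, append=(i > 0))  # sequential, ordered save
```
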
0 votes · 1 answer

Error converting column to bytes using encoding UTF8

I got the error below when writing a Dask dataframe to S3 and couldn't figure out why. Does anybody know how to fix it? dd.from_pandas(pred, npartitions=npart).to_parquet(out_path) The error is: Error converting column "team_nm" to bytes using encoding…
— Justin Shan · 81 · 1 · 2
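
fastparquet typically raises this kind of error when an object column holds values it cannot UTF-8 encode, such as None/NaN or numbers mixed in with strings. A minimal sketch of one common workaround (the cast is an assumption, not the accepted fix); pred, npart, and out_path stand in for the asker's variables:

```python
import dask.dataframe as dd
import pandas as pd

pred = pd.DataFrame({"team_nm": ["a", None, "b"]})  # stand-in for the asker's data
npart, out_path = 2, "out_parquet"                  # placeholder values

# Force every value in the offending column to a plain Python string first.
# Note the cast turns missing values into the literal string "None".
pred["team_nm"] = pred["team_nm"].astype(str)
dd.from_pandas(pred, npartitions=npart).to_parquet(out_path)
```
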
0 votes · 1 answer

Unable to write parquet with DATE as logical type for a column from pandas

I am trying to write a parquet file that contains one date column whose parquet logical type is DATE and physical type is INT32. I am writing the parquet file using pandas with fastparquet as the engine, since I need to stream the data from…
— Behroz Sikander · 3,885 · 3 · 22 · 36
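
fastparquet writes pandas datetime64 columns as parquet timestamps (INT64), so if DATE/INT32 is a hard requirement, one hedged option is to swap engines: pyarrow infers date32, which parquet stores as DATE/INT32, from plain Python date objects. A sketch with a made-up dataframe:

```python
import pandas as pd
import pyarrow.parquet as pq
from pyarrow import Table

df = pd.DataFrame({"d": pd.to_datetime(["2023-01-01", "2023-01-02"])})
df["d"] = df["d"].dt.date                # plain datetime.date objects
table = Table.from_pandas(df)            # column inferred as date32[day]
pq.write_table(table, "dates.parquet")
print(pq.read_schema("dates.parquet"))   # d: date32[day] -> DATE / INT32
```
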
0 votes · 1 answer

What is the best way to train a binary classifier on 1000 parquet files?

I'm training a binary classification model on a huge dataset in parquet format. The dataset is so large that I cannot fit all of the data into memory. Currently I am doing it as below, but I'm facing an out-of-memory problem. files =…
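
A minimal sketch of one out-of-core pattern, assuming a model that supports incremental learning (e.g. scikit-learn's SGDClassifier with partial_fit) and a hypothetical "label" column; only one file is in memory at a time:

```python
import glob

import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")
paths = sorted(glob.glob("data/*.parquet"))     # hypothetical file layout
for i, path in enumerate(paths):
    df = pd.read_parquet(path)                  # one file in memory at a time
    X, y = df.drop(columns=["label"]), df["label"]
    # classes must be declared on the first call to partial_fit.
    clf.partial_fit(X, y, classes=[0, 1] if i == 0 else None)
```
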
0 votes · 1 answer

Got a length error when using the read_parquet function

When I use the read_parquet method to read a parquet file, it raises the error Column 8 named hostIp expected length 548 but got length 549; hostIp is one of the columns in REQUIRED_COLUMNS. import pandas as pd REQUIRED_COLUMNS = [...] path = ... data =…
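
A hedged diagnostic sketch, not a confirmed fix: the message means the reader got more values than the metadata promised for that column, so swapping engines or dropping the column selection can show whether the file or the reader is at fault. The path and column list below are placeholders for the asker's values.

```python
import pandas as pd

path = "data.parquet"              # placeholder for the asker's path
REQUIRED_COLUMNS = ["hostIp"]      # placeholder subset of the real list

# If one engine succeeds where the other fails, the file itself is likely
# fine and the failure is reader-specific.
data = pd.read_parquet(path, columns=REQUIRED_COLUMNS, engine="fastparquet")
# data = pd.read_parquet(path, columns=REQUIRED_COLUMNS, engine="pyarrow")
# data = pd.read_parquet(path, engine="pyarrow")   # no column selection
```
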
0 votes · 0 answers

Parquet file created by fastparquet engine not understood by hive query

I am creating parquet files in AWS Lambda using fastparquet (a smaller library than pyarrow, and easy to work with in Lambdas). My parquet file has int32, string, and timestamp columns. I am getting a strange error; the date and integer fields are making me mad. Text…
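
One Hive-compatibility knob worth checking, offered as an assumption rather than the confirmed cause: fastparquet can store timestamps as INT96, which older Hive/Spark readers expect, via its times parameter. A sketch with a made-up dataframe:

```python
import pandas as pd
from fastparquet import write

df = pd.DataFrame({
    "id": pd.array([1, 2], dtype="int32"),
    "name": ["a", "b"],
    "ts": pd.to_datetime(["2023-01-01", "2023-01-02"]),
})
write("events.parquet", df, times="int96")  # INT96 timestamps for old readers
```
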
0 votes · 1 answer

Error installing tsflex on Mac: "Failed building wheel for fastparquet"

I've come across an issue while attempting to install the tsflex package on my Mac using pip3. After running pip3 install tsflex, I received the following error message: Collecting tsflex Using cached tsflex-0.1.1.9-py3-none-any.whl (50…
— Sira · 11 · 3

0 votes · 1 answer

Parquet timestamp overflow with fastparquet/pyarrow

I have a parquet file I am reading from S3 using fastparquet/pandas. The parquet file has a column with the date 2022-10-06 00:00:00, and I see it being read back as 1970-01-20 06:30:14.400. Please see the code, the error, and a screenshot of the parquet file…
— Bill · 363 · 3 · 14
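
The wrapped value in the question is consistent with a unit mix-up rather than a true overflow: 2022-10-06 00:00:00 is 1665014400 seconds since the epoch, and interpreting that number as milliseconds gives exactly 1970-01-20 06:30:14.400. A small sketch demonstrating the arithmetic:

```python
import pandas as pd

raw = pd.Series([1665014400])            # 2022-10-06 00:00:00 in epoch seconds
print(pd.to_datetime(raw, unit="ms"))    # 1970-01-20 06:30:14.400 (wrapped)
print(pd.to_datetime(raw, unit="s"))     # 2022-10-06 00:00:00 (correct unit)
```
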
0 votes · 1 answer

Dask ignores knowledge about divisions for parquet dataset

I've got a parquet dataset located in the directory "dataset_path" with an index column date. The metadata was created by dask and the relevant schema data looks as follows: date: timestamp[us] -- schema metadata -- pandas: '{"index_columns":…
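
A minimal sketch, assuming a recent dask version: read_parquet only loads divisions from the parquet metadata when explicitly asked, so without the flag the index can come back with unknown divisions. "dataset_path" and the date index come from the question.

```python
import dask.dataframe as dd

ddf = dd.read_parquet("dataset_path", index="date", calculate_divisions=True)
print(ddf.known_divisions)               # True if divisions were recovered
```
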
0 votes · 1 answer

How can I query parquet files with the Polars Python API?

I have a .parquet file and would like to use Python to quickly and efficiently query that file by a column. For example, I might have a column name in that .parquet file and want to get back the first (or all) of the rows with a chosen name. How…
— SamTheProgrammer · 1,051 · 1 · 10 · 28
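
A minimal sketch using Polars' lazy API: scan_parquet builds a lazy query, so the filter can be pushed down into the file scan and only matching data is materialized. File and column names are made up.

```python
import polars as pl

lf = pl.scan_parquet("people.parquet")
first_match = lf.filter(pl.col("name") == "Alice").head(1).collect()
all_matches = lf.filter(pl.col("name") == "Alice").collect()
print(first_match)
```
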
0 votes · 0 answers

Parquet file too wide to work with in PySpark

I have a large Parquet file with 25k columns that is about 10 GB. I'm trying to view it and convert some rows to CSV. All the tools I've tried have blown up (parquet-tools, fastparquet, pandas), so I'm using PySpark now but am running into Java out…
— Vishaal · 735 · 3 · 13
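
Since parquet is columnar, one hedged alternative to loading the whole 25k-column table is to read only the columns of interest with pyarrow; the file and column names below are hypothetical.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("wide.parquet")
print(pf.schema_arrow.names[:10])            # list column names cheaply
table = pf.read(columns=["col_1", "col_2"])  # materialize only two columns
table.to_pandas().to_csv("subset.csv", index=False)
```
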
0 votes · 1 answer

How can I append dataframe data to an existing file in ADLS2 using fastparquet?

I have a file in ADLS2. Using the statement below, I am unable to append the data to the existing file. filepath =…
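
A minimal sketch of one way this is sometimes attempted, assuming the adlfs/fsspec stack and that the target file already exists: fastparquet.write accepts an append flag and a custom open_with hook. The account, container, and stand-in dataframe are placeholders, and whether append works against ADLS2 in the asker's setup is not guaranteed.

```python
import fsspec
import pandas as pd
from fastparquet import write

df = pd.DataFrame({"x": [1, 2]})                          # stand-in for the asker's data
fs = fsspec.filesystem("abfs", account_name="myaccount")  # credentials omitted

write(
    "container/path/data.parquet",
    df,
    append=True,          # requires an existing parquet file to extend
    open_with=fs.open,    # route fastparquet's I/O through adlfs
)
```
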
0 votes · 0 answers

How to compress a large csv "on the fly"?

I recently downloaded a CSV that turned out larger than I anticipated (the size wasn't available until the download finished). The file is >100 GB and my drive only has around 25 GB free at this point. Since CSV is not very space efficient, I'm…
— Khashir · 341 · 3 · 20
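
A minimal sketch of one streaming approach, assuming the rows are independent: read the CSV in chunks and append each chunk to a compressed parquet file, so memory holds only one chunk at a time. The parquet output must fit in the remaining free space before the CSV can be deleted; file names and chunk size are made up.

```python
import pandas as pd
from fastparquet import write

reader = pd.read_csv("huge.csv", chunksize=1_000_000)   # one chunk in memory
for i, chunk in enumerate(reader):
    write("huge.parquet", chunk, compression="SNAPPY", append=(i > 0))
```
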
0 votes · 1 answer

"ArrowInvalid: Can't unify schema with duplicate field names" read parquet files from s3 using dask;

Using a query, I dump data from Redshift to parquet: UNLOAD ('SELECT delivered_at, flow_name, variant_name, user_id') TO 's3://data/raw/redshift/all_campaigns' IAM_ROLE 'arn:aws:iam::XYZ:role/redshift'…
— Areza · 5,623 · 7 · 48 · 79
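
A hedged workaround sketch, assuming the duplicate field names live inside the individual files: load each part with pandas, drop repeated columns, and concatenate, which sidesteps pyarrow's schema unification. The S3 prefix comes from the question; s3fs is assumed for listing.

```python
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
parts = []
for key in fs.glob("data/raw/redshift/all_campaigns/*"):
    df = pd.read_parquet(f"s3://{key}")
    parts.append(df.loc[:, ~df.columns.duplicated()])  # keep first occurrence
result = pd.concat(parts, ignore_index=True)
```
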
0 votes · 1 answer

Error installing truefoundry experiment tracking library (pip install mlfoundry)

I tried to install mlfoundry on my Mac M1 laptop. It gives the following error: note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for fastparquet Running setup.py clean for…