Questions tagged [fastparquet]

A Python interface to the Parquet file format.

141 questions
0 votes · 0 answers

asynchronous processing of data but sequential file save in multiprocessing

I'm processing a really large log file, e.g. 300 GB. I have a script that reads the file in chunks and asynchronously processes the data (I need to extract some key:value pairs from it) in a pool of processes, then saves it to a parquet file. def process_line(line:…
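
A minimal sketch (not the asker's script), assuming newline-delimited logs and a hypothetical parse_chunk() helper: workers parse chunks in parallel, while imap yields results in submission order, so the parent process appends row groups to the parquet file sequentially.

```python
from multiprocessing import Pool

import pandas as pd
from fastparquet import write

def parse_chunk(lines):
    # Hypothetical parser: pull key:value pairs out of each log line.
    rows = [dict(kv.split(":", 1) for kv in line.split() if ":" in kv)
            for line in lines]
    return pd.DataFrame(rows)

def read_chunks(path, size=100_000):
    # Yield the file in fixed-size batches of lines.
    with open(path) as f:
        chunk = []
        for line in f:
            chunk.append(line)
            if len(chunk) == size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

if __name__ == "__main__":
    with Pool() as pool:
        for i, df in enumerate(pool.imap(parse_chunk, read_chunks("big.log"))):
            write("out.parquet", df, append=(i > 0))  # sequential, ordered save
```
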
0 votes · 1 answer

Error converting column to bytes using encoding UTF8

I got the error below when writing a Dask dataframe to S3 and couldn't figure out why. Does anybody know how to fix it? dd.from_pandas(pred, npartitions=npart).to_parquet(out_path) The error is: Error converting column "team_nm" to bytes using encoding…
— Justin Shan · 81 · 1 · 2
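
fastparquet typically raises this kind of error when an object column holds values it cannot UTF-8 encode, such as None/NaN or numbers mixed in with strings. A minimal sketch of one common workaround (the cast is an assumption, not the accepted fix); pred, npart, and out_path stand in for the asker's variables:

```python
import dask.dataframe as dd
import pandas as pd

pred = pd.DataFrame({"team_nm": ["a", None, "b"]})  # stand-in for the asker's data
npart, out_path = 2, "out_parquet"                  # placeholder values

# Force every value in the offending column to a plain Python string first.
# Note the cast turns missing values into the literal string "None".
pred["team_nm"] = pred["team_nm"].astype(str)
dd.from_pandas(pred, npartitions=npart).to_parquet(out_path)
```
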
0 votes · 1 answer

Unable to write parquet with DATE as logical type for a column from pandas

I am trying to write a parquet file that contains one date column whose parquet logical type is DATE and physical type is INT32. I am writing the parquet file using pandas with fastparquet as the engine, since I need to stream the data from…
— Behroz Sikander · 3,885 · 3 · 22 · 36
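
fastparquet writes pandas datetime64 columns as parquet timestamps (INT64), so if DATE/INT32 is a hard requirement, one hedged option is to swap engines: pyarrow infers date32, which parquet stores as DATE/INT32, from plain Python date objects. A sketch with a made-up dataframe:

```python
import pandas as pd
import pyarrow.parquet as pq
from pyarrow import Table

df = pd.DataFrame({"d": pd.to_datetime(["2023-01-01", "2023-01-02"])})
df["d"] = df["d"].dt.date                # plain datetime.date objects
table = Table.from_pandas(df)            # column inferred as date32[day]
pq.write_table(table, "dates.parquet")
print(pq.read_schema("dates.parquet"))   # d: date32[day] -> DATE / INT32
```
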
0 votes · 1 answer

What is the best way to train a binary classifier on 1000 parquet files?

I'm training a binary classification model on a huge dataset in parquet format. The dataset is so large that I cannot fit all of the data into memory. Currently I am doing it as below, but I'm facing an out-of-memory problem. files =…
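
A minimal sketch of one out-of-core pattern, assuming a model that supports incremental learning (e.g. scikit-learn's SGDClassifier with partial_fit) and a hypothetical "label" column; only one file is in memory at a time:

```python
import glob

import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")
paths = sorted(glob.glob("data/*.parquet"))     # hypothetical file layout
for i, path in enumerate(paths):
    df = pd.read_parquet(path)                  # one file in memory at a time
    X, y = df.drop(columns=["label"]), df["label"]
    # classes must be declared on the first call to partial_fit.
    clf.partial_fit(X, y, classes=[0, 1] if i == 0 else None)
```
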
0 votes · 1 answer

Got a length error when using the read_parquet function

When I use the read_parquet method to read a parquet file, it raises the error Column 8 named hostIp expected length 548 but got length 549; hostIp is one of the columns in REQUIRED_COLUMNS. import pandas as pd REQUIRED_COLUMNS = [...] path = ... data =…
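
A hedged diagnostic sketch, not a confirmed fix: the message means the reader got more values than the metadata promised for that column, so swapping engines or dropping the column selection can show whether the file or the reader is at fault. The path and column list below are placeholders for the asker's values.

```python
import pandas as pd

path = "data.parquet"              # placeholder for the asker's path
REQUIRED_COLUMNS = ["hostIp"]      # placeholder subset of the real list

# If one engine succeeds where the other fails, the file itself is likely
# fine and the failure is reader-specific.
data = pd.read_parquet(path, columns=REQUIRED_COLUMNS, engine="fastparquet")
# data = pd.read_parquet(path, columns=REQUIRED_COLUMNS, engine="pyarrow")
# data = pd.read_parquet(path, engine="pyarrow")   # no column selection
```
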
0 votes · 0 answers

Parquet file created by fastparquet engine not understood by hive query

I am creating parquet files in AWS Lambda using fastparquet (a smaller library than pyarrow, and easy to work with in Lambdas). My parquet file has int32, string, and timestamp columns. I am getting a strange error; the date and integer fields are making me mad. Text…
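
One Hive-compatibility knob worth checking, offered as an assumption rather than the confirmed cause: fastparquet can store timestamps as INT96, which older Hive/Spark readers expect, via its times parameter. A sketch with a made-up dataframe:

```python
import pandas as pd
from fastparquet import write

df = pd.DataFrame({
    "id": pd.array([1, 2], dtype="int32"),
    "name": ["a", "b"],
    "ts": pd.to_datetime(["2023-01-01", "2023-01-02"]),
})
write("events.parquet", df, times="int96")  # INT96 timestamps for old readers
```
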
0 votes · 1 answer

Error installing tsflex on Mac: "Failed building wheel for fastparquet"

I've come across an issue while attempting to install the tsflex package on my Mac using pip3. After running pip3 install tsflex, I received the following error message: Collecting tsflex Using cached tsflex-0.1.1.9-py3-none-any.whl (50…
— Sira · 11 · 3

0 votes · 1 answer

Parquet timestamp overflow with fastparquet/pyarrow

I have a parquet file I am reading from S3 using fastparquet/pandas. The parquet file has a column with the date 2022-10-06 00:00:00, and I see it being read back as 1970-01-20 06:30:14.400. Please see the code, the error, and a screenshot of the parquet file…
— Bill · 363 · 3 · 14
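
The wrapped value in the question is consistent with a unit mix-up rather than a true overflow: 2022-10-06 00:00:00 is 1665014400 seconds since the epoch, and interpreting that number as milliseconds gives exactly 1970-01-20 06:30:14.400. A small sketch demonstrating the arithmetic:

```python
import pandas as pd

raw = pd.Series([1665014400])            # 2022-10-06 00:00:00 in epoch seconds
print(pd.to_datetime(raw, unit="ms"))    # 1970-01-20 06:30:14.400 (wrapped)
print(pd.to_datetime(raw, unit="s"))     # 2022-10-06 00:00:00 (correct unit)
```
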
0 votes · 1 answer

Dask ignores knowledge about divisions for parquet dataset

I've got a parquet dataset located in the directory "dataset_path" with an index column date. The metadata was created by dask and the relevant schema data looks as follows: date: timestamp[us] -- schema metadata -- pandas: '{"index_columns":…
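
A minimal sketch, assuming a recent dask version: read_parquet only loads divisions from the parquet metadata when explicitly asked, so without the flag the index can come back with unknown divisions. "dataset_path" and the date index come from the question.

```python
import dask.dataframe as dd

ddf = dd.read_parquet("dataset_path", index="date", calculate_divisions=True)
print(ddf.known_divisions)               # True if divisions were recovered
```
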
0 votes · 1 answer

How can I query parquet files with the Polars Python API?

I have a .parquet file and would like to use Python to quickly and efficiently query that file by a column. For example, I might have a column name in that .parquet file and want to get back the first (or all) of the rows with a chosen name. How…
— SamTheProgrammer · 1,051 · 1 · 10 · 28
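
A minimal sketch using Polars' lazy API: scan_parquet builds a lazy query, so the filter can be pushed down into the file scan and only matching data is materialized. File and column names are made up.

```python
import polars as pl

lf = pl.scan_parquet("people.parquet")
first_match = lf.filter(pl.col("name") == "Alice").head(1).collect()
all_matches = lf.filter(pl.col("name") == "Alice").collect()
print(first_match)
```
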
0 votes · 0 answers

Parquet file too wide to work with in PySpark

I have a large Parquet file with 25k columns that is about 10 GB. I'm trying to view it and convert some rows to CSV. All the tools I've tried have blown up (parquet-tools, fastparquet, pandas), so I'm using PySpark now but am running into Java out…
— Vishaal · 735 · 3 · 13
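
Since parquet is columnar, one hedged alternative to loading the whole 25k-column table is to read only the columns of interest with pyarrow; the file and column names below are hypothetical.

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("wide.parquet")
print(pf.schema_arrow.names[:10])            # list column names cheaply
table = pf.read(columns=["col_1", "col_2"])  # materialize only two columns
table.to_pandas().to_csv("subset.csv", index=False)
```
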
0 votes · 1 answer

How can I append dataframe data to an existing file in ADLS2 using fastparquet?

I have a file in ADLS2. Using the statement below, I am unable to append the data to the existing file. filepath =…
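
A minimal sketch of one way this is sometimes attempted, assuming the adlfs/fsspec stack and that the target file already exists: fastparquet.write accepts an append flag and a custom open_with hook. The account, container, and stand-in dataframe are placeholders, and whether append works against ADLS2 in the asker's setup is not guaranteed.

```python
import fsspec
import pandas as pd
from fastparquet import write

df = pd.DataFrame({"x": [1, 2]})                          # stand-in for the asker's data
fs = fsspec.filesystem("abfs", account_name="myaccount")  # credentials omitted

write(
    "container/path/data.parquet",
    df,
    append=True,          # requires an existing parquet file to extend
    open_with=fs.open,    # route fastparquet's I/O through adlfs
)
```
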
0 votes · 0 answers

How to compress a large csv "on the fly"?

I recently downloaded a CSV that turned out larger than I anticipated (the size wasn't available until the download finished). The file is >100 GB and my drive only has around 25 GB free at this point. Since CSV is not very space efficient, I'm…
— Khashir · 341 · 3 · 20
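
A minimal sketch of one streaming approach, assuming the rows are independent: read the CSV in chunks and append each chunk to a compressed parquet file, so memory holds only one chunk at a time. The parquet output must fit in the remaining free space before the CSV can be deleted; file names and chunk size are made up.

```python
import pandas as pd
from fastparquet import write

reader = pd.read_csv("huge.csv", chunksize=1_000_000)   # one chunk in memory
for i, chunk in enumerate(reader):
    write("huge.parquet", chunk, compression="SNAPPY", append=(i > 0))
```
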
0 votes · 1 answer

"ArrowInvalid: Can't unify schema with duplicate field names" read parquet files from s3 using dask;

Using a query, I dump data from Redshift to parquet: UNLOAD ('SELECT delivered_at, flow_name, variant_name, user_id') TO 's3://data/raw/redshift/all_campaigns' IAM_ROLE 'arn:aws:iam::XYZ:role/redshift'…
— Areza · 5,623 · 7 · 48 · 79
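
A hedged workaround sketch, assuming the duplicate field names live inside the individual files: load each part with pandas, drop repeated columns, and concatenate, which sidesteps pyarrow's schema unification. The S3 prefix comes from the question; s3fs is assumed for listing.

```python
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
parts = []
for key in fs.glob("data/raw/redshift/all_campaigns/*"):
    df = pd.read_parquet(f"s3://{key}")
    parts.append(df.loc[:, ~df.columns.duplicated()])  # keep first occurrence
result = pd.concat(parts, ignore_index=True)
```
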
0 votes · 1 answer

Error installing truefoundry experiment tracking library (pip install mlfoundry)

I tried to install mlfoundry on my Mac M1 laptop. It gives the following error: note: This error originates from a subprocess, and is likely not a problem with pip. ERROR: Failed building wheel for fastparquet Running setup.py clean for…