Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
4
votes
2 answers

How to control whether pyarrow.dataset.write_dataset will overwrite previous data or append to it?

I am trying to use the pyarrow.dataset.write_dataset function to write data into hdfs. However, if I write into a directory that already exists and has some data, the data is overwritten as opposed to a new file being created. Is there a way to "append"…
ira
  • 2,542
  • 2
  • 22
  • 36
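
A likely fix, sketched below under the assumption of pyarrow >= 6.0 (which added the existing_data_behavior flag to write_dataset): combine it with a unique basename_template so new part files land alongside the old ones instead of replacing them. Local paths stand in for HDFS here.

    import uuid

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"a": [1, 2, 3]})

    # 'overwrite_or_ignore' leaves existing files alone; the uuid in the
    # basename keeps new part files from colliding with previous ones.
    ds.write_dataset(
        table,
        "existing_dir",
        format="parquet",
        basename_template=f"part-{uuid.uuid4()}-{{i}}.parquet",
        existing_data_behavior="overwrite_or_ignore",
    )
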
4
votes
1 answer

Can not save pandas dataframe to parquet with lists of floats as cell value

I have a dataframe with a structure like this: Coumn1 Coumn2 0 (0.00030271668219938874, 0.0002655923890415579... (0.0016430083196610212,…
white91wolf
  • 400
  • 4
  • 18
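
One common fix, as a sketch (the column names and values below are stand-ins, not the asker's exact data): convert the tuples to lists and pass an explicit list_ schema so pyarrow doesn't have to infer the nested type.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({
        "Column1": [(0.0003, 0.0002), (0.0016, 0.0007)],
        "Column2": [(0.1, 0.2), (0.3, 0.4)],
    })

    # Tuples often trip up type inference; plain lists map cleanly to pa.list_().
    for col in df.columns:
        df[col] = df[col].apply(list)

    schema = pa.schema([(c, pa.list_(pa.float64())) for c in df.columns])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "out.parquet")
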
4
votes
2 answers

Write to parquet row by row in Python

I obtain messages in an async loop, and from each message I parse a row, which is a dictionary. I would like to write these rows into parquet. To implement this, I do the following: fields = [('A', pa.float64()), ('B', pa.float64()), ('C', pa.float64()),…
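
The usual pattern here is pq.ParquetWriter, which keeps the file open so batches can be appended as they arrive. A sketch, assuming pyarrow >= 7 (for RecordBatch.from_pylist); buffering several rows per write keeps row groups from becoming tiny:

    import pyarrow as pa
    import pyarrow.parquet as pq

    fields = [("A", pa.float64()), ("B", pa.float64()), ("C", pa.float64())]
    schema = pa.schema(fields)

    rows = [{"A": 1.0, "B": 2.0, "C": 3.0}]  # stand-in for the async message stream

    # Each write_batch call produces a row group, so buffer rows
    # rather than writing them one at a time.
    with pq.ParquetWriter("rows.parquet", schema) as writer:
        buffer = []
        for row in rows:
            buffer.append(row)
            if len(buffer) >= 1000:
                writer.write_batch(pa.RecordBatch.from_pylist(buffer, schema=schema))
                buffer = []
        if buffer:
            writer.write_batch(pa.RecordBatch.from_pylist(buffer, schema=schema))
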
4
votes
1 answer

Can I access a Parquet file via index without reading the entire file into memory?

I just read that HDF5 allows you to seek into data without reading the entire file into memory. Is this seeking behavior possible in Parquet files without Java (non-pyspark solutions)? I am using Parquet because of the strong dtype…
Kermit
  • 4,922
  • 4
  • 42
  • 74
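
Parquet doesn't support arbitrary row seeks, but pyarrow can read individual row groups and columns without touching the rest of the file, which gets close. A sketch (file and column names are placeholders):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")

    # Only the footer metadata is read here, not the data pages.
    print(pf.metadata.num_rows, pf.metadata.num_row_groups)

    # Load a single row group and a single column; the rest stays on disk.
    chunk = pf.read_row_group(0, columns=["some_column"])
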
4
votes
5 answers

How to update data in pyarrow table?

I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this: for col_name in table2.column_names: if col_name in my_columns: print('updating values in…
raphael75
  • 2,982
  • 4
  • 29
  • 44
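
pyarrow tables are immutable, so "updating" means computing a new column and swapping it in with set_column, which returns a new table. A minimal sketch (the multiply-by-10 transformation is made up for illustration):

    import pyarrow as pa
    import pyarrow.compute as pc

    table2 = pa.table({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
    my_columns = {"x"}

    for col_name in table2.column_names:
        if col_name in my_columns:
            idx = table2.schema.get_field_index(col_name)
            # set_column returns a new table; rebind rather than mutate.
            table2 = table2.set_column(idx, col_name,
                                       pc.multiply(table2[col_name], 10))
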
4
votes
0 answers

Dask Dataframe from parquet files: OSError: Couldn't deserialize thrift: TProtocolException: Invalid data

I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline I read a dataframe from disk using dask.dataframe.read_parquet, apply a transformation to add columns using…
Michael Wheeler
  • 849
  • 1
  • 10
  • 29
4
votes
0 answers

How does brotli achieve better parquet file compression on INT64 than INT32?

I ran a few experiments where I saved a DataFrame of random integers to parquet with brotli compression. One of my tests was to find the size ratio between storing as 32-bit integers vs 64-bit: df = pd.DataFrame( np.random.randint(0, 10000000,…
A. Rocke
  • 41
  • 2
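
A hypothetical reconstruction of that experiment (sizes and names are my own, not the asker's): write the same values at both widths with brotli compression and compare file sizes.

    import os

    import numpy as np
    import pandas as pd

    vals = np.random.randint(0, 10_000_000, size=1_000_000)

    # Same values stored at two widths, both brotli-compressed.
    for dtype in ("int32", "int64"):
        pd.DataFrame({"x": vals.astype(dtype)}).to_parquet(
            f"test_{dtype}.parquet", engine="pyarrow", compression="brotli"
        )

    print(os.path.getsize("test_int64.parquet")
          / os.path.getsize("test_int32.parquet"))
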
4
votes
1 answer

Writing Pandas df to Pyarrow Parquet table results in 'out of bounds' timestamp issue

I am receiving an out of bounds timestamp error message when attempting to convert a pandas dataframe to a pyarrow Table and write to a parquet dataset. From some researching, it seems to be a result of pandas using nanosecond precision and…
user9074332
  • 2,336
  • 2
  • 23
  • 39
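
A common workaround, sketched under the assumption that nanosecond precision is indeed the cause: coerce timestamps down to millisecond precision at write time and allow truncation.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-01 00:00:00.123456789"])})
    table = pa.Table.from_pandas(df)

    # Dropping to millisecond precision sidesteps the nanosecond-range
    # overflow; allow_truncated_timestamps suppresses the sub-ms error.
    pq.write_table(
        table,
        "out.parquet",
        coerce_timestamps="ms",
        allow_truncated_timestamps=True,
    )
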
4
votes
1 answer

pyarrow add column to pyarrow table

I have a pyarrow table named final_table of shape (6132, 7). I want to add a column to this table: list_ = ['IT'] * 6132 final_table.append_column('COUNTRY_ID', list_) but I am getting the following error: ArrowInvalid: Added column's length must match…
qaiser
  • 2,770
  • 2
  • 17
  • 29
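
Two things typically need fixing here (a sketch): append_column wants an Arrow array rather than a plain Python list, and it returns a new table instead of modifying in place.

    import pyarrow as pa

    final_table = pa.table({"x": list(range(6132))})

    # A bare Python list is treated as a list of chunks, hence the
    # length-mismatch error; wrap the values in an Arrow array instead,
    # and rebind the result since tables are immutable.
    country = pa.array(["IT"] * 6132)
    final_table = final_table.append_column("COUNTRY_ID", country)
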
4
votes
1 answer

How can I change the name of a column in a parquet file using Pyarrow?

I have several hundred parquet files created with PyArrow. Some of those files, however, have a field/column with a slightly different name (we'll call it Orange) than the original column (call it Sporange), because one used a variant of the query…
mbourgon
  • 1,286
  • 2
  • 17
  • 35
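
Parquet files can't be renamed in place, but reading, renaming, and rewriting is compact with rename_columns. A sketch, reusing the column names from the question:

    import pyarrow.parquet as pq

    table = pq.read_table("one_file.parquet")

    # rename_columns takes the complete list of names, in schema order.
    fixed = ["Sporange" if name == "Orange" else name
             for name in table.column_names]
    pq.write_table(table.rename_columns(fixed), "one_file_renamed.parquet")
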
4
votes
1 answer

How to properly create an apache plasma store from within a python program?

If I run the following as a program: import subprocess subprocess.run(['plasma_store -m 10000000000 -s /tmp/plasma'], shell=True, capture_output=True) and then run the program that uses the plasma store in a separate terminal, everything works…
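
A sketch of the usual fix: launch the store with subprocess.Popen (no shell) so it stays up in the background for the lifetime of the program. Note that plasma has been deprecated and removed in recent Arrow releases, so this applies to older pyarrow versions.

    import subprocess
    import time

    # Popen leaves the store running; run() blocks until the store exits.
    proc = subprocess.Popen(
        ["plasma_store", "-m", "10000000000", "-s", "/tmp/plasma"]
    )
    time.sleep(1)  # give the store a moment to start listening

    # ... connect clients to /tmp/plasma here ...

    proc.terminate()
    proc.wait()
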
4
votes
2 answers

Save date column with NAT(null) from pandas to parquet

I need to read nullable integer-format date values ('YYYYMMDD') into pandas and then save the dataframe to Parquet as Date32[Day] format, so that the Athena Glue Crawler classifier recognizes that column as a date. The code below does not…
Yun Ling
  • 113
  • 1
  • 8
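
One way to get there, as a sketch with made-up sample values: parse the integers through a nullable dtype, convert valid entries to Python dates, and build a date32 array with None for the NaT slots.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    raw = pd.Series([20210101, None, 20211231], dtype="Int64")

    # errors='coerce' turns unparseable/missing entries into NaT.
    ts = pd.to_datetime(raw.astype("string"), format="%Y%m%d", errors="coerce")

    # date32 wants date objects (or None); NaT slots become nulls.
    dates = pa.array(
        [t.date() if pd.notna(t) else None for t in ts],
        type=pa.date32(),
    )
    pq.write_table(pa.table({"my_date": dates}), "dates.parquet")
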
4
votes
1 answer

Is it possible to read parquet files from S3 access point using pyarrow

It is possible to read parquet files from S3 as shown here or here. I am working with S3 access points. Having an S3 access point ARN, is it possible to read parquet files from it? I am trying with the following sample code: import s3fs import…
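
As far as I know, recent s3fs/botocore versions accept an access point ARN wherever a bucket name would go, so something like the sketch below may work. The ARN, account ID, and path are placeholders.

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()

    # The access point ARN substitutes for the bucket name.
    path = ("arn:aws:s3:us-east-1:123456789012:accesspoint/my-access-point"
            "/prefix/data.parquet")
    table = pq.read_table(path, filesystem=fs)
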
4
votes
2 answers

How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv and always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the encoding of the file but it still…
matthewmturner
  • 566
  • 7
  • 21
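
pyarrow.csv does expose this, just not as a top-level keyword: ReadOptions has an encoding field (UTF-8 by default). A sketch with a placeholder filename:

    import pyarrow.csv as pv

    # The input is transcoded from latin-1 to UTF-8 before parsing.
    table = pv.read_csv(
        "file.dat",
        read_options=pv.ReadOptions(encoding="latin-1"),
    )
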
4
votes
3 answers

Generating large DataFrame in a distributed way in pyspark efficiently (without pyspark.sql.Row)

The problem boils down to the following: I want to generate a DataFrame in pyspark using existing parallelized collection of inputs and a function which given one input can generate a relatively large batch of rows. In the example below I want to…
Alexander Pivovarov
  • 4,850
  • 1
  • 11
  • 34
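
One Arrow-backed way to do this in Spark 3+ is mapInPandas, which lets each input row fan out into an arbitrarily large pandas batch without constructing Row objects. A sketch with a made-up generator function:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    inputs = spark.createDataFrame([(i,) for i in range(100)], ["seed"])

    def expand(batches):
        # Each incoming batch of seeds yields a large generated frame;
        # Arrow moves the pandas data between the JVM and Python workers.
        for pdf in batches:
            for seed in pdf["seed"]:
                yield pd.DataFrame({"seed": seed, "value": range(10_000)})

    out = inputs.mapInPandas(expand, schema="seed long, value long")
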