Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
4
votes
2 answers

How to control whether pyarrow.dataset.write_dataset will overwrite previous data or append to it?

I am trying to use the pyarrow.dataset.write_dataset function to write data into hdfs. However, if I write into a directory that already exists and has some data, the data is overwritten as opposed to a new file being created. Is there a way to "append"…
ira
  • 2,542
  • 2
  • 22
  • 36
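
A likely fix, sketched below under the assumption of pyarrow >= 6.0 (which added the existing_data_behavior flag to write_dataset): combine it with a unique basename_template so new part files land alongside the old ones instead of replacing them. Local paths stand in for HDFS here.

    import uuid

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"a": [1, 2, 3]})

    # 'overwrite_or_ignore' leaves existing files alone; the uuid in the
    # basename keeps new part files from colliding with previous ones.
    ds.write_dataset(
        table,
        "existing_dir",
        format="parquet",
        basename_template=f"part-{uuid.uuid4()}-{{i}}.parquet",
        existing_data_behavior="overwrite_or_ignore",
    )
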
4
votes
1 answer

Can not save pandas dataframe to parquet with lists of floats as cell value

I have a dataframe with a structure like this: Coumn1 Coumn2 0 (0.00030271668219938874, 0.0002655923890415579... (0.0016430083196610212,…
white91wolf
  • 400
  • 4
  • 18
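
One common fix, as a sketch (the column names and values below are stand-ins, not the asker's exact data): convert the tuples to lists and pass an explicit list_ schema so pyarrow doesn't have to infer the nested type.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({
        "Column1": [(0.0003, 0.0002), (0.0016, 0.0007)],
        "Column2": [(0.1, 0.2), (0.3, 0.4)],
    })

    # Tuples often trip up type inference; plain lists map cleanly to pa.list_().
    for col in df.columns:
        df[col] = df[col].apply(list)

    schema = pa.schema([(c, pa.list_(pa.float64())) for c in df.columns])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pq.write_table(table, "out.parquet")
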
4
votes
2 answers

Write to parquet row by row in Python

I obtain messages in an async loop, and from each message I parse a row, which is a dictionary. I would like to write these rows into parquet. To implement this, I do the following: fields = [('A', pa.float64()), ('B', pa.float64()), ('C', pa.float64()),…
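
The usual pattern here is pq.ParquetWriter, which keeps the file open so batches can be appended as they arrive. A sketch, assuming pyarrow >= 7 (for RecordBatch.from_pylist); buffering several rows per write keeps row groups from becoming tiny:

    import pyarrow as pa
    import pyarrow.parquet as pq

    fields = [("A", pa.float64()), ("B", pa.float64()), ("C", pa.float64())]
    schema = pa.schema(fields)

    rows = [{"A": 1.0, "B": 2.0, "C": 3.0}]  # stand-in for the async message stream

    # Each write_batch call produces a row group, so buffer rows
    # rather than writing them one at a time.
    with pq.ParquetWriter("rows.parquet", schema) as writer:
        buffer = []
        for row in rows:
            buffer.append(row)
            if len(buffer) >= 1000:
                writer.write_batch(pa.RecordBatch.from_pylist(buffer, schema=schema))
                buffer = []
        if buffer:
            writer.write_batch(pa.RecordBatch.from_pylist(buffer, schema=schema))
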
4
votes
1 answer

Can I access a Parquet file via index without reading the entire file into memory?

I just read that HDF5 allows you to seek into data without reading the entire file into memory. Is this seeking behavior possible in Parquet files without Java (non-pyspark solutions)? I am using Parquet because of the strong dtype…
Kermit
  • 4,922
  • 4
  • 42
  • 74
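
Parquet doesn't support arbitrary row seeks, but pyarrow can read individual row groups and columns without touching the rest of the file, which gets close. A sketch (file and column names are placeholders):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("data.parquet")

    # Only the footer metadata is read here, not the data pages.
    print(pf.metadata.num_rows, pf.metadata.num_row_groups)

    # Load a single row group and a single column; the rest stays on disk.
    chunk = pf.read_row_group(0, columns=["some_column"])
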
4
votes
5 answers

How to update data in pyarrow table?

I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this: for col_name in table2.column_names: if col_name in my_columns: print('updating values in…
raphael75
  • 2,982
  • 4
  • 29
  • 44
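
pyarrow tables are immutable, so "updating" means computing a new column and swapping it in with set_column, which returns a new table. A minimal sketch (the multiply-by-10 transformation is made up for illustration):

    import pyarrow as pa
    import pyarrow.compute as pc

    table2 = pa.table({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
    my_columns = {"x"}

    for col_name in table2.column_names:
        if col_name in my_columns:
            idx = table2.schema.get_field_index(col_name)
            # set_column returns a new table; rebind rather than mutate.
            table2 = table2.set_column(idx, col_name,
                                       pc.multiply(table2[col_name], 10))
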
4
votes
0 answers

Dask Dataframe from parquet files: OSError: Couldn't deserialize thrift: TProtocolException: Invalid data

I'm generating a Dask dataframe to be used downstream in a clustering algorithm supplied by dask-ml. In a previous step in my pipeline I read a dataframe from disk using dask.dataframe.read_parquet, apply a transformation to add columns using…
Michael Wheeler
  • 849
  • 1
  • 10
  • 29
4
votes
0 answers

How does brotli achieve better parquet file compression on INT64 than INT32?

I ran a few experiments where I saved a DataFrame of random integers to parquet with brotli compression. One of my tests was to find the size ratio between storing as 32-bit integers vs 64-bit: df = pd.DataFrame( np.random.randint(0, 10000000,…
A. Rocke
  • 41
  • 2
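
A hypothetical reconstruction of that experiment (sizes and names are my own, not the asker's): write the same values at both widths with brotli compression and compare file sizes.

    import os

    import numpy as np
    import pandas as pd

    vals = np.random.randint(0, 10_000_000, size=1_000_000)

    # Same values stored at two widths, both brotli-compressed.
    for dtype in ("int32", "int64"):
        pd.DataFrame({"x": vals.astype(dtype)}).to_parquet(
            f"test_{dtype}.parquet", engine="pyarrow", compression="brotli"
        )

    print(os.path.getsize("test_int64.parquet")
          / os.path.getsize("test_int32.parquet"))
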
4
votes
1 answer

Writing Pandas df to Pyarrow Parquet table results in 'out of bounds' timestamp issue

I am receiving an out of bounds timestamp error message when attempting to convert a pandas dataframe to a pyarrow Table and write to a parquet dataset. From some researching, it seems to be a result of pandas using nanosecond precision and…
user9074332
  • 2,336
  • 2
  • 23
  • 39
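
A common workaround, sketched under the assumption that nanosecond precision is indeed the cause: coerce timestamps down to millisecond precision at write time and allow truncation.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"ts": pd.to_datetime(["2021-01-01 00:00:00.123456789"])})
    table = pa.Table.from_pandas(df)

    # Dropping to millisecond precision sidesteps the nanosecond-range
    # overflow; allow_truncated_timestamps suppresses the sub-ms error.
    pq.write_table(
        table,
        "out.parquet",
        coerce_timestamps="ms",
        allow_truncated_timestamps=True,
    )
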
4
votes
1 answer

pyarrow add column to pyarrow table

I have a pyarrow table named final_table of shape (6132, 7). I want to add a column to this table: list_ = ['IT'] * 6132 final_table.append_column('COUNTRY_ID', list_) but I am getting the following error: ArrowInvalid: Added column's length must match…
qaiser
  • 2,770
  • 2
  • 17
  • 29
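
Two things typically need fixing here (a sketch): append_column wants an Arrow array rather than a plain Python list, and it returns a new table instead of modifying in place.

    import pyarrow as pa

    final_table = pa.table({"x": list(range(6132))})

    # A bare Python list is treated as a list of chunks, hence the
    # length-mismatch error; wrap the values in an Arrow array instead,
    # and rebind the result since tables are immutable.
    country = pa.array(["IT"] * 6132)
    final_table = final_table.append_column("COUNTRY_ID", country)
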
4
votes
1 answer

How can I change the name of a column in a parquet file using Pyarrow?

I have several hundred parquet files created with PyArrow. Some of those files, however, have a field/column with a slightly different name (we'll call it Orange) than the original column (call it Sporange), because one used a variant of the query…
mbourgon
  • 1,286
  • 2
  • 17
  • 35
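
Parquet files can't be renamed in place, but reading, renaming, and rewriting is compact with rename_columns. A sketch, reusing the column names from the question:

    import pyarrow.parquet as pq

    table = pq.read_table("one_file.parquet")

    # rename_columns takes the complete list of names, in schema order.
    fixed = ["Sporange" if name == "Orange" else name
             for name in table.column_names]
    pq.write_table(table.rename_columns(fixed), "one_file_renamed.parquet")
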
4
votes
1 answer

How to properly create an apache plasma store from within a python program?

If I run the following as a program: import subprocess subprocess.run(['plasma_store -m 10000000000 -s /tmp/plasma'], shell=True, capture_output=True) and then run the program that uses the plasma store in a separate terminal, everything works…
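
A sketch of the usual fix: launch the store with subprocess.Popen (no shell) so it stays up in the background for the lifetime of the program. Note that plasma has been deprecated and removed in recent Arrow releases, so this applies to older pyarrow versions.

    import subprocess
    import time

    # Popen leaves the store running; run() blocks until the store exits.
    proc = subprocess.Popen(
        ["plasma_store", "-m", "10000000000", "-s", "/tmp/plasma"]
    )
    time.sleep(1)  # give the store a moment to start listening

    # ... connect clients to /tmp/plasma here ...

    proc.terminate()
    proc.wait()
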
4
votes
2 answers

Save date column with NAT(null) from pandas to parquet

I need to read nullable integer-format date values ('YYYYMMDD') into pandas and then save the dataframe to Parquet as Date32[Day] format, so that the Athena Glue Crawler classifier recognizes that column as a date. The code below does not…
Yun Ling
  • 113
  • 1
  • 8
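
One way to get there, as a sketch with made-up sample values: parse the integers through a nullable dtype, convert valid entries to Python dates, and build a date32 array with None for the NaT slots.

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    raw = pd.Series([20210101, None, 20211231], dtype="Int64")

    # errors='coerce' turns unparseable/missing entries into NaT.
    ts = pd.to_datetime(raw.astype("string"), format="%Y%m%d", errors="coerce")

    # date32 wants date objects (or None); NaT slots become nulls.
    dates = pa.array(
        [t.date() if pd.notna(t) else None for t in ts],
        type=pa.date32(),
    )
    pq.write_table(pa.table({"my_date": dates}), "dates.parquet")
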
4
votes
1 answer

Is it possible to read parquet files from S3 access point using pyarrow

It is possible to read parquet files from S3 as shown here or here. I am working with S3 access points. Having an S3 access point ARN, is it possible to read parquet files from it? I am trying with the following sample code: import s3fs import…
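
As far as I know, recent s3fs/botocore versions accept an access point ARN wherever a bucket name would go, so something like the sketch below may work. The ARN, account ID, and path are placeholders.

    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()

    # The access point ARN substitutes for the bucket name.
    path = ("arn:aws:s3:us-east-1:123456789012:accesspoint/my-access-point"
            "/prefix/data.parquet")
    table = pq.read_table(path, filesystem=fs)
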
4
votes
2 answers

How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv and always needed to use encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the encoding of the file but it still…
matthewmturner
  • 566
  • 7
  • 21
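
pyarrow.csv does expose this, just not as a top-level keyword: ReadOptions has an encoding field (UTF-8 by default). A sketch with a placeholder filename:

    import pyarrow.csv as pv

    # The input is transcoded from latin-1 to UTF-8 before parsing.
    table = pv.read_csv(
        "file.dat",
        read_options=pv.ReadOptions(encoding="latin-1"),
    )
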
4
votes
3 answers

Generating large DataFrame in a distributed way in pyspark efficiently (without pyspark.sql.Row)

The problem boils down to the following: I want to generate a DataFrame in pyspark using existing parallelized collection of inputs and a function which given one input can generate a relatively large batch of rows. In the example below I want to…
Alexander Pivovarov
  • 4,850
  • 1
  • 11
  • 34
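
One Arrow-backed way to do this in Spark 3+ is mapInPandas, which lets each input row fan out into an arbitrarily large pandas batch without constructing Row objects. A sketch with a made-up generator function:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    inputs = spark.createDataFrame([(i,) for i in range(100)], ["seed"])

    def expand(batches):
        # Each incoming batch of seeds yields a large generated frame;
        # Arrow moves the pandas data between the JVM and Python workers.
        for pdf in batches:
            for seed in pdf["seed"]:
                yield pd.DataFrame({"seed": seed, "value": range(10_000)})

    out = inputs.mapInPandas(expand, schema="seed long, value long")
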