Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
6 votes, 1 answer

How does Apache Arrow facilitate "No overhead for cross-system communication"?

I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (through the lens of pyarrow) is that it describes the…
kemri • 149 • 12
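One way to make the zero-copy claim concrete from Python: pyarrow can expose an Arrow buffer to NumPy as a view rather than a copy. A minimal sketch (values are illustrative):

```python
import pyarrow as pa

# An Arrow array owns a contiguous buffer in the standardized columnar layout.
arr = pa.array([1, 2, 3], type=pa.int64())

# zero_copy_only=True makes to_numpy() raise if any copy were required,
# so success here means NumPy is reading Arrow's buffer in place.
view = arr.to_numpy(zero_copy_only=True)
print(view)  # [1 2 3]
```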
6 votes, 1 answer

How to write Parquet with user defined schema through pyarrow

When I execute the code below, I get the following error: ValueError: Table schema does not match schema used to create file. import pandas as pd import pyarrow as pa import pyarrow.parquet as pq fields = [ ('one', pa.int64()), ('two', pa.string(),…
Sachin Jain • 79 • 1 • 7
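This ValueError typically means the table handed to write_table() carries an inferred schema that differs from the one the ParquetWriter was opened with; casting the table to the writer's schema first lines them up. A hedged sketch using the excerpt's field names (data values and file name are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# User-defined schema matching the excerpt's field names.
schema = pa.schema([("one", pa.int64()), ("two", pa.string())])

table = pa.table({"one": [1, 2], "two": ["a", "b"]})

# ParquetWriter checks each table against the schema it was opened with;
# casting first avoids the "Table schema does not match" error.
with pq.ParquetWriter("out.parquet", schema) as writer:
    writer.write_table(table.cast(schema))
```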
6 votes, 3 answers

PyArrow: Store list of dicts in parquet using nested types

I want to store the following pandas data frame in a parquet file using PyArrow: import pandas as pd df = pd.DataFrame({'field': [[{}, {}]]}) The type of the field column is list of dicts: field 0 [{}, {}] I first define the corresponding…
SergiyKolesnikov • 7,369 • 2 • 26 • 47
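Empty dicts give pyarrow nothing to infer a struct type from, so spelling out the nested type usually resolves this. A sketch assuming the structs carry a single hypothetical int field a:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"field": [[{"a": 1}, {"a": 2}]]})

# Declare the column as list<struct<a: int64>> instead of relying on inference.
schema = pa.schema([("field", pa.list_(pa.struct([("a", pa.int64())])))])

table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, "nested.parquet")
```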
6 votes, 1 answer

Sharing objects across workers using pyarrow

I would like to give multiple worker processes created by multiprocessing.Pool.map() read-only access to a shared DataFrame. I would like to avoid copying and pickling. I understood that pyarrow can be used for that. However, I find their…
Konstantin • 2,451 • 1 • 24 • 26
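One copy- and pickle-free pattern is to write the data once in Arrow IPC format and have each pool worker memory-map it; the kernel then shares the pages read-only across processes. A sketch (the path and data are illustrative; tmpfs works well for the file):

```python
import multiprocessing as mp
import pyarrow as pa

PATH = "/tmp/shared.arrow"  # hypothetical location

def worker(path):
    # memory_map + open_file gives a zero-copy, read-only view of the table.
    with pa.memory_map(path) as source:
        table = pa.ipc.open_file(source).read_all()
    return table.num_rows

if __name__ == "__main__":
    table = pa.table({"x": list(range(1000))})
    with pa.OSFile(PATH, "wb") as f, pa.ipc.new_file(f, table.schema) as writer:
        writer.write_table(table)
    with mp.Pool(4) as pool:
        print(pool.map(worker, [PATH] * 4))  # [1000, 1000, 1000, 1000]
```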
6 votes, 2 answers

Pyarrow read/write from s3

Is it possible to read and write Parquet files from one folder to another folder in S3 using pyarrow, without converting to pandas? Here is my code: import pyarrow.parquet as pq import pyarrow as pa import s3fs s3 = s3fs.S3FileSystem() bucket =…
thotam • 941 • 2 • 16 • 31
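pq.read_table() and pq.write_table() both accept a filesystem argument, so an S3-to-S3 copy never has to pass through pandas. A sketch with hypothetical bucket paths:

```python
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# Read the source file straight into an Arrow table...
table = pq.read_table("my-bucket/input/data.parquet", filesystem=s3)

# ...and write it to the destination, with no pandas conversion in between.
pq.write_table(table, "my-bucket/output/data.parquet", filesystem=s3)
```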
6 votes, 2 answers

How to install pyarrow on an Alpine Docker image?

I am trying to install pyarrow using pip in my Alpine Docker image, but pip is unable to find the package. I'm using the following Dockerfile: FROM python:3.6-alpine3.7 RUN apk add --no-cache musl-dev linux-headers g++ RUN pip install…
thotam • 941 • 2 • 16 • 31
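pip cannot find pyarrow here because the published manylinux wheels target glibc, not Alpine's musl, so pip would have to build Arrow C++ from source. One common workaround, sketched below, is simply to switch to a glibc-based image where a binary wheel is available:

```dockerfile
# Alpine's musl libc cannot use pyarrow's manylinux wheels; a Debian-based
# ("slim") image lets pip install a prebuilt binary instead of compiling.
FROM python:3.6-slim
RUN pip install --no-cache-dir pyarrow
```

If Alpine is a hard requirement, the remaining option is building the Arrow C++ libraries from source inside the image before pip-installing pyarrow.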
6 votes, 2 answers

Reading/writing pyarrow tensors from/to parquet files

In pyarrow, what is the suggested way of writing a pyarrow.Tensor (e.g. created from a numpy.ndarray) to a Parquet file? Is it even possible without having to go through pyarrow.Table and pandas.DataFrame?
Martin Studer • 2,213 • 1 • 18 • 23
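Parquet's data model has no tensor type, so writing a pyarrow.Tensor means either flattening it into table columns or switching formats. If the Parquet requirement is negotiable, the Arrow IPC format handles tensors natively; a sketch (file name is illustrative):

```python
import numpy as np
import pyarrow as pa

tensor = pa.Tensor.from_numpy(np.arange(12).reshape(3, 4))

# The Arrow IPC format stores a tensor directly, skipping
# pyarrow.Table and pandas.DataFrame entirely.
with pa.OSFile("tensor.arrow", "wb") as sink:
    pa.ipc.write_tensor(tensor, sink)

with pa.memory_map("tensor.arrow") as source:
    restored = pa.ipc.read_tensor(source)

print(restored.to_numpy())
```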
5 votes, 1 answer

How to use Apache Arrow IPC from multiple processes (possibly from different languages)?

I'm not sure where to begin, so looking for some guidance. I'm looking for a way to create some arrays/tables in one process, and have it accessible (read-only) from another. So I create a pyarrow.Table like this: a1 = pa.array(list(range(3))) a2 =…
suvayu • 4,271 • 2 • 29 • 35
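Continuing the excerpt's setup, one language-agnostic route is the IPC file format on disk (or tmpfs): any Arrow implementation can memory-map and read it without copies. A sketch of both sides (the path is illustrative):

```python
import pyarrow as pa

# Producer process: build the table and write it once in IPC file format.
a1 = pa.array(list(range(3)))
table = pa.table({"a1": a1})
with pa.OSFile("/tmp/shared_table.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Consumer process (Python here, but C++/Java/R can read the same file):
with pa.memory_map("/tmp/shared_table.arrow") as source:
    shared = pa.ipc.open_file(source).read_all()
print(shared.column("a1"))
```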
5 votes, 0 answers

ArrowInvalid: GetFileInfo() yielded path which is outside base dir parquet

I have a Parquet dataset stored in my S3 bucket with multiple partition files. I want to read it into my pandas DataFrame, but I am getting this ArrowInvalid error where I didn't before. Occasionally, this data has been overwritten with some previous…
Wassadamo • 1,176 • 12 • 32
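This error often shows up when an overwrite left stray objects next to the current partition files, so directory discovery finds paths it does not expect. A hedged workaround is to enumerate exactly the files you want and hand pq.ParquetDataset the explicit list (the bucket layout is hypothetical):

```python
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# List only the partition files we actually want, skipping leftovers
# from earlier overwrites that confuse directory discovery.
paths = s3.glob("my-bucket/dataset/**/*.parquet")

df = pq.ParquetDataset(paths, filesystem=s3).read().to_pandas()
```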
5 votes, 1 answer

Is it possible to append rows to an existing Arrow (PyArrow) Table?

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators, it is said that Table columns in Arrow C++ can be chunked, so that appending to a…
astrojuanlu • 6,744 • 8 • 45 • 105
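Appending in place is indeed impossible, but the chunking mentioned in the excerpt makes the workaround cheap: concatenating tables copies no data, since the result's columns just reference the existing buffers. A minimal sketch:

```python
import pyarrow as pa

t1 = pa.table({"x": [1, 2]})
t2 = pa.table({"x": [3]})

# concat_tables copies no data: each column of the result is a ChunkedArray
# whose chunks are the original (immutable) arrays.
combined = pa.concat_tables([t1, t2])
print(combined.column("x").num_chunks)  # 2
print(combined.to_pydict())             # {'x': [1, 2, 3]}
```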
5 votes, 2 answers

Error Loading DataFrame to BigQuery Table (pyarrow.lib.ArrowTypeError: object of type cannot be converted to int)

I have a CSV stored in GCS which I want to load into a BigQuery table. But I need to do some pre-processing first, so I load it into a DataFrame and later load it to the BigQuery table. import pandas as pd import json from google.cloud import…
emp • 602 • 3 • 11 • 22
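This ArrowTypeError usually means an object column mixes types (e.g. strings and NaNs where BigQuery expects INTEGER). Coercing the column and passing an explicit schema removes the ambiguity; a sketch with a hypothetical user_id column and table id (a real GCP project and credentials are assumed):

```python
import pandas as pd
from google.cloud import bigquery

# Stand-in for the CSV loaded from GCS; the object column mixes strings and NaN.
df = pd.DataFrame({"user_id": ["123", "456", None]})

# Coerce to a nullable integer so pyarrow's conversion to the INTEGER
# field cannot fail on stray strings or missing values.
df["user_id"] = pd.to_numeric(df["user_id"], errors="coerce").astype("Int64")

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("user_id", "INTEGER")]
)
client.load_table_from_dataframe(
    df, "project.dataset.table", job_config=job_config  # hypothetical table id
).result()
```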
5 votes, 3 answers

Add new column to a HuggingFace dataset

The dataset has 5000000 rows, and I would like to add a column called 'embeddings' to it. dataset = dataset.add_column('embeddings', embeddings) The variable embeddings is a numpy memmap array of size (5000000, 512). But I get this…
albero • 169 • 2 • 9
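add_column() expects a list-like column, and a raw numpy memmap can trip Arrow's type inference. A hedged sketch that converts rows to plain lists first (toy sizes stand in for the real (5000000, 512) memmap):

```python
import numpy as np
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["a", "b"]})  # stand-in dataset
embeddings = np.zeros((2, 512), dtype="float32")   # stand-in for the memmap

# Hand add_column plain Python lists rather than the memmap itself.
dataset = dataset.add_column("embeddings", [row.tolist() for row in embeddings])
print(dataset)
```

At 5,000,000 rows this materializes everything in memory, so processing the memmap in shards may still be necessary.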
5 votes, 2 answers

PyArrow: How to copy files from local to remote using new filesystem interface?

Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)? I have read the documentation back and forth, and tried a few things out…
Andor • 5,523 • 5 • 26 • 24
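pyarrow.fs.copy_files() does the open/read/write loop across two filesystems for you. A sketch with a hypothetical namenode and paths:

```python
from pyarrow import fs

local = fs.LocalFileSystem()
hdfs = fs.HadoopFileSystem("namenode", port=8020)  # hypothetical endpoint

# Equivalent of hdfs dfs -copyFromLocal: stream the local file to HDFS.
fs.copy_files(
    "/local/path/data.parquet",
    "/user/me/data.parquet",
    source_filesystem=local,
    destination_filesystem=hdfs,
)
```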
5 votes, 1 answer

dask read parquet and specify schema

Is there a dask equivalent of Spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow? I have a bunch of parquet files in a bucket, but some of the fields have slightly inconsistent names. I could…
Ray Bell • 1,508 • 4 • 18 • 45
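dask's pyarrow engine does accept a schema keyword (in reasonably recent dask versions), which pins the types every partition is decoded with; note that it aligns types, so genuinely different field names still need renaming. A sketch with a hypothetical schema and bucket:

```python
import dask.dataframe as dd
import pyarrow as pa

# Hypothetical unified schema for files whose types drift slightly.
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

ddf = dd.read_parquet(
    "s3://my-bucket/data/*.parquet",
    engine="pyarrow",
    schema=schema,  # every partition is read against this schema
)
```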
5 votes, 1 answer

pandas data types changed when reading from parquet file?

I am brand new to pandas and the Parquet file type. I have a Python script that: reads in an HDFS Parquet file, converts it to a pandas DataFrame, loops through specific columns and changes some values, and writes the DataFrame back to a Parquet file. Then…
raphael75 • 2,982 • 4 • 29 • 44
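A classic cause: an integer column containing missing values comes back from the round trip as float64. pandas' nullable Int64 dtype survives, because pyarrow records pandas metadata in the file; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, None]})  # float64, because of the missing value

# Nullable Int64 keeps integers + missing values through the round trip.
df["n"] = df["n"].astype("Int64")
df.to_parquet("roundtrip.parquet", engine="pyarrow")

print(pd.read_parquet("roundtrip.parquet", engine="pyarrow").dtypes)  # n: Int64
```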