Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
6 votes, 1 answer

How does Apache Arrow facilitate "No overhead for cross-system communication"?

I've been very interested in Apache Arrow for a bit now due to the promises of "zero copy reads", "zero serde", and "No overhead for cross-system communication". My understanding of the project (through the lens of pyarrow) is that it describes the…
kemri • 149 • 12
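One way to make the zero-copy claim concrete from Python: pyarrow can expose an Arrow buffer to NumPy as a view rather than a copy. A minimal sketch (values are illustrative):

```python
import pyarrow as pa

# An Arrow array owns a contiguous buffer in the standardized columnar layout.
arr = pa.array([1, 2, 3], type=pa.int64())

# zero_copy_only=True makes to_numpy() raise if any copy were required,
# so success here means NumPy is reading Arrow's buffer in place.
view = arr.to_numpy(zero_copy_only=True)
print(view)  # [1 2 3]
```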
6 votes, 1 answer

How to write Parquet with user defined schema through pyarrow

When I execute the code below, I get the following error: ValueError: Table schema does not match schema used to create file. import pandas as pd import pyarrow as pa import pyarrow.parquet as pq fields = [ ('one', pa.int64()), ('two', pa.string(),…
Sachin Jain • 79 • 1 • 7
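This ValueError typically means the table handed to write_table() carries an inferred schema that differs from the one the ParquetWriter was opened with; casting the table to the writer's schema first lines them up. A hedged sketch using the excerpt's field names (data values and file name are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# User-defined schema matching the excerpt's field names.
schema = pa.schema([("one", pa.int64()), ("two", pa.string())])

table = pa.table({"one": [1, 2], "two": ["a", "b"]})

# ParquetWriter checks each table against the schema it was opened with;
# casting first avoids the "Table schema does not match" error.
with pq.ParquetWriter("out.parquet", schema) as writer:
    writer.write_table(table.cast(schema))
```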
6 votes, 3 answers

PyArrow: Store list of dicts in parquet using nested types

I want to store the following pandas data frame in a parquet file using PyArrow: import pandas as pd df = pd.DataFrame({'field': [[{}, {}]]}) The type of the field column is list of dicts: field 0 [{}, {}] I first define the corresponding…
SergiyKolesnikov • 7,369 • 2 • 26 • 47
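Empty dicts give pyarrow nothing to infer a struct type from, so spelling out the nested type usually resolves this. A sketch assuming the structs carry a single hypothetical int field a:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"field": [[{"a": 1}, {"a": 2}]]})

# Declare the column as list<struct<a: int64>> instead of relying on inference.
schema = pa.schema([("field", pa.list_(pa.struct([("a", pa.int64())])))])

table = pa.Table.from_pandas(df, schema=schema)
pq.write_table(table, "nested.parquet")
```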
6 votes, 1 answer

Sharing objects across workers using pyarrow

I would like to give multiple worker processes created by multiprocessing.Pool.map() read-only access to a shared DataFrame. I would like to avoid copying and pickling. I understood that pyarrow can be used for that. However, I find their…
Konstantin • 2,451 • 1 • 24 • 26
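One copy- and pickle-free pattern is to write the data once in Arrow IPC format and have each pool worker memory-map it; the kernel then shares the pages read-only across processes. A sketch (the path and data are illustrative; tmpfs works well for the file):

```python
import multiprocessing as mp
import pyarrow as pa

PATH = "/tmp/shared.arrow"  # hypothetical location

def worker(path):
    # memory_map + open_file gives a zero-copy, read-only view of the table.
    with pa.memory_map(path) as source:
        table = pa.ipc.open_file(source).read_all()
    return table.num_rows

if __name__ == "__main__":
    table = pa.table({"x": list(range(1000))})
    with pa.OSFile(PATH, "wb") as f, pa.ipc.new_file(f, table.schema) as writer:
        writer.write_table(table)
    with mp.Pool(4) as pool:
        print(pool.map(worker, [PATH] * 4))  # [1000, 1000, 1000, 1000]
```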
6 votes, 2 answers

Pyarrow read/write from s3

Is it possible to read and write Parquet files from one folder to another folder in S3 using pyarrow, without converting to pandas? Here is my code: import pyarrow.parquet as pq import pyarrow as pa import s3fs s3 = s3fs.S3FileSystem() bucket =…
thotam • 941 • 2 • 16 • 31
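pq.read_table() and pq.write_table() both accept a filesystem argument, so an S3-to-S3 copy never has to pass through pandas. A sketch with hypothetical bucket paths:

```python
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# Read the source file straight into an Arrow table...
table = pq.read_table("my-bucket/input/data.parquet", filesystem=s3)

# ...and write it to the destination, with no pandas conversion in between.
pq.write_table(table, "my-bucket/output/data.parquet", filesystem=s3)
```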
6 votes, 2 answers

How to install pyarrow on an Alpine Docker image?

I am trying to install pyarrow using pip in my Alpine Docker image, but pip is unable to find the package. I'm using the following Dockerfile: FROM python:3.6-alpine3.7 RUN apk add --no-cache musl-dev linux-headers g++ RUN pip install…
thotam • 941 • 2 • 16 • 31
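pip cannot find pyarrow here because the published manylinux wheels target glibc, not Alpine's musl, so pip would have to build Arrow C++ from source. One common workaround, sketched below, is simply to switch to a glibc-based image where a binary wheel is available:

```dockerfile
# Alpine's musl libc cannot use pyarrow's manylinux wheels; a Debian-based
# ("slim") image lets pip install a prebuilt binary instead of compiling.
FROM python:3.6-slim
RUN pip install --no-cache-dir pyarrow
```

If Alpine is a hard requirement, the remaining option is building the Arrow C++ libraries from source inside the image before pip-installing pyarrow.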
6 votes, 2 answers

Reading/writing pyarrow tensors from/to parquet files

In pyarrow, what is the suggested way of writing a pyarrow.Tensor (e.g. created from a numpy.ndarray) to a Parquet file? Is it even possible without having to go through pyarrow.Table and pandas.DataFrame?
Martin Studer • 2,213 • 1 • 18 • 23
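Parquet's data model has no tensor type, so writing a pyarrow.Tensor means either flattening it into table columns or switching formats. If the Parquet requirement is negotiable, the Arrow IPC format handles tensors natively; a sketch (file name is illustrative):

```python
import numpy as np
import pyarrow as pa

tensor = pa.Tensor.from_numpy(np.arange(12).reshape(3, 4))

# The Arrow IPC format stores a tensor directly, skipping
# pyarrow.Table and pandas.DataFrame entirely.
with pa.OSFile("tensor.arrow", "wb") as sink:
    pa.ipc.write_tensor(tensor, sink)

with pa.memory_map("tensor.arrow") as source:
    restored = pa.ipc.read_tensor(source)

print(restored.to_numpy())
```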
5 votes, 1 answer

How to use Apache Arrow IPC from multiple processes (possibly from different languages)?

I'm not sure where to begin, so looking for some guidance. I'm looking for a way to create some arrays/tables in one process, and have it accessible (read-only) from another. So I create a pyarrow.Table like this: a1 = pa.array(list(range(3))) a2 =…
suvayu • 4,271 • 2 • 29 • 35
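Continuing the excerpt's setup, one language-agnostic route is the IPC file format on disk (or tmpfs): any Arrow implementation can memory-map and read it without copies. A sketch of both sides (the path is illustrative):

```python
import pyarrow as pa

# Producer process: build the table and write it once in IPC file format.
a1 = pa.array(list(range(3)))
table = pa.table({"a1": a1})
with pa.OSFile("/tmp/shared_table.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Consumer process (Python here, but C++/Java/R can read the same file):
with pa.memory_map("/tmp/shared_table.arrow") as source:
    shared = pa.ipc.open_file(source).read_all()
print(shared.column("a1"))
```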
5 votes, 0 answers

ArrowInvalid: GetFileInfo() yielded path which is outside base dir parquet

I have a Parquet dataset stored in my S3 bucket with multiple partition files. I want to read it into my pandas DataFrame, but I am getting this ArrowInvalid error where I didn't before. Occasionally, this data has been overwritten with some previous…
Wassadamo • 1,176 • 12 • 32
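This error often shows up when an overwrite left stray objects next to the current partition files, so directory discovery finds paths it does not expect. A hedged workaround is to enumerate exactly the files you want and hand pq.ParquetDataset the explicit list (the bucket layout is hypothetical):

```python
import pyarrow.parquet as pq
import s3fs

s3 = s3fs.S3FileSystem()

# List only the partition files we actually want, skipping leftovers
# from earlier overwrites that confuse directory discovery.
paths = s3.glob("my-bucket/dataset/**/*.parquet")

df = pq.ParquetDataset(paths, filesystem=s3).read().to_pandas()
```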
5 votes, 1 answer

Is it possible to append rows to an existing Arrow (PyArrow) Table?

I am aware that "Many Arrow objects are immutable: once constructed, their logical properties cannot change anymore" (docs). In this blog post by one of the Arrow creators, it is said that Table columns in Arrow C++ can be chunked, so that appending to a…
astrojuanlu • 6,744 • 8 • 45 • 105
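Appending in place is indeed impossible, but the chunking mentioned in the excerpt makes the workaround cheap: concatenating tables copies no data, since the result's columns just reference the existing buffers. A minimal sketch:

```python
import pyarrow as pa

t1 = pa.table({"x": [1, 2]})
t2 = pa.table({"x": [3]})

# concat_tables copies no data: each column of the result is a ChunkedArray
# whose chunks are the original (immutable) arrays.
combined = pa.concat_tables([t1, t2])
print(combined.column("x").num_chunks)  # 2
print(combined.to_pydict())             # {'x': [1, 2, 3]}
```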
5 votes, 2 answers

Error Loading DataFrame to BigQuery Table (pyarrow.lib.ArrowTypeError: object of type cannot be converted to int)

I have a CSV stored in GCS which I want to load into a BigQuery table. But I need to do some pre-processing first, so I load it into a DataFrame and later load it to the BigQuery table. import pandas as pd import json from google.cloud import…
emp • 602 • 3 • 11 • 22
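This ArrowTypeError usually means an object column mixes types (e.g. strings and NaNs where BigQuery expects INTEGER). Coercing the column and passing an explicit schema removes the ambiguity; a sketch with a hypothetical user_id column and table id (a real GCP project and credentials are assumed):

```python
import pandas as pd
from google.cloud import bigquery

# Stand-in for the CSV loaded from GCS; the object column mixes strings and NaN.
df = pd.DataFrame({"user_id": ["123", "456", None]})

# Coerce to a nullable integer so pyarrow's conversion to the INTEGER
# field cannot fail on stray strings or missing values.
df["user_id"] = pd.to_numeric(df["user_id"], errors="coerce").astype("Int64")

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField("user_id", "INTEGER")]
)
client.load_table_from_dataframe(
    df, "project.dataset.table", job_config=job_config  # hypothetical table id
).result()
```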
5 votes, 3 answers

Add new column to a HuggingFace dataset

The dataset has 5000000 rows, and I would like to add a column called 'embeddings' to it. dataset = dataset.add_column('embeddings', embeddings) The variable embeddings is a numpy memmap array of size (5000000, 512). But I get this…
albero • 169 • 2 • 9
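add_column() expects a list-like column, and a raw numpy memmap can trip Arrow's type inference. A hedged sketch that converts rows to plain lists first (toy sizes stand in for the real (5000000, 512) memmap):

```python
import numpy as np
from datasets import Dataset

dataset = Dataset.from_dict({"text": ["a", "b"]})  # stand-in dataset
embeddings = np.zeros((2, 512), dtype="float32")   # stand-in for the memmap

# Hand add_column plain Python lists rather than the memmap itself.
dataset = dataset.add_column("embeddings", [row.tolist() for row in embeddings])
print(dataset)
```

At 5,000,000 rows this materializes everything in memory, so processing the memmap in shards may still be necessary.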
5 votes, 2 answers

PyArrow: How to copy files from local to remote using new filesystem interface?

Could somebody give me a hint on how I can copy a file from a local filesystem to an HDFS filesystem using PyArrow's new filesystem interface (i.e. upload, copyFromLocal)? I have read the documentation back and forth, and tried a few things out…
Andor • 5,523 • 5 • 26 • 24
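pyarrow.fs.copy_files() does the open/read/write loop across two filesystems for you. A sketch with a hypothetical namenode and paths:

```python
from pyarrow import fs

local = fs.LocalFileSystem()
hdfs = fs.HadoopFileSystem("namenode", port=8020)  # hypothetical endpoint

# Equivalent of hdfs dfs -copyFromLocal: stream the local file to HDFS.
fs.copy_files(
    "/local/path/data.parquet",
    "/user/me/data.parquet",
    source_filesystem=local,
    destination_filesystem=hdfs,
)
```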
5 votes, 1 answer

dask read parquet and specify schema

Is there a dask equivalent of Spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow? I have a bunch of parquet files in a bucket, but some of the fields have slightly inconsistent names. I could…
Ray Bell • 1,508 • 4 • 18 • 45
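dask's pyarrow engine does accept a schema keyword (in reasonably recent dask versions), which pins the types every partition is decoded with; note that it aligns types, so genuinely different field names still need renaming. A sketch with a hypothetical schema and bucket:

```python
import dask.dataframe as dd
import pyarrow as pa

# Hypothetical unified schema for files whose types drift slightly.
schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])

ddf = dd.read_parquet(
    "s3://my-bucket/data/*.parquet",
    engine="pyarrow",
    schema=schema,  # every partition is read against this schema
)
```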
5 votes, 1 answer

pandas data types changed when reading from parquet file?

I am brand new to pandas and the Parquet file type. I have a Python script that: reads in an HDFS Parquet file, converts it to a pandas DataFrame, loops through specific columns and changes some values, and writes the DataFrame back to a Parquet file. Then…
raphael75 • 2,982 • 4 • 29 • 44
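A classic cause: an integer column containing missing values comes back from the round trip as float64. pandas' nullable Int64 dtype survives, because pyarrow records pandas metadata in the file; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, None]})  # float64, because of the missing value

# Nullable Int64 keeps integers + missing values through the round trip.
df["n"] = df["n"].astype("Int64")
df.to_parquet("roundtrip.parquet", engine="pyarrow")

print(pd.read_parquet("roundtrip.parquet", engine="pyarrow").dtypes)  # n: Int64
```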