Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions

0 answers

How to filter pyarrow.lib.Date32Value values in pyarrow?

For example, if you load a pyarrow ParquetDataset, you can get at the data, but is there an easy way of filtering it before converting to datetime.date? datetime.date is a Python object, so it would be good to have a fast way of cutting the data down…

pyarrow

asked Jun 24 '19 at 15:33

mathtick

1 answer

Converting Python sequence to Arrow Array via C++ API

I'm attempting to investigate how Arrow converts a Python list into an equivalent arrow::Array using the C++ API below. #include #include #include #include #include #include…

pyarrow apache-arrow

asked Jun 04 '19 at 14:08

clery00

1 answer

pyarrow hdfs reads more data than requested

I am using pyarrow's HdfsFilesystem interface. When I call a read on n bytes, I often get 0%-300% more data sent over the network. My suspicion is that pyarrow is reading ahead. The pyarrow parquet reader doesn't have this behavior, and I am…

python hdfs pyarrow

asked May 16 '19 at 20:31

Iva

1 answer

Unable to load Parquet files with the same column names but in a different order

Unable to load Parquet files with the same column names but in a different order. Scenario: ABD-MacBook-Pro:ttt abd$ tree . ├── testing1.paquet └── testing2.paquet I have two Parquet files as mentioned above. The column names are the same in both…

pandas python-3.6 pyarrow

asked May 13 '19 at 09:29

Naga Budigam

2 answers

What could be the explanation of this "pyarrow.lib.ArrowIOError: HDFS file does not exist" error when trying to read files in HDFS using Dask?

I'm using Dask Distributed and I'm trying to create a dataframe from a CSV stored in HDFS. I suppose the connection to HDFS is successful as I'm able to print the dataframe columns' names. However, I get the following error when I'm trying to use…

python dask dask-distributed pyarrow

asked Apr 30 '19 at 13:53

Sevy

1 answer

koalas pip install fails on pyarrow dependency

I tried installing Databricks' new koalas package using the recommended pip install koalas on but it failed on the pyarrow install. I then installed pyarrow and retried koalas, but it still failed on pyarrow. I visited the GitHub page which informed…

python pandas pyspark databricks pyarrow

asked Apr 25 '19 at 14:07

Frank B.

1 answer

What could be the explanation of this 'pyarrow.lib.ArrowIOError'?

I'm working on an HDP cluster and I'm trying to read a .csv file from HDFS using pyarrow. I am able to connect to HDFS and print information about the file using the info() function. But when it comes to reading the content of the file, I get a…

python hdfs pyarrow

asked Apr 16 '19 at 09:33

Sevy

1 answer

RuntimeError pyarrow not installed

I installed pyarrow 0.13.0 in a virtual environment on Ubuntu 16.04 using pip, and the installation succeeded, but whenever I call it, I get the error below.…

python parquet pyarrow

asked Apr 06 '19 at 20:57

Stella Ella

1 answer

Can Apache Arrow support infinitely nested structs?

According to this Apache Arrow documentation page, https://arrow.apache.org/docs/format/Metadata.html, it seems to be supported. Would someone post some code to show an infinitely nested struct, please? Thanks.

java c++ pyarrow apache-arrow

asked Mar 30 '19 at 12:36

Wrecker

1 answer

How do I set which libstdc++.so to be linked with libarrow.so?

I built libarrow.so and pyarrow from source using gcc 7.2 on Red Hat 7.4. Still, I am stuck with the following error, which seems to be caused by using different versions of gcc (4.8.5 vs. 7.2.0). [u0017649@sys-97675 ~]$ python Python 3.7.1 (default,…

undefined symbols pyarrow

asked Mar 26 '19 at 01:43

nasica88

0 answers

Unable to concat dataframes - MemoryError

I am having an issue concatenating two dataframes. The strange part is that it worked, but only once, the first time. After I made some "clever" changes (that I will discuss later), it did not work again and started raising a MemoryError. I…

python pandas dataframe parquet pyarrow

asked Mar 13 '19 at 10:49

Anonymous Person

1 answer

Performing transformations on Arrow table

What kinds of transformations can you apply to an Arrow table? Is its main use (for now) as an interchange format between languages?

bigdata pyarrow apache-arrow

asked Mar 08 '19 at 02:09

marz

0 answers

Cannot read (read_csv) from HDFS using Dask (FileNotFoundError: [Errno 2])

I have a cluster with Hadoop installed: hadoop version Hadoop 3.1.1.3.0.1.0-187 Source code repository git@github.com:hortonworks/hadoop.git -r 2820e4d6fc7ec31ac42187083ed5933c823e9784 Compiled by jenkins on 2018-09-19T10:19Z Compiled with protoc…

python hadoop dask pyarrow

asked Feb 12 '19 at 15:40

Mikhail_Sam

1 answer

Inconsistent schema when reading Parquet and exporting from Vertica

I've noticed weird behaviour when exporting data from Vertica and trying to read it later with parquet (Python). Let's say I want to dump a table to Parquet: EXPORT TO PARQUET (directory = '/data/table_name') over (partition by event_date) AS…

python export parquet vertica pyarrow

asked Feb 04 '19 at 13:53

Dmitriy Apollonin

1 answer

Performance issue with Impala table with merged parquet files

I have a Python utility that uses the pyarrow library to create multiple Parquet files for a single data set, because the data set for one day is huge. Each split Parquet file contains 10K row groups, and in the end we are…

apache-spark hadoop parquet impala pyarrow

asked Jan 28 '19 at 19:30

Ajay Kharade
