Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
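As a rough illustration of what "columnar" means in the description above, the sketch below uses only Python's standard-library `array` module, not pyarrow itself. The field names `id` and `price` and the sample values are made up for illustration. Arrow's actual format also defines validity bitmaps, offsets buffers for nested data, and schema metadata, none of which is modeled here.

```python
from array import array

# Row-oriented layout: one Python object per record (hypothetical sample data).
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented layout: one contiguous, fixed-width typed buffer per field.
columns = {
    "id": array("q", (r["id"] for r in rows)),        # 64-bit signed integers
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
}

# An aggregate over one field scans a single buffer and never touches
# the other fields, which is the access pattern analytic engines favor.
total_price = sum(columns["price"])
```

The design point is that each field lives in its own typed buffer, so a scan over `price` reads only `price` values rather than stepping through whole records.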

1078 questions
0 votes, 0 answers

How to filter pyarrow.lib.Date32Value values?

For example, if you load a pyarrow ParquetDataset, you can get at the data, but is there an easy way of filtering it before converting to datetime.date? datetime.date is a Python object, so it would be good to have a fast way of cutting the data down…
mathtick
0 votes, 1 answer

Converting Python sequence to Arrow Array via C++ API

I'm attempting to investigate how Arrow converts a Python list into an equivalent arrow::Array using the C++ API below. #include #include #include #include #include #include…
clery00
0 votes, 1 answer

pyarrow hdfs reads more data than requested

I am using pyarrow's HdfsFilesystem interface. When I call a read on n bytes, I often get 0%-300% more data sent over the network. My suspicion is that pyarrow is reading ahead. The pyarrow parquet reader doesn't have this behavior, and I am…
Iva
0 votes, 1 answer

Unable to load parquet files with the same column names but in a different order

Unable to load parquet files with the same column names but in a different order. Scenario: ABD-MacBook-Pro:ttt abd$ tree . ├── testing1.paquet └── testing2.paquet I have two parquet files as mentioned above. The column names are the same in both…
Naga Budigam
0 votes, 2 answers

What could be the explanation of this "pyarrow.lib.ArrowIOError: HDFS file does not exist" error when trying to read files in HDFS using Dask?

I'm using Dask Distributed and I'm trying to create a dataframe from a CSV stored in HDFS. I suppose the connection to HDFS is successful as I'm able to print the dataframe columns' names. However, I get the following error when I'm trying to use…
Sevy
0 votes, 1 answer

koalas pip install fails on pyarrow dependency

I tried installing Databricks' new koalas package using the recommended pip install koalas on but it failed on the pyarrow install. I then installed pyarrow and retried koalas, but it still failed on pyarrow. I visited the GitHub page which informed…
Frank B.
0 votes, 1 answer

What could be the explanation of this 'pyarrow.lib.ArrowIOError'?

I'm working on an HDP cluster and I'm trying to read a .csv file from HDFS using pyarrow. I am able to connect to HDFS and print information about the file using the info() function. But when it comes to reading the content of the file, I get a…
Sevy
0 votes, 1 answer

RuntimeError: pyarrow not installed

I installed pyarrow 0.13.0 in a virtual environment on Ubuntu 16.04 using pip, and the installation succeeded, but whenever I call it, I get the error below.…
Stella Ella
0 votes, 1 answer

Can Apache Arrow support infinitely nested structs?

This Apache Arrow documentation page, https://arrow.apache.org/docs/format/Metadata.html, seems to say it does. Would someone please post some code showing an infinitely nested struct? Thanks.
0 votes, 1 answer

How do I set which libstdc++.so to be linked with libarrow.so?

I built libarrow.so and pyarrow from source using gcc 7.2 on Red Hat 7.4. Still, I am stuck with the following error, which seems to be caused by using different versions of gcc (4.8.5 vs. 7.2.0). [u0017649@sys-97675 ~]$ python Python 3.7.1 (default,…
nasica88
0 votes, 0 answers

Unable to concat dataframes - MemoryError

I am having an issue concatenating two dataframes. The strange part is that it worked, but only once, the first time. After I made some "clever" changes (that I will discuss later), it did not work again and started raising a MemoryError. I…
Anonymous Person
0 votes, 1 answer

Performing transformations on Arrow table

What kinds of transformations can you apply to an Arrow table? Is its main use (for now) as an interchange format between languages?
marz
0 votes, 0 answers

Cannot read (read_csv) from HDFS using Dask (FileNotFoundError: [Errno 2])

I have a cluster with Hadoop installed: hadoop version Hadoop 3.1.1.3.0.1.0-187 Source code repository git@github.com:hortonworks/hadoop.git -r 2820e4d6fc7ec31ac42187083ed5933c823e9784 Compiled by jenkins on 2018-09-19T10:19Z Compiled with protoc…
Mikhail_Sam
0 votes, 1 answer

Inconsistent schema when reading parquet and exporting from Vertica

I've noticed weird behaviour when exporting data from Vertica and trying to read it later with parquet (Python). Let's say I want to dump a table to parquet: EXPORT TO PARQUET (directory = '/data/table_name') over (partition by event_date) AS…
Dmitriy Apollonin
0 votes, 1 answer

Performance issue with Impala table with merged parquet files

I have a Python utility that uses the pyarrow library to create multiple parquet files for a single data set, because one day's data is huge. Each split parquet file contains 10K row groups, and in the end we are…
Ajay Kharade