Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the Apache Arrow installation documentation

595 questions
3
votes
1 answer

How can I parse timestamp with time zone?

What I am trying to do: I am using PyArrow to parse data from a CSV (originally from a Postgres database). I am having issues parsing a timestamp (with a time zone) that looks like 2017-08-19 14:22:11.802755+00. I am then receiving an error that…
alt-f4
  • 2,112
  • 17
  • 49
3
votes
0 answers

Access individual elements of ChunkedArray by its index within column

What is the best method to randomly access individual elements ("Scalars") of arrow::ChunkedArray e.g. for testing and display purposes? Is there some equivalent method to Array::GetScalar which takes into account that the ChunkedArray consists of…
VolkerM
  • 316
  • 2
  • 5
3
votes
0 answers

R and RStudio crashes running read_parquet() on Mac M1

As the title states, both R and RStudio crash with a 'fatal error' when I try to run read_parquet('abc.parquet'). For reference, read_parquet() is a function from the arrow library. Using: MacBook Pro M1 2020, R version 4.1.0 (I…
gmarais
  • 1,801
  • 4
  • 16
  • 32
3
votes
1 answer

What is a common use case for Apache arrow in a data pipeline built in Spark

What is the purpose of Apache Arrow? It converts from one binary format to another, but why do I need that? If I have a Spark program, then Spark can read Parquet, so why do I need to convert it into another format midway through my processing? Is it…
Victor
  • 16,609
  • 71
  • 229
  • 409
3
votes
0 answers

Apache Arrow Flight: Multiple calls to FlightServer

I've been following this tutorial on how to set up and use Apache Arrow Flight. From the example, server.py: import pyarrow as pa import pyarrow.flight as fl def create_table_int(): data = [ pa.array([1, 2, 3]), pa.array([4, 5,…
ajp619
  • 670
  • 7
  • 11
3
votes
1 answer

call StructArray.from_arrays specifying a missing value mask

I'm trying to create a pyarrow.StructArray with missing values. It works fine when I use pyarrow.array passing tuples representing my records: >>> pyarrow.array( [ None, (1, "foo"), ], type=pyarrow.struct( …
0x26res
  • 11,925
  • 11
  • 54
  • 108
3
votes
1 answer

How do you tell the Apache Arrow Format Version for a given Library Version?

Apache Arrow in their documentation list that each release has two versions, a Library Version and a Format Version: https://arrow.apache.org/docs/format/Versioning.html It appears that over the last year there have been 4 Library Versions, but it's…
3
votes
1 answer

Error occurs when debugging rust program with vscode (windows only)

I am trying to debug the code below with vscode, but an error occurs. Development environment Microsoft Windows 10 Home 10.0.19042 Build 19042 rustc 1.49.0 (e1884a8e3 2020-12-29) Vscode 1.54.3 CodeLLDB v1.6.1 //…
3
votes
1 answer

How to correctly read an Apache Arrow Feather file produced by pyarrow?

I have been unsuccessful in reading, with the JavaScript library of Arrow, an Apache Arrow Feather file produced by a Python script. I am using pyarrow and arrow/js from the Apache Arrow project. I created a simple Python script to create the Feather…
ToniR
  • 33
  • 7
3
votes
1 answer

How to read column names and metadata from feather files in R arrow?

The (now-superseded) stand-alone feather library for R had a function called feather_metadata() that allowed reading column names and types from feather files on disk, without opening them. This was useful to select only specific columns when…
MatteoS
  • 745
  • 2
  • 6
  • 17
3
votes
1 answer

Comparison of protobuf and arrow

Both are language-neutral and platform-neutral data exchange libraries. I wonder what the differences between them are, and which library is suited to which situations.
Benjamin Du
  • 1,391
  • 1
  • 17
  • 25
3
votes
0 answers

Issue with writing Parquet Files via Arrow Package in R

Just wondering if there's a difference in the read/write parquet functions from the arrow package in R when running on Windows vs a Linux OS? Example code (insert anything in the dataframe): mydata = data.frame(...) write_parquet(mydata,…
3
votes
0 answers

Efficient way to calculate area of a 2D polygon in Pyspark for N rows in a group-by

I have a dataframe in pyspark (I get it from reading in a partition with around 1.6 million rows, but often I read in multiple partitions). For each week of data, there are ~200,000 different timestamps and for each timestamp there are up to 8…
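The per-group computation itself is just the shoelace formula; a minimal NumPy sketch of it that could be wrapped in a grouped pandas UDF (the Spark wiring and column names are not shown and would be assumptions):

```python
import numpy as np

def shoelace_area(xs: np.ndarray, ys: np.ndarray) -> float:
    """Area of a simple 2D polygon from ordered vertex coordinates
    (shoelace formula); works for an open or closed ring."""
    return 0.5 * abs(np.dot(xs, np.roll(ys, -1)) - np.dot(ys, np.roll(xs, -1)))

# Unit square:
area = shoelace_area(np.array([0.0, 1.0, 1.0, 0.0]),
                     np.array([0.0, 0.0, 1.0, 1.0]))
print(area)  # 1.0
```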
3
votes
0 answers

pyarrow convert string to dict array in table without going to pandas

I have a daily process where I read in a historical parquet dataset and then concatenate it with a new file each day. I'm trying to optimize memory by making better use of Arrow's dictionary arrays. I want to avoid doing a round trip to pandas…
matthewmturner
  • 566
  • 7
  • 21
3
votes
1 answer

Building Apache Arrow inside existing C++ Executable project CMAKE

I'm working on a C++ CMake project that uses Apache Arrow as a dependency. My goal is to be able to include and use arrow/api.h. However, I couldn't find any documentation or tutorial that explains how to achieve that, so my first thought…
eyadMhanna
  • 2,412
  • 3
  • 31
  • 49
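A minimal CMake sketch, assuming Arrow is already installed somewhere `find_package` can locate its CMake config (e.g. via a system package, conda, or vcpkg); the project and file names are placeholders:

```cmake
cmake_minimum_required(VERSION 3.16)
project(arrow_demo CXX)

# Assumes an existing Arrow installation whose CMake config files
# are discoverable (adjust CMAKE_PREFIX_PATH if not).
find_package(Arrow REQUIRED)

add_executable(arrow_demo main.cpp)
# Recent Arrow releases export namespaced imported targets;
# use Arrow::arrow_static for a static link instead.
target_link_libraries(arrow_demo PRIVATE Arrow::arrow_shared)
```

An alternative, if building Arrow from source inside the project is truly required, is `FetchContent` or `ExternalProject_Add`, but linking against an installed Arrow as above is the simpler path.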