Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation documentation.

595 questions
3
votes
3 answers

Log parquet filenames created by pyarrow on S3

We are appending data to an existing Parquet dataset stored in S3 (partitioned) by using pyarrow. This runs on AWS Lambda several times per hour. A minimal example would be: import pyarrow as pa import pyarrow.parquet as pq import s3fs df = ... #…
jarias
  • 150
  • 2
  • 12
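
A minimal sketch of one way to capture the file names pyarrow writes, assuming a recent pyarrow where pyarrow.dataset.write_dataset accepts a file_visitor callback and an fsspec filesystem; the bucket/prefix and column names below are placeholders, not from the question:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import s3fs

# Placeholder data standing in for the dataframe built inside the Lambda handler.
table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

fs = s3fs.S3FileSystem()
written_paths = []

ds.write_dataset(
    table,
    base_dir="my-bucket/my-dataset",  # placeholder bucket/prefix
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("part", pa.string())]), flavor="hive"),
    filesystem=fs,
    existing_data_behavior="overwrite_or_ignore",
    # file_visitor is called once per written file; record each file's path
    file_visitor=lambda written_file: written_paths.append(written_file.path),
)

print(written_paths)  # e.g. forward these to your logger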
3
votes
0 answers

python ray - pyarrow.lib.ArrowInvalid: Maximum size exceeded (2GB)

I am trying to load and process large files using ray. I am using ray for the purpose of multiprocessing the files and improving the speed of the solution. I keep running into this pyarrow error: pyarrow.lib.ArrowInvalid: Maximum size exceeded…
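
The 2 GB limit usually comes from the 32-bit offsets of a single string/binary array. One commonly suggested workaround is to cast such columns to their 64-bit-offset counterparts before the data passes through ray; a minimal sketch (column names are made up, and whether this helps depends on where ray serializes the data):

```python
import pyarrow as pa

def widen_string_columns(table: pa.Table) -> pa.Table:
    """Cast string/binary columns to large_string/large_binary so their
    offsets are 64-bit and a single column can hold more than 2 GB."""
    new_fields = []
    for field in table.schema:
        if pa.types.is_string(field.type):
            new_type = pa.large_string()
        elif pa.types.is_binary(field.type):
            new_type = pa.large_binary()
        else:
            new_type = field.type
        new_fields.append(pa.field(field.name, new_type, nullable=field.nullable))
    return table.cast(pa.schema(new_fields))

# Hypothetical usage: widen the table before handing it to ray.
table = pa.table({"text": ["some long text", "more text"], "n": [1, 2]})
table = widen_string_columns(table)
```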
3
votes
1 answer

Is there a size limitation/problem when writing the Apache Arrow format with the Java API?

My Arrow writer, which reads data from CSV files, works fine for data under 1 GB but gets stuck at about this limit (the writing code seems to block). I have given the process enough memory (-Xmx12g) and the data size is about 1.2 GB. A similar structured…
3
votes
1 answer

How to compress and decompress Arrow or Feather file?

I plan to change my data file format from Parquet to Feather. Parquet has compression options (lz4, etc.) and I have used them, but I cannot find them for Feather or Arrow files. Is compression not supported?
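
Feather v2 (the Arrow IPC file format) does support compression through pyarrow; a minimal sketch, assuming pyarrow was built with lz4/zstd support (file names are placeholders):

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"x": range(1000)})

# Write with explicit compression; supported codecs include "lz4" and "zstd".
feather.write_feather(df, "data.feather", compression="zstd", compression_level=5)

# Reading decompresses transparently; no extra arguments are needed.
df_back = feather.read_feather("data.feather")
```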
3
votes
1 answer

Arrow file size is the same as CSV?

I am trying to save a dataframe in .arrow format, mainly to get a smaller file size than CSV, so that I can use the file with vega-lite. I am using Python: import pandas import pyarrow as pa csv="C:/Users/mimoune.djouallah/data.csv" arrow…
Mim
  • 999
  • 10
  • 32
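
Arrow IPC/Feather files are written uncompressed by default, so their size can be comparable to (or larger than) the CSV. A minimal sketch of writing the same table with and without compression to compare sizes (the CSV path is a placeholder, not the one from the question):

```python
import os
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

df = pd.read_csv("data.csv")  # placeholder path
table = pa.Table.from_pandas(df)

feather.write_feather(table, "data_uncompressed.arrow", compression="uncompressed")
feather.write_feather(table, "data_zstd.arrow", compression="zstd")

for path in ("data.csv", "data_uncompressed.arrow", "data_zstd.arrow"):
    print(path, os.path.getsize(path), "bytes")
```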
3
votes
1 answer

What are the reasons for passing a pointer to a shared_ptr to a function?

I was looking at the C++ API of Apache's Arrow library, and noticed that it is littered with member functions that take arguments of type std::shared_ptr*. To me this looks unnecessarily contrived and possibly brittle, and it is frankly strange…
masaers
  • 697
  • 9
  • 21
3
votes
1 answer

Converting PySpark DataFrame to Pandas using Apache Arrow

I'd like to convert a PySpark DataFrame (pyspark.sql.DataFrame) to a Pandas dataframe. There is a built-in method toPandas() which is very inefficient (please read Wes McKinney's article about this issue from February 2017 here and his calculation…
ahoosh
  • 1,340
  • 3
  • 17
  • 31
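
A minimal sketch of enabling the Arrow-based conversion path; the config key spark.sql.execution.arrow.pyspark.enabled applies to Spark 3.x (Spark 2.x used spark.sql.execution.arrow.enabled), and the sample DataFrame is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()

# Let toPandas() transfer data as Arrow record batches instead of row-by-row pickling.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Optional: fall back to the non-Arrow path if the conversion fails.
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

sdf = spark.range(1_000_000).withColumnRenamed("id", "value")
pdf = sdf.toPandas()  # now uses Arrow under the hood
```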
2
votes
0 answers

Unable to link pyarrow header files into a C++ pybind11 project with CMakeLists.txt due to linkage error

So after reading the [Apache Arrow Python integration documentation](https://arrow.apache.org/docs/python/integration/extending.html), Arrow C++ no longer includes the PyArrow header files; those headers now ship with PyArrow itself. I am able to…
See
  • 31
  • 6
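
Since the headers now ship inside the installed pyarrow package, one common approach is to ask pyarrow itself where they live and feed those paths to the build system; a minimal sketch (how the paths get passed into CMakeLists.txt is up to the project, and create_library_symlinks may be unnecessary depending on the pyarrow version):

```python
import pyarrow as pa

# Directory containing the arrow/python/*.h headers shipped with the wheel.
print("include dir :", pa.get_include())
# Library names and directories needed at link time (e.g. arrow, arrow_python).
print("libraries   :", pa.get_libraries())
print("library dirs:", pa.get_library_dirs())

# Some setups also need plain .so names for the bundled shared libraries;
# pyarrow provides a helper that creates the symlinks.
pa.create_library_symlinks()
```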
2
votes
1 answer

R with package arrow sums results in NA at random

When using the arrow package with R and dplyr, summarizing a variable results in NA at random, even though there are no NAs in the data. Example: library(arrow) library(dplyr) td <- tempdir() tzip <- file.path(td,…
Eduardo Leoni
  • 8,991
  • 6
  • 42
  • 49
2
votes
1 answer

Partition events by time interval using Apache Arrow in Go

I'm trying to split some events that I'm collecting from Kafka based on a time interval. Basically, the goal is to read the value in the datetime column and run a simple formula to check whether the current event falls in the current interval. If yes, then…
spaghettifunk
  • 1,936
  • 4
  • 24
  • 46
2
votes
1 answer

Using the datatypes specified in datatype.go of the Golang Apache Arrow implementation for constructing a schema

I am learning Apache Arrow and wanted to learn more about how to create a schema and an Arrow record. For this I referenced some material, but so far all of it just uses the primitive types for building a schema, like this: schema :=…
A Beginner
  • 393
  • 2
  • 12
2
votes
2 answers

arrow R duration/difftime casting to float

I am working with a large set of datasets containing time series. My time-series data include an ID and a value for each day over several years (about 90 GB in total). What I am trying to do is merge (non-equi join) the time series with a small…
jmarkov
  • 191
  • 9
2
votes
1 answer

Multiple Arrow CSV readers on the same file return null

I'm trying to read the same file using multiple goroutines, where each goroutine is assigned a byte offset to start reading from and a number of lines to read, lineLimit. I was successful in doing so when the file fits in memory by setting the…
Mohamed Yasser
  • 641
  • 7
  • 17
2
votes
1 answer

How to write anonymous functions in R arrow across

I have opened a .parquet dataset through the open_dataset function of the arrow package. I want to use across to clean several numeric columns at a time. However, when I run this code: start_numeric_cols = "sum" sales <- sales %>% mutate( …
2
votes
1 answer

How to write an arrow dataset based on a data.table grouping?

I have a dataset called df with year, month and day variables. I would like to use the write_dataset function to output a folder with the standard Arrow dataset layout, as in the following image: Within each folder there will be month=1,…
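
The question is about the R arrow package, but the hive-style year=/month=/day= folder layout it describes can be illustrated with the pyarrow equivalent of write_dataset; a minimal sketch, where the column names come from the question and everything else is a placeholder:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder table standing in for df from the question.
table = pa.table({
    "year": [2020, 2020, 2021],
    "month": [1, 2, 1],
    "day": [1, 15, 3],
    "value": [1.0, 2.0, 3.0],
})

# Produces folders like out_dataset/year=2020/month=1/day=1/ containing parquet parts.
ds.write_dataset(
    table,
    base_dir="out_dataset",
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("year", pa.int64()), ("month", pa.int64()), ("day", pa.int64())]),
        flavor="hive",
    ),
)
```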