Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the Apache Arrow installation guide.

595 questions
2 votes · 1 answer

Arrow Unsigned S3 Requests

I'm hoping to read a multi-file dataset from a public s3 bucket using the R package. Is it possible to not sign requests for objects like these? In the AWS CLI, that's done with the --no-sign-request argument. Is there any way that this can be…
2 votes · 0 answers

Create small file with dates and timestamps from c++ api

I am trying to create, from a c++ program, a .parquet file as small as possible. I would like to use the parquet::StreamWriter. TL;DR: What are the best compression / encoder settings and how should the columns (parquet::schema::PrimitiveNode) be…
Weatherwax · 21 · 1 · 3
2 votes · 1 answer

How to connect to HDFS from R and read/write parquets using arrow?

I have a couple of parquet files in HDFS that I'd like to read into R, and some data in R I'd like to write into HDFS and store in parquet file format. I'd like to use the arrow library, because I believe it's the R equivalent of pyarrow and pyarrow is…
ira · 2,542 · 2 · 22 · 36
2 votes · 1 answer

Is there a method to control Apache Arrow Batch Sizes?

I'd like to understand if there's a mechanism to control batch sizes being sent from server to client. I've implemented the python server from the Github repo and a basic F# client. As a test, I've added a flight containing 1 million rows which I'd…
2 votes · 0 answers

Difference between table and datasets API in arrow

From the documentation, I understand that arrow provides the datasets API to deal with the bigger data than memory. Both have the capability for the automatic predicate/projection pushdown features (which makes it deal with greater than in-memory…
Xion · 319 · 2 · 11
2 votes · 1 answer

arrow parquet partitioning, multiple datasets in same directory structure in R

I have multiple datasets stored in a partitioned parquet format using the same partitioning file structure, e.g. the directory structure is…
Matt SM · 235 · 3 · 13
2 votes · 0 answers

How to adapt Arrow.Table columns (naturally per record batch basis) into CuArrays for GPU processing?

(Also asked as an issue at Arrow.jl) I'm figuring out ways to have table columns processed by GPU, those coming from "arrow file" format files mmapped for zero-copy. The full series cannot fit into GPU RAM, while each record batch can, so one thing is…
Compl Yue · 164 · 1 · 3 · 16
2 votes · 1 answer

r arrow set column type/schema to char for all columns

{arrow}'s auto-detection of column types is causing me some trouble when opening a large csv file. In particular, it drops leading zeroes for some identifiers and does some other unfortunate stuff. As the dataset is quite wide (a few hundred cols)…
Rob G. · 395 · 1 · 8
2 votes · 1 answer

pyarrow read_csv - how to fill trailing optional columns with nulls

I can't find an option or workaround for this using pyarrow.csv.read_csv and there are many other reasons why using pandas doesn't work for us. We have csv files with final columns that are effectively optional, and the source data doesn't always…
Kevin Crouse · 43 · 1 · 7
2 votes · 0 answers

Error loading native library libarrow_dataset_jni.dylib in Apache Arrow 7.0.0 Java Dataset API

I got this error when I tried to run a simple Java program that uses the Dataset API available in version 7.0.0. The error occurred when trying to get a NativeMemoryPool using the getDefault() method. . . . NativeMemoryPool nativeMemoryPool =…
CEPEL SOMA · 98 · 6
2 votes · 0 answers

Convert Apache Arrow Table to string, the import on Table not working

I'm following this example: https://arrow.apache.org/docs/js/ import { readFileSync } from 'fs'; import { Table } from 'apache-arrow'; const arrow = readFileSync('simple.arrow'); const table =…
2 votes · 1 answer

Converting Apache Arrow Table to RecordBatch in c++

I would like to obtain a std::shared_ptr<arrow::RecordBatch> from a std::shared_ptr<arrow::Table> as std::shared_ptr<arrow::Table> table = ... auto rb = arrow::RecordBatch::Make(table->schema(), table->num_rows(),…
marital_weeping · 618 · 5 · 18
2 votes · 1 answer

pyarrow Table to PyObject* via pybind11

#include #include #include #include #include // Convert pyarrow table to native C++ object and print its contents void print_table(PyObject* py_table) { //…
marital_weeping · 618 · 5 · 18
2 votes · 1 answer

Write Apache Arrow table to string C++

I'm trying to write an Apache Arrow table to a string. My big example has problems and I can't get this little example to work. This one segfaults inside of Arrow in the WriteTable call. My bigger example doesn't appear to serialize…
user2183336 · 706 · 8 · 19
2 votes · 1 answer

Read a partitioned parquet dataset from multiple files with PyArrow and add a partition key based on the filename

I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer. I can read them all and subsequently convert to a pandas dataframe: files =…
Cedric H. · 7,980 · 10 · 55 · 82