Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the Apache Arrow installation guide.

595 questions
2 votes · 1 answer

Arrow Unsigned S3 Requests

I'm hoping to read a multi-file dataset from a public s3 bucket using the R package. Is it possible to not sign requests for objects like these? In the AWS CLI, that's done with the --no-sign-request argument. Is there any way that this can be…
2 votes · 0 answers

Create small file with dates and timestamps from c++ api

I am trying to create, from a c++ program, a .parquet file as small as possible. I would like to use the parquet::StreamWriter. TL;DR: What are the best compression / encoder settings and how should the columns (parquet::schema::PrimitiveNode) be…
Weatherwax · 21 · 1 · 3
2 votes · 1 answer

How to connect to HDFS from R and read/write parquets using arrow?

I have a couple of parquet files in HDFS that I'd like to read into R, and some data in R I'd like to write into HDFS and store in parquet file format. I'd like to use the arrow library, because I believe it's the R equivalent of pyarrow and pyarrow is…
ira · 2,542 · 2 · 22 · 36
2 votes · 1 answer

Is there a method to control Apache Arrow Batch Sizes?

I'd like to understand if there's a mechanism to control batch sizes being sent from server to client. I've implemented the python server from the Github repo and a basic F# client. As a test, I've added a flight containing 1 million rows which I'd…
2 votes · 0 answers

Difference between table and datasets API in arrow

From the documentation, I understand that arrow provides the datasets API to deal with the bigger data than memory. Both have the capability for the automatic predicate/projection pushdown features (which makes it deal with greater than in-memory…
Xion · 319 · 2 · 11
2 votes · 1 answer

arrow parquet partitioning, multiple datasets in same directory structure in R

I have multiple datasets stored in a partitioned parquet format using the same partitioning file structure, e.g. the directory structure is…
Matt SM · 235 · 3 · 13
2 votes · 0 answers

How to adapt Arrow.Table columns (naturally per record batch basis) into CuArrays for GPU processing?

(Also asked as an issue at Arrow.jl) I'm figuring out ways to have table columns processed by GPU, those coming from "arrow file" format files mmapped for zero-copy. The full series cannot fit into GPU RAM, while each record batch can, so one thing is…
Compl Yue · 164 · 1 · 3 · 16
2 votes · 1 answer

r arrow set column type/schema to char for all columns

{arrow}'s auto-detection of column types is causing me some trouble when opening a large csv file. In particular, it drops leading zeroes for some identifiers and does some other unfortunate stuff. As the dataset is quite wide (a few hundred cols)…
Rob G. · 395 · 1 · 8
2 votes · 1 answer

pyarrow read_csv - how to fill trailing optional columns with nulls

I can't find an option or workaround for this using pyarrow.csv.read_csv and there are many other reasons why using pandas doesn't work for us. We have csv files with final columns that are effectively optional, and the source data doesn't always…
Kevin Crouse · 43 · 1 · 7
2 votes · 0 answers

Error loading native library libarrow_dataset_jni.dylib in Apache Arrow 7.0.0 Java Dataset API

I got this error when I tried to run a simple Java program that uses the Dataset API available in version 7.0.0. The error occurred when trying to get a NativeMemoryPool using the getDefault() method. . . . NativeMemoryPool nativeMemoryPool =…
CEPEL SOMA · 98 · 6
2 votes · 0 answers

Convert Apache Arrow Table to string, the import on Table not working

I'm following this example: https://arrow.apache.org/docs/js/ import { readFileSync } from 'fs'; import { Table } from 'apache-arrow'; const arrow = readFileSync('simple.arrow'); const table =…
2 votes · 1 answer

Converting Apache Arrow Table to RecordBatch in c++

I would like to obtain a std::shared_ptr<arrow::RecordBatch> from a std::shared_ptr<arrow::Table> as std::shared_ptr<arrow::Table> table = ... auto rb = arrow::RecordBatch::Make(table->schema(), table->num_rows(),…
marital_weeping · 618 · 5 · 18
2 votes · 1 answer

pyarrow Table to PyObject* via pybind11

#include #include #include #include #include // Convert pyarrow table to native C++ object and print its contents void print_table(PyObject* py_table) { //…
marital_weeping · 618 · 5 · 18
2 votes · 1 answer

Write Apache Arrow table to string C++

I'm trying to write an Apache Arrow table to a string. My big example has problems and I can't get this little example to work. This one segfaults inside of Arrow in the WriteTable call. My bigger example doesn't appear to serialize…
user2183336 · 706 · 8 · 19
2 votes · 1 answer

Read a partitioned parquet dataset from multiple files with PyArrow and add a partition key based on the filename

I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer. I can read them all and subsequently convert to a pandas dataframe: files =…
Cedric H. · 7,980 · 10 · 55 · 82