Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details see this

595 questions
2
votes
0 answers

R arrow reads more columns than exist in the original data

The original dataset should only contain 28 columns but arrow returns 29 columns. The code is as below: schema1 = arrow::schema( Key =int64(), Sex = string(), Age = int64(), …
doraemon
  • 439
  • 3
  • 10
2
votes
0 answers

Apache Arrow C++ Parquet: how to read and decode min and max value statistics

I'm writing a program using the Apache Arrow C++ library to extract metadata from a parquet file, and I've been having a lot of trouble finding documentation and examples. After some trial and error I managed to do the job using this…
Alberto Pires
  • 319
  • 1
  • 5
  • 10
2
votes
1 answer

Identify partitioning variable in parquet file

Is there an easy way of identifying the variable that was used to partition a parquet dataset? As an example, below I create a toy parquet using the mtcars dataset. # Load library library(arrow) # Write data to parquet mtcars |>…
Dan
  • 11,370
  • 4
  • 43
  • 68
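One way to approach the partitioning question above is to look at the directory names, assuming the dataset was written with Hive-style `key=value` directories. The sketch below is not taken from the question or its answer: the `partition_keys` helper and the `cyl` file listing are hypothetical, and it inspects path strings only, without calling any Arrow API.

```python
from pathlib import PurePosixPath


def partition_keys(relative_paths):
    """Return partition column names found in Hive-style paths, in first-seen order."""
    keys = []
    for path in relative_paths:
        # Every component except the last one (the file name) is a directory.
        for part in PurePosixPath(path).parts[:-1]:
            key, sep, _value = part.partition("=")
            # Only directories shaped like key=value count as partition levels.
            if sep and key and key not in keys:
                keys.append(key)
    return keys


# Hypothetical listing of a dataset partitioned on a column named `cyl`.
files = ["cyl=4/part-0.parquet", "cyl=6/part-0.parquet", "cyl=8/part-0.parquet"]
print(partition_keys(files))
```

If the dataset was written without Hive-style directory names, the paths carry only values and this approach cannot recover the column name.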
2
votes
1 answer

How to prevent arrow from pulling data into R when a binding is not found for a given function?

I wonder if there is a way to prevent arrow from pulling data into R by default when it cannot find a suitable binding, so that instead of issuing the "pulling data into R" warning message, arrow throws an error. Is there an…
andreranza
  • 93
  • 6
2
votes
2 answers

R + Arrow 10 : convert blank to numeric NA

Please have a look at the reprex at the end of the post. I need to read a column as a string, perform several manipulations and then convert it to a numerical column. The blanks ("") in the string column give me a headache because arrow does…
larry77
  • 1,309
  • 14
  • 29
2
votes
0 answers

JS apache-arrow tableFromIPC's supported compression method/level?

I'm using file systems (local, Google Cloud Storage, and maybe S3) to exchange data between the web front end (JS) and back end (Python). After writing Arrow IPC data format to file systems using the Python back end like below: with…
Yan Yang
  • 1,804
  • 2
  • 15
  • 37
2
votes
0 answers

R+Arrow 10.0: Bug when Using gsub?

A bit of a follow-up to "R+arrow: Error when using the dataset API". Please have a look at the reprex at the end of the post. Essentially, I work on a data file without loading it into memory and I want to replace "" with "0" in a string. In…
larry77
  • 1,309
  • 14
  • 29
2
votes
1 answer

R: how to refer to a variable name as a string in a function that uses arrow::open_dataset internally

Trying to create a function that will compute the average of some variable, whose name is provided in the function. For instance: mean_of_var <- function(var){ open_dataset('myfile') %>% summarise(meanB=mean(get(var) ,na.rm = T), …
LucasMation
  • 2,408
  • 2
  • 22
  • 45
2
votes
1 answer

Can we read a parquet file and partition file in java arrow similar to pyarrow?

I have been trying to implement the pyarrow code below in Java but could not find anything. Can you please suggest whether it is even possible to implement this in Java Arrow, or whether there is an alternative library to achieve it? table1 =…
2
votes
0 answers

How to read csv with \" within quoted string with read_csv_arrow

I have a large csv file that I'd like to read with arrow::read_csv_arrow(). However, the file contains quoted strings. readr::read_delim() is able to read the file (given correct settings), while arrow::read_csv_arrow() is…
Thomas K
  • 3,242
  • 15
  • 29
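The quoting convention this question describes, a backslash-escaped `\"` inside a quoted field, can be illustrated with Python's standard `csv` module. This is only a sketch of the convention on a made-up sample string; it does not use Arrow and does not reproduce the asker's file. The arrow R documentation lists an `escape_backslash` argument for its CSV readers, which may be the analogous setting, but that is not verified here against the asker's data.

```python
import csv
import io

# Hypothetical sample: the second field contains \" inside a quoted string.
raw = 'id,text\n1,"He said \\"hi\\""\n'

# escapechar marks the backslash as an escape character; doublequote=False
# turns off the default convention of writing a literal quote as "".
reader = csv.reader(io.StringIO(raw), escapechar="\\", doublequote=False)
rows = list(reader)
print(rows)
```

With these settings the escaped quotes are kept as literal characters in the field instead of ending it early.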
2
votes
1 answer

How do I use generics in Apache Arrow?

Say I have a function called boop. It has different behaviour depending on the class of its argument, so I use generics, like so: library(dplyr) df <- data.frame(a = c("these", "are", "some", "strings"), b = 1:4) boop <-…
Dan
  • 11,370
  • 4
  • 43
  • 68
2
votes
1 answer

Summarise before collecting in arrow using strings for column names

Say I want to summarise a column in an arrow table prior to collecting (because the actual dataset is larger than memory). I could do something like this: arrow_table(mtcars) %>% summarise(mean(mpg)) %>% collect() # A tibble: 1 × 1 # …
Dan
  • 11,370
  • 4
  • 43
  • 68
2
votes
0 answers

How can I use R Arrow and AWS S3 in a shiny app deployed on EC2 with shinyproxy

I have been testing out the apache-arrow R package to fetch data from S3 (parquet files) for some shiny apps and have had some success. However, while everything works as expected during local development, after deploying to shinyproxy on an EC2…
2
votes
0 answers

Set as `NA` when Arrow's schema cannot parse values of a CSV in R

I am trying to read a csv (~ 18,000,000 rows, ~ 1000 columns) into arrow (in R) with open_dataset pre-specifying a schema. There are some instances in which the csv was generated incorrectly and some values don't match the intended schema (say some…
Rodrigo Zepeda
  • 1,935
  • 2
  • 15
  • 25
2
votes
2 answers

How to get columns data from golang apache-arrow?

I am using apache-arrow/go to read parquet data. I can parse the data into a table by using apache-arrow. reader, err := ipc.NewReader(buf, ipc.WithAllocator(alloc)) if err != nil { log.Println(err.Error()) return nil } …
Pccc
  • 47
  • 4