Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation documentation.

595 questions
3
votes
3 answers

Log parquet filenames created by pyarrow on S3

We are appending data to an existing Parquet dataset stored in S3 (partitioned) by using pyarrow. This runs on AWS Lambda several times per hour. A minimal example would be: import pyarrow as pa import pyarrow.parquet as pq import s3fs df = ... #…
jarias
  • 150
  • 2
  • 12
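
A minimal sketch of one way to capture the file names pyarrow writes, assuming a recent pyarrow where pyarrow.dataset.write_dataset accepts a file_visitor callback and an fsspec filesystem; the bucket/prefix and column names below are placeholders, not from the question:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import s3fs

# Placeholder data standing in for the dataframe built inside the Lambda handler.
table = pa.table({"part": ["a", "a", "b"], "value": [1, 2, 3]})

fs = s3fs.S3FileSystem()
written_paths = []

ds.write_dataset(
    table,
    base_dir="my-bucket/my-dataset",  # placeholder bucket/prefix
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("part", pa.string())]), flavor="hive"),
    filesystem=fs,
    existing_data_behavior="overwrite_or_ignore",
    # file_visitor is called once per written file; record each file's path
    file_visitor=lambda written_file: written_paths.append(written_file.path),
)

print(written_paths)  # e.g. forward these to your logger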
3
votes
0 answers

python ray - pyarrow.lib.ArrowInvalid: Maximum size exceeded (2GB)

I am trying to load and process large files using ray. I am using ray for the purpose of multiprocessing the files and improving the speed of the solution. I keep running into this pyarrow error: pyarrow.lib.ArrowInvalid: Maximum size exceeded…
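
The 2 GB limit usually comes from the 32-bit offsets of a single string/binary array. One commonly suggested workaround is to cast such columns to their 64-bit-offset counterparts before the data passes through ray; a minimal sketch (column names are made up, and whether this helps depends on where ray serializes the data):

```python
import pyarrow as pa

def widen_string_columns(table: pa.Table) -> pa.Table:
    """Cast string/binary columns to large_string/large_binary so their
    offsets are 64-bit and a single column can hold more than 2 GB."""
    new_fields = []
    for field in table.schema:
        if pa.types.is_string(field.type):
            new_type = pa.large_string()
        elif pa.types.is_binary(field.type):
            new_type = pa.large_binary()
        else:
            new_type = field.type
        new_fields.append(pa.field(field.name, new_type, nullable=field.nullable))
    return table.cast(pa.schema(new_fields))

# Hypothetical usage: widen the table before handing it to ray.
table = pa.table({"text": ["some long text", "more text"], "n": [1, 2]})
table = widen_string_columns(table)
```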
3
votes
1 answer

Is there a size limitation/problem when writing the Apache Arrow format with the Java API?

My Arrow writer, which reads data from CSV files, works fine for data under 1 GB but gets stuck at about this limit (the writing code seems to block). I have given the process enough memory (-Xmx12g) and the data size is about 1.2 GB. A similar structured…
3
votes
1 answer

How to compress and decompress Arrow or Feather file?

I plan to change my data file format from Parquet to Feather. Parquet has compression options (lz4, etc.) and I have used them, but I cannot find them for Feather or Arrow files. Is compression not supported?
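
Feather v2 (the Arrow IPC file format) does support compression through pyarrow; a minimal sketch, assuming pyarrow was built with lz4/zstd support (file names are placeholders):

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"x": range(1000)})

# Write with explicit compression; supported codecs include "lz4" and "zstd".
feather.write_feather(df, "data.feather", compression="zstd", compression_level=5)

# Reading decompresses transparently; no extra arguments are needed.
df_back = feather.read_feather("data.feather")
```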
3
votes
1 answer

Arrow file size is the same as CSV?

I am trying to save a dataframe in .arrow format, mainly to get a smaller file size than CSV, so that I can use the file with vega-lite. I am using Python: import pandas import pyarrow as pa csv="C:/Users/mimoune.djouallah/data.csv" arrow…
Mim
  • 999
  • 10
  • 32
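
Arrow IPC/Feather files are written uncompressed by default, so their size can be comparable to (or larger than) the CSV. A minimal sketch of writing the same table with and without compression to compare sizes (the CSV path is a placeholder, not the one from the question):

```python
import os
import pandas as pd
import pyarrow as pa
import pyarrow.feather as feather

df = pd.read_csv("data.csv")  # placeholder path
table = pa.Table.from_pandas(df)

feather.write_feather(table, "data_uncompressed.arrow", compression="uncompressed")
feather.write_feather(table, "data_zstd.arrow", compression="zstd")

for path in ("data.csv", "data_uncompressed.arrow", "data_zstd.arrow"):
    print(path, os.path.getsize(path), "bytes")
```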
3
votes
1 answer

What are the reasons for passing a pointer to a shared_ptr to a function?

I was looking at the C++ API of Apache's Arrow library, and noticed that it is littered with member functions that take arguments of type std::shared_ptr*. To me this looks unnecessarily contrived and possibly brittle, and it is frankly strange…
masaers
  • 697
  • 9
  • 21
3
votes
1 answer

Converting PySpark DataFrame to Pandas using Apache Arrow

I'd like to convert a PySpark DataFrame (pyspark.sql.DataFrame) to a Pandas dataframe. There is a built-in method toPandas() which is very inefficient (please read Wes McKinney's article about this issue from February 2017 here and his calculation…
ahoosh
  • 1,340
  • 3
  • 17
  • 31
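
A minimal sketch of enabling the Arrow-based conversion path; the config key spark.sql.execution.arrow.pyspark.enabled applies to Spark 3.x (Spark 2.x used spark.sql.execution.arrow.enabled), and the sample DataFrame is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas").getOrCreate()

# Let toPandas() transfer data as Arrow record batches instead of row-by-row pickling.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Optional: fall back to the non-Arrow path if the conversion fails.
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

sdf = spark.range(1_000_000).withColumnRenamed("id", "value")
pdf = sdf.toPandas()  # now uses Arrow under the hood
```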
2
votes
0 answers

Unable to link pyarrow header files into a C++ pybind11 project with CMakeLists.txt due to linkage error

So after reading the [Apache Arrow Python integration documentation](https://arrow.apache.org/docs/python/integration/extending.html), Arrow C++ no longer includes the PyArrow header files; those headers now ship with PyArrow itself. I am able to…
See
  • 31
  • 6
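
Since the headers now ship inside the installed pyarrow package, one common approach is to ask pyarrow itself where they live and feed those paths to the build system; a minimal sketch (how the paths get passed into CMakeLists.txt is up to the project, and create_library_symlinks may be unnecessary depending on the pyarrow version):

```python
import pyarrow as pa

# Directory containing the arrow/python/*.h headers shipped with the wheel.
print("include dir :", pa.get_include())
# Library names and directories needed at link time (e.g. arrow, arrow_python).
print("libraries   :", pa.get_libraries())
print("library dirs:", pa.get_library_dirs())

# Some setups also need plain .so names for the bundled shared libraries;
# pyarrow provides a helper that creates the symlinks.
pa.create_library_symlinks()
```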
2
votes
1 answer

R with package arrow sums results in NA at random

When using the arrow package with R and dplyr, summarizing a variable results in NA at random, even though there are no NAs in the data. Example: library(arrow) library(dplyr) td <- tempdir() tzip <- file.path(td,…
Eduardo Leoni
  • 8,991
  • 6
  • 42
  • 49
2
votes
1 answer

Partition events by time interval using Apache Arrow in Go

I'm trying to split some events that I'm collecting from Kafka based on a time interval. Basically, the goal is to read the value in the datetime column and run a simple formula to check whether the current event falls in the current interval. If yes, then…
spaghettifunk
  • 1,936
  • 4
  • 24
  • 46
2
votes
1 answer

Using the datatypes specified in datatype.go of the Golang Apache Arrow implementation for constructing a schema

I am learning Apache Arrow and wanted to learn more about how to create a schema and an Arrow record. For this I referenced some material, but so far all of it just uses the primitive types for building a schema, like this: schema :=…
A Beginner
  • 393
  • 2
  • 12
2
votes
2 answers

arrow R duration/difftime casting to float

I am working with a large set of datasets containing time series. My time-series data include an ID and a value for each day over several years (about 90 GB in total). What I am trying to do is merge (non-equi join) the time series with a small…
jmarkov
  • 191
  • 9
2
votes
1 answer

Multiple Arrow CSV readers on the same file return null

I'm trying to read the same file using multiple goroutines, where each goroutine is assigned a byte offset to start reading from and a number of lines to read, lineLimit. I was successful in doing so when the file fits in memory by setting the…
Mohamed Yasser
  • 641
  • 7
  • 17
2
votes
1 answer

How to write anonymous functions in R arrow across

I have opened a .parquet dataset through the open_dataset function of the arrow package. I want to use across to clean several numeric columns at a time. However, when I run this code: start_numeric_cols = "sum" sales <- sales %>% mutate( …
2
votes
1 answer

How to write an arrow dataset based on a data.table grouping?

I have a dataset called df with year, month and day variables. I would like to use the write_dataset function to output a folder with the standard Arrow dataset layout, as in the following image: Within each folder there will be month=1,…
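
The question is about the R arrow package, but the hive-style year=/month=/day= folder layout it describes can be illustrated with the pyarrow equivalent of write_dataset; a minimal sketch, where the column names come from the question and everything else is a placeholder:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Placeholder table standing in for df from the question.
table = pa.table({
    "year": [2020, 2020, 2021],
    "month": [1, 2, 1],
    "day": [1, 15, 3],
    "value": [1.0, 2.0, 3.0],
})

# Produces folders like out_dataset/year=2020/month=1/day=1/ containing parquet parts.
ds.write_dataset(
    table,
    base_dir="out_dataset",
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("year", pa.int64()), ("month", pa.int64()), ("day", pa.int64())]),
        flavor="hive",
    ),
)
```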