Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details see this

595 questions
2
votes
2 answers

How can I get the row view of data read from parquet file?

Example: Let's say a table named user has id, name, email, phone, and is_active as attributes, and there are thousands of users in this table. I would like to read the details per user. void ParquetReaderPlus::read_next_row(long row_group_index,…
Shravan40
  • 8,922
  • 6
  • 28
  • 48
2
votes
1 answer

How to cast pyarrow timestamp dtype to time64 type?

I'm trying to cast a pyarrow timestamp type to the time64 type, but it shows a cast error. import pyarrow as pa from datetime import datetime dt = datetime.now() table = pa.Table.from_pydict({'ts': pa.array([dt, dt])}) new_schema = table.schema.set(0,…
Avinash Raj
  • 172,303
  • 28
  • 230
  • 274
2
votes
1 answer

Data compression using Arrow.jl in Julia

I tried to compress data using Arrow.jl. However, a test run using the code below didn’t show any size reduction (or compression). May I seek advice on my implementation? Is there something I am doing wrong? Code: using CSV, DataFrames,…
Mohammad Saad
  • 1,935
  • 10
  • 28
2
votes
0 answers

How to build an Apache Arrow message containing a list of structs with arrow-rs?

I'm using the arrow-rs crate (version 4.4) to declare the following schema: Schema::new(vec![ Field::new("name", DataType::Utf8, false), Field::new("attributes", DataType::List( Box::new(Field::new( …
lquerel
  • 153
  • 1
  • 8
2
votes
1 answer

How to initialise a fixed-size ListArray in pyarrow from a numpy array efficiently?

How would I efficiently initialise a fixed-size pyarrow.ListArray from a suitably prepared numpy array? The documentation of pyarrow.array indicates that a nested iterable input structure works, but in practice that does not work if the outer…
burnpanck
  • 1,955
  • 1
  • 12
  • 36
2
votes
1 answer

Writing a Vec of Rows to a Parquet file

I know how to read a Parquet file into a Vec. extern crate parquet; use parquet::file::reader::{FileReader, SerializedFileReader}; use std::{fs, sync::Arc}; use parquet::column::writer::ColumnWriter; use parquet::{ file::{ …
tsorn
  • 3,365
  • 1
  • 29
  • 48
2
votes
3 answers

What is the best way of using arrow parquet in more modern cmake?

Below is the solution that worked for me, but I am not sure whether it is the best way to do this. I used brew to install it. vcpkg does not work at the moment, unfortunately. What I don't like about this solution is that I need to set Parquet_DIR and…
Amir
  • 189
  • 2
  • 12
2
votes
1 answer

Error: Invalid: Unrecognized filesystem type in URI when loading parquet file from url using arrow package

I'm pretty new to the Parquet file format, and I'm using read_parquet() (in the arrow package) to load a Parquet file (stored in my Dropbox shared folder) into R. However, I received the following error message: library(arrow) df <-…
Chris T.
  • 1,699
  • 7
  • 23
  • 45
2
votes
0 answers

Write C++ data to Apache Parquet: ParquetFileWriter or Write Arrow Table?

I'm looking for the proper way to write data to a Parquet file in C++. It seems like there are two choices: either writing directly to Parquet or writing to Arrow and then to Parquet. Is writing to Arrow then converting to Parquet with WriteTable…
user2183336
  • 706
  • 8
  • 19
2
votes
1 answer

apache-arrow does not compile with typescript

I posted this question for the @apache-arrow/ts library as well. I've been able to get this to bundle with webpack, but I've been considering using rollup instead for other issues I'm having with my library. However, that requires me to do a tsc…
westandy
  • 1,360
  • 2
  • 16
  • 41
2
votes
2 answers

MethodError when trying to get a row from an Arrow Dataframe in Julia

I have a dataset that looks like this: I am taking a CSV file, converting it to Parquet and then sending it to Arrow. There is a reason why I am doing it like this. My goal is to get access to the information in row "Algeria". This is my code: df =…
Onur-Andros Ozbek
  • 2,998
  • 2
  • 29
  • 78
2
votes
0 answers

How can I get "year" "month" "date" from timestamp in pyarrow?

I am trying to extract the "year", "month", and "date" from the Arrow timestamp[s] type. I know how to do it in pandas, as follows: import pyarrow.dataset as ds dataset = ds.dataset(path, format="csv") table = dataset.to_table() ## following codes wont…
Xion
  • 319
  • 2
  • 11
2
votes
1 answer

How to change column datatype with pyarrow

I am reading a set of arrow files and am writing them to a parquet file: import pathlib from pyarrow import parquet as pq from pyarrow import feather import pyarrow as pa base_path = pathlib.Path('../mydata') fields = [ pa.field('value',…
ARF
  • 7,420
  • 8
  • 45
  • 72
2
votes
1 answer

Can I add a new column without rewriting an entire file?

I've been experimenting with Apache Arrow. I have used column-oriented memory-mapped files for many years. In the past, I've used a separate file for each column. Arrow seems to like to store everything in one file. Is there a way to add a…
2
votes
1 answer

Apache Arrow Bus Error/Seg Fault when using Python bindings

I am writing data to parquet files. Apache Arrow provides a straightforward example for doing this: parquet-arrow, in which the data flow is essentially: data => arrow::ArrayBuilder => arrow::Array => arrow::Table => parquet file. This works fine as…
AJ Donich
  • 31
  • 5