Questions tagged [apache-arrow]

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing.

For installation details, see the official Apache Arrow installation guide.

595 questions
2
votes
1 answer

Converting Arbitrary Objects into Bytes in Python3

My goal is to feed an object that supports the buffer protocol into hashlib's sha2 generator such that sha2 hashes generated from the same underlying data in different execution environments are consistent, and so can be used for equality tests. I…
Alex Flanagan
  • 557
  • 4
  • 9
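A minimal sketch of the question's goal, using only the standard library: a `memoryview` exposes the raw bytes of any buffer-protocol object without copying, and `hashlib` can digest it directly. The function name `sha256_of_buffer` is illustrative, not from the question; note that multi-byte element types make the digest endianness-dependent across machines.

```python
import array
import hashlib

def sha256_of_buffer(obj) -> str:
    # Cast to a flat unsigned-byte view so the digest depends only on
    # the raw underlying bytes, not on the object's shape or item size.
    view = memoryview(obj).cast("B")
    return hashlib.sha256(view).hexdigest()

a = array.array("i", [1, 2, 3])
b = array.array("i", [1, 2, 3])  # distinct object, same underlying data
assert sha256_of_buffer(a) == sha256_of_buffer(b)
```

Two objects backed by identical bytes therefore hash equal, which is what makes the digest usable for equality tests.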
2
votes
2 answers

Is there a way to deal with embedded nuls while reading in parquet files?

I have data scraped from the internet (hence varied encodings) and stored as parquet files. While processing it in R I use the arrow library. For the following code…
Akash21795
  • 61
  • 1
  • 12
2
votes
1 answer

Write Parquet MAP datatype by PyArrow

I'm writing in Python and would like to use PyArrow to generate Parquet files. Per my understanding and the Implementation Status page, the C++ (and Python) library has already implemented the MAP type. From the Data Types, I can also find the type…
Yucan
  • 21
  • 3
2
votes
1 answer

Reading schema & metadata from a parquet file

I am reading a third-party parquet file using parquetjs-lite: const parquet = require("parquetjs-lite"); reader = await parquet.ParquetReader.openFile(fileName); cursor = reader.getCursor(); I am able to read the records (and rowCount) but how…
user14013917
  • 149
  • 1
  • 10
2
votes
0 answers

Error: "as_tibble not exported by namespace arrow" with Apache Arrow on Databricks using R

I am working with R on (Azure) Databricks and wanted to enable Apache Arrow for I/O. However, using the sample code below, I'm getting a weird error that I cannot trace back. The error is occurring on clusters using Databricks runtime ML7.0 (Spark…
K.O.T.
  • 111
  • 10
2
votes
0 answers

Read Parquet file in to array of C++ structs

Originally I was writing and reading C++ struct data to file as binary, using reinterpret_cast. This was good because no code changes were required when a new member was added; the cast handled it automatically. I'm now writing to a Parquet file…
user997112
  • 29,025
  • 43
  • 182
  • 361
2
votes
0 answers

Updating Parquet datasets where the schema changes over time

I have a single parquet file that I have been incrementally building every day for several months. The file size is around 1.1GB now, and when read into memory it approaches my PC's memory limit. So, I would like to split it up into several…
matthewmturner
  • 566
  • 7
  • 21
2
votes
1 answer

Install Apache Arrow Java in Eclipse

I'm currently trying to install Apache Arrow for Java in Eclipse and am having some trouble. I've found the Java packages on https://search.maven.org/search?q=g:org.apache.arrow%20AND%20v:0.17.1 Because I didn't find any information about the…
G.M
  • 530
  • 4
  • 20
2
votes
0 answers

Using apache-arrow in a browser application - Typescript compiler errors

Attempting to use apache-arrow within a browser application, but the TypeScript compiler throws the following errors in some of Arrow's .d.ts files: import { Table } from "../node_modules/@apache-arrow/es2015-esm/Arrow"; export class SomeClass…
shyamals
  • 21
  • 1
2
votes
1 answer

Pyarrow table memory compared to raw csv size

I have a 2GB CSV file that I read into a pyarrow table with the following: from pyarrow import csv tbl = csv.read_csv(path) When I call tbl.nbytes I get 3.4GB. I was surprised at how much larger the data was in Arrow memory than as a CSV. Maybe…
matthewmturner
  • 566
  • 7
  • 21
2
votes
1 answer

TypeError: field Customer: Can not merge type and

My DataFrame:

SL No  Customer  Month      Amount
1      A1        12-Jan-04  495414.75
2      A1        3-Jan-04   245899.02
3      A1        15-Jan-04  259490.06

Code: import findspark findspark.init('/home/mak/spark-3.0.0-preview2-bin-hadoop2.7') import pyspark from…
user6882757
2
votes
2 answers

Add a subproject by CMake

Apache Arrow submodule is stored at thirdparty/apache_arrow/cpp, so my main CMakeLists.txt looks like cmake_minimum_required(VERSION 3.0.0) project(arrow_parcer VERSION 0.1.0) add_subdirectory(src) add_subdirectory(thirdparty/apache_arrow/cpp) At…
2
votes
2 answers

Is there Spark Arrow Streaming = Arrow Streaming + Spark Structured Streaming?

Currently we have Spark Structured Streaming. In the Arrow docs I found Arrow streaming, where we can create a stream in Python, produce the data, and use a StreamReader to consume the stream in Java/Scala. I am wondering if there is an integration of these…
2
votes
0 answers

Pyarrow table create column from existing columns

Is there a way to use append_column to create a column based on columns that currently exist in a pyarrow table? I want to create a pa.struct() field using columns that already exist. Looking for something along the lines of the following: pa_table…
R.Z.
  • 101
  • 6
2
votes
1 answer

How to convert PyArrow table to Arrow table when interfacing between PyArrow in python and Arrow in C++

I have a C++ library which is built against the Apache Arrow C++ libraries, with a binding to Python using pybind11. I'd like to be able to write a function in C++ that takes a table constructed with PyArrow, like: void test(arrow::Table test); Passing…
Tim P
  • 415
  • 3
  • 11