Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.


1078 questions
0 votes · 2 answers

Connecting pyarrow with libhdfs3

I'm trying to connect to a Hadoop cluster via pyarrow's HdfsClient / hdfs.connect(). I noticed pyarrow's have_libhdfs3() function, which returns False. How does one go about getting the required HDFS support for pyarrow? I understand there's a…
Jay · 2,535
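For the libhdfs3 question above: newer pyarrow releases have dropped the libhdfs3 path, and the legacy hdfs.connect() API has been superseded by pyarrow.fs.HadoopFileSystem, which talks to HDFS through the JNI libhdfs driver. A minimal sketch, assuming a reachable namenode and a configured Hadoop client; the host, port, and path are placeholders.

```python
# Minimal sketch, assuming the JNI libhdfs driver is available and that
# CLASSPATH / ARROW_LIBHDFS_DIR point at the local Hadoop installation.
# Host, port, and path are placeholders.
import pyarrow.fs as fs

hdfs = fs.HadoopFileSystem(host="namenode", port=8020)  # connects via libhdfs
with hdfs.open_input_stream("/tmp/example.txt") as stream:
    print(stream.read())
```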
-1 votes · 0 answers

Attempting to read Reddit Pushshift .zst and convert to .parquet; the code stops early with no error messages

For a research project, I need to convert Reddit data from 2018 from .zst to Parquet in order to share the data with my partners. In order to do this, I modified Watchful1's .zst-to-.csv converter. I made the following changes: Lines 67-72: create…
Tea Pod · 21
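A possible shape for the .zst-to-Parquet step, assuming the dump is newline-delimited JSON as Pushshift files are. The file name and selected fields are placeholders, and a real 2018 dump would need to be written out in batches rather than held in memory.

```python
# Sketch only: decompress a Pushshift .zst dump line by line and write the
# collected records to Parquet. File name and fields are placeholders.
import io
import json
import zstandard as zstd
import pyarrow as pa
import pyarrow.parquet as pq

rows = []
with open("RS_2018-01.zst", "rb") as fh:
    # Pushshift dumps need a large decompression window.
    reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        record = json.loads(line)
        rows.append({"id": record.get("id"),
                     "subreddit": record.get("subreddit"),
                     "title": record.get("title")})

pq.write_table(pa.Table.from_pylist(rows), "RS_2018-01.parquet")
```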
-1 votes · 1 answer

Having an issue installing Streamlit with pip; I believe the failure is linked to pyarrow and CMake. I'm running macOS High Sierra 10.13

I'm trying to install Streamlit with pip. I've tried using a virtual environment with Python versions 3.9 and 3.10. Not sure if reinstalling Python will fix it...? Any help would be much appreciated; I've been stuck on this for 5 days now. Here is the…
jgroshak · 3
-1 votes · 1 answer

Writing a parquet file from Python that is compatible with SQL/Impala

I am trying to write a pandas DataFrame to a parquet file that is compatible with a table in Impala, but am struggling to find a solution. My df has 3 columns: code (int64), number (float), name (object). When I create this into a parquet file and load it…
geds133 · 1,503
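A sketch of the write_table options most often adjusted for Impala/Hive compatibility, using the three columns named in the question; whether these two settings are enough depends on the Impala version and the table definition.

```python
# Sketch: write a pandas DataFrame to Parquet with settings commonly used
# for Impala/Hive compatibility. The data is illustrative.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"code": [1, 2], "number": [1.5, 2.5], "name": ["a", "b"]})
table = pa.Table.from_pandas(df, preserve_index=False)

pq.write_table(
    table,
    "impala_friendly.parquet",
    flavor="spark",                        # sanitize schema/metadata for older readers
    use_deprecated_int96_timestamps=True,  # older Impala expects INT96 timestamps
)
```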
-1 votes · 1 answer

Converting CSV to Parquet File format using Script in SAP BODS

A BODS job is creating CSV files. Is there a way to convert CSV files to Parquet and upload them to an S3 bucket in SAP BODS? The current approach I am using for converting the CSV to Parquet is below: create a CSV file in the folder that BODS is…
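Setting the BODS and S3 wiring aside, the CSV-to-Parquet step itself is a short pyarrow script that a BODS job could shell out to; the file names below are placeholders.

```python
# Sketch of the conversion step only; file names are placeholders.
from pyarrow import csv
import pyarrow.parquet as pq

table = csv.read_csv("bods_output.csv")       # schema is inferred from the CSV
pq.write_table(table, "bods_output.parquet")  # same data in columnar form
```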
-1 votes · 1 answer

Where do I find the ParquetDatasetPiece class?

Reading the petastorm/etl/dataset_metadata.py script, I found this code: if row_groups_key != ".": for row_group in range(row_groups_per_file[row_groups_key]): rowgroups.append(pq.ParquetDatasetPiece(piece.path, …
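For context, ParquetDatasetPiece was part of pyarrow's legacy ParquetDataset API (pyarrow.parquet.ParquetDatasetPiece) and, depending on the installed version, may have been deprecated or removed. A rough modern counterpart for per-file access is pyarrow.dataset fragments; a minimal sketch with a placeholder path:

```python
# Sketch of the pyarrow.dataset replacement for per-piece access; the
# directory path is a placeholder.
import pyarrow.dataset as ds

dataset = ds.dataset("some/parquet/dir", format="parquet")
for fragment in dataset.get_fragments():
    print(fragment.path, fragment.count_rows())
```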
-1 votes · 1 answer

Store data frame as parquet with mixed datatypes in one column (timestamp and string)

I want to store a pandas data frame as a Parquet file, but I got this error: pyarrow.lib.ArrowTypeError: ("object of type cannot be converted to int", 'Conversion failed for column foo with type object'). The column has mixed data…
buhtz · 10,774
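One common workaround for this ArrowTypeError is to make the mixed column homogeneous before conversion (or to supply an explicit schema). A minimal sketch, assuming the column is named foo as in the error message; the data is illustrative.

```python
# Sketch: cast the mixed column to a single type before converting.
# Column name "foo" is taken from the error message; the data is made up.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"foo": [pd.Timestamp("2020-01-01"), "not a date"]})

# Option 1: make every value a string so Arrow sees one consistent type.
df["foo"] = df["foo"].astype(str)
pq.write_table(pa.Table.from_pandas(df), "mixed.parquet")

# Option 2 (not shown): clean the column into real timestamps and pass an
# explicit pa.schema(...) to Table.from_pandas.
```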
-1 votes · 1 answer

Unable to import Python modules using Jython in Java

I am using Maven and this library to run a Python file: org.python jython-standalone 2.5.2. I am trying to run a Python file that contains import pyarrow.parquet as pq, and it gives me the error ImportError: No module named pyarrow. This module is…
K S · 1
-1 votes · 2 answers

Azure Function in Python: get schema of parquet file

Is it possible to get the schema of a parquet file using an Azure Function in Python without downloading the file from the data lake? I am using BlobStorageClient to connect to the data lake and get the files and containers, but I have no idea how I can dispatch the command…
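A sketch of one way to read just the Parquet metadata over the network, assuming the adlfs/fsspec filesystem rather than the BlobStorageClient mentioned in the question; the account, container, and path are placeholders.

```python
# Sketch, assuming adlfs (fsspec's Azure filesystem) is installed and
# credentials are configured; account, container, and path are placeholders.
import adlfs
import pyarrow.parquet as pq

fs = adlfs.AzureBlobFileSystem(account_name="myaccount")  # add credentials as needed
with fs.open("mycontainer/path/to/file.parquet", "rb") as f:
    schema = pq.ParquetFile(f).schema_arrow  # only the footer metadata is parsed
print(schema)
```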
-2 votes · 2 answers

How do I write this Python code to use 2+ fewer nested if statements?

I have the following code, which I use to loop through row groups in a parquet metadata file to find the maximum values for columns i, j, k across the whole file. As far as I know, I have to find the max value in each row group. I am looking for: how…
Roochiedoor · 887
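For the underlying task, pyarrow's metadata API plus the built-in max() keeps the loop flat; the file name and column indices below are placeholders, and row-group statistics can be None if the writer did not record them.

```python
# Sketch: per-column maximum across all row groups, taken from the Parquet
# footer statistics. File name and column indices are placeholders.
import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata
for col_idx in (0, 1, 2):  # e.g. the columns called i, j, k in the question
    stats = (meta.row_group(rg).column(col_idx).statistics
             for rg in range(meta.num_row_groups))
    column_max = max(s.max for s in stats if s is not None)
    print(meta.schema.column(col_idx).name, column_max)
```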
-2 votes · 2 answers

Convert string values in a dictionary to executable values in Python

I have one dictionary named column_types with values as below: column_types = {'A': 'pa.int32()', 'B': 'pa.string()'}. I want to pass the dictionary to the pyarrow read CSV function as below: from pyarrow import csv table…
diwakar g · 99
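Rather than eval-ing the strings, one option is an explicit lookup from string to pyarrow DataType, passed to read_csv through ConvertOptions; the CSV file name is a placeholder.

```python
# Sketch: resolve the string type names to real pyarrow types, then pass
# them to read_csv. The CSV file name is a placeholder.
import pyarrow as pa
from pyarrow import csv

column_types = {'A': 'pa.int32()', 'B': 'pa.string()'}

# An explicit lookup is safer than eval() on arbitrary strings.
lookup = {'pa.int32()': pa.int32(), 'pa.string()': pa.string()}
resolved = {name: lookup[type_str] for name, type_str in column_types.items()}

table = csv.read_csv(
    "data.csv",
    convert_options=csv.ConvertOptions(column_types=resolved),
)
```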
-4 votes · 1 answer

I want to install Streamlit but am getting an error in pyarrow

I want to install Streamlit but am getting an error in pyarrow: Using cached pyarrow-1.0.1.tar.gz (1.3 MB) Installing build dependencies ... error ERROR: Command…