Questions tagged [pyarrow]

pyarrow is a Python interface for Apache Arrow.

About:

pyarrow provides the Python API of Apache Arrow.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

1078 questions
5 votes, 1 answer

"Raise RuntimeError('Not supported on 32-bit Windows')" when installing pyarrow

I get this error whenever I try to install pyarrow on my PC. It is 64-bit, so I don't understand it: raise RuntimeError('Not supported on 32-bit Windows') RuntimeError: Not supported on 32-bit Windows ---------------------------------------- …
WorkDoubts
  • 63
  • 1
  • 4
5 votes, 1 answer

How to use Pandas UDFs on macOS Mojave? (that fails due to [__NSPlaceholderDictionary initialize] may have been in progress...)

I'm trying to use Pandas UDFs (a.k.a. Vectorized UDFs) in Apache Spark 2.4.0 on macOS 10.14.3 (macOS Mojave). I installed pandas and pyarrow using pip (and later pip3). Whenever I execute the sample code from the official documentation of Spark SQL…
Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
5 votes, 2 answers

Read CSV with PyArrow

I have large CSV files that I'd ultimately like to convert to parquet. Pandas won't help because of memory constraints and its difficulty handling NULL values (which are common in my data). I checked the PyArrow docs and there are tools for…
dudemonkey
  • 1,091
  • 5
  • 15
  • 26
5 votes, 2 answers

Python pandas_udf spark error

I started playing around with Spark locally and ran into this weird issue: 1) pip install pyspark==2.3.1 2) pyspark> import pandas as pd from pyspark.sql.functions import pandas_udf, PandasUDFType, udf df = pd.DataFrame({'x':…
Shrikar
  • 840
  • 1
  • 8
  • 30
5 votes, 1 answer

Can I [de]serialize a dictionary of dataframes in the arrow/js implementation?

I want to use Apache Arrow to send data from a Django backend to an Angular frontend. I want to use a dictionary of dataframes/tables as the payload in messages. It's possible with pyarrow to share data in this way between Python microservices, but I…
gabomgp
  • 769
  • 1
  • 10
  • 23
5 votes, 4 answers

How to efficiently read rows from Google BigTable into a pandas DataFrame

Use case: I am using Google BigTable to store counts like this: | rowkey | columnfamily | | | col1 | col2 | col3 | |----------|------|------|------| | row1 | 1 | 2 | 3 | | row2 | 2 | 4 | 8 | | row3 | 3 …
bartaelterman
  • 795
  • 10
  • 26
5 votes, 1 answer

Are Parquet files created with pyarrow and pyspark compatible?

I have to convert analytics data in JSON to parquet in two steps. For the large amounts of existing data I am writing a PySpark job and doing df.repartition(*partitionby).write.partitionBy(partitionby). …
siberiancrane
  • 586
  • 1
  • 6
  • 20
5 votes, 1 answer

hdfs.connect() vs HdfsClient in PyArrow

I apologize if this is a noob question, but I couldn't find any relevant reference - what is the difference between these two? If I'd like to read parquet files from hdfs using pyarrow, which one would I use?
Jay
  • 2,535
  • 3
  • 32
  • 44
4 votes, 3 answers

Connect python-polars to SQL server (no support currently)

How can I directly connect MS SQL Server to polars? The documentation does not list any supported connections but recommends the use of pandas. Update: SQL Server Authentication works per answer, but Windows domain authentication is not working. see…
Isaacnfairplay
  • 217
  • 2
  • 18
4 votes, 3 answers

ModuleNotFoundError: No module named 'pyarrow.lib'

This is the full error message. Traceback (most recent call last): File "C:\Users\adi\OneDrive\Desktop\Python310\machine learning project.py", line 3, in <module> import streamlit as st File…
Addy
  • 61
  • 2
  • 6
4 votes, 0 answers

How to convert pyarrow.Table to PySpark Dataframe?

I have a pyarrow.Table object that I want to pass to PySpark (and save as a Spark table). How can I convert pyarrow.Table to pyspark.sql.DataFrame? The only way I can see is to convert it to pandas.DataFrame, but aren't there some more direct and…
Felix
  • 3,351
  • 6
  • 40
  • 68
4 votes, 2 answers

How to sort a Pyarrow table?

How do I sort an Arrow table in PyArrow? There does not appear to be a single function that will do this; the closest is sort_indices.
Contango
  • 76,540
  • 58
  • 260
  • 305
4 votes, 3 answers

How would I go about converting a .csv to an .arrow file without loading it all into memory?

I found a similar question here: Read CSV with PyArrow. That answer references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how that would be used to write the .arrow file, or name it. I can't seem to find the exact…
kasbah512
  • 43
  • 4
4 votes, 1 answer

Why does Dask seem to store Parquet inefficiently

When I save the same table using Pandas and Dask into Parquet, Pandas creates a 4k file, whereas Dask creates a 39M file. Create the dataframe import pandas as pd import pyarrow as pa import pyarrow.parquet as pq import dask.dataframe as dd n =…
Dahn
  • 1,397
  • 1
  • 10
  • 29
4 votes, 2 answers

Read last N rows of S3 parquet table

If I apply what was discussed here to read parquet files in an S3 bucket into a pandas DataFrame, particularly: import pyarrow.parquet as pq import s3fs s3 = s3fs.S3FileSystem() pandas_dataframe = pq.ParquetDataset('s3://your-bucket/',…
Tristan Tran
  • 1,351
  • 1
  • 10
  • 36