Questions tagged [python-polars]

Polars is a DataFrame library/in-memory query engine.

The Polars core library is written in Rust and uses Arrow, the native arrow2 Rust implementation, as its foundation. It offers Python and JavaScript bindings, which serve as wrappers for functionality implemented in the core library.

1331 questions
2
votes
2 answers

How to average lists on different columns using polars LazyFrame

I have a polars LazyFrame which has 3 columns of type nullable list[f64], something like this. import polars as pl lf = pl.DataFrame({ "1": [ [0.0, 1.1, 2.2], [0.0, 1.1, 2.2], [0.0, 1.1, 2.2], None, ], …
Nemoos
  • 35
  • 4
2
votes
1 answer

Pythonic way to update a column of a Polars data frame based on matching condition from another column

In Polars, what is a one-liner way to update items of a column based on a matching condition from another column, maybe by applying a lambda? For example, I would like to multiply items in col1 by 1000 if items in col2 are equal to 'a'. Here's a crude…
beta green
  • 113
  • 10
2
votes
1 answer

polars python: split string in lazyframe column into new columns

I could not find any helpful info on that, so if anyone has some input... please. I need to split all strings in a column in a big (ca. 30 GB) csv file. For that I tried out polars. It seems to work fine, but I don't understand how I can map the values…
2
votes
1 answer

Using Polars to process multiple csv files and partition them differently

I have a directory full of csv files that are each 1m rows or more. Each csv file has an identical structure and has a "Date" column with no obvious ordering of these dates. I want to read in the csv files and then split them up by month-year…
deanm1
  • 243
  • 1
  • 2
  • 4
2
votes
3 answers

How can I perform operations between a list and scalar column in polars

In python polars, I was wondering if it will be possible to use .eval() to perform an operation between an element and a column. For example, given the following dataframe: import polars as pl df = pl.DataFrame({"list": [[2, 2, 2], [3, 3, 3]],…
2
votes
2 answers

With a Python context manager, I want to have a print statement that logs the row count before and after an operation

Using a function defined in a Python context manager, I want to modify a Polars dataframe by reassignment. I then want the function in the context manager to print the previous and new row counts. I tried the following: import polars as pl def…
2
votes
2 answers

How to unnest multiple struct columns with similar Structs in Polars, using disambiguating suffix

I have 2 columns with similar Structs (same field names, field types, etc.). nest = pl.DataFrame({ 'a':[{'x':1,'y':10},{'x':2,'y':20},], 'b':[{'x':3,'y':30},{'x':4,'y':40},] }) print(nest) shape: (2, 2) ┌───────────┬───────────┐ │ a …
Des1303
  • 79
  • 4
2
votes
0 answers

Polars: inefficiency of over expression

I found out that, at least for the scenario below, doing over is much slower (2~3x) than doing groupby/agg + explode, and the results are exactly the same. Based on this finding, I have the following questions: Is this behaviour expected? If so,…
lebesgue
  • 837
  • 4
  • 13
2
votes
1 answer

Why is my value returned as a pl.Float32 and not a pl.Date?

I have a dataframe with customers and order dates. Order dates can be either a pl.Date or a null, but currently this column contains only null values. I want to create a new column, "startdate_before_override", which is set to either the…
Balthazar
  • 81
  • 9
2
votes
2 answers

filling date gaps with polars

I have a problem I'm trying to solve but can't figure it out. I have something similar to this table: data = pl.DataFrame( {'id': [1,1,1,1,2,2,2,3,3], 'date': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-05-01', '2023-02-01',…
nam0_0
  • 59
  • 4
2
votes
4 answers

apply operations from string list to one (or more) column(s) in polars

I would need to apply multiple simple operations (sum/mean/max/min/median etc) to a single column. Is there a way to write that concisely without repeating myself? Right now I would need to write all these manually, df.select(pl.col("a").max(),…
Mark Wang
  • 2,623
  • 7
  • 15
2
votes
1 answer

How to get difference of months replicating Excel behavior

I would like to subtract two date columns and get their difference in months unit instead of days but haven't been able to. Data: import polars as pl from datetime import datetime test_df = pl.DataFrame( { "dt_end": [datetime(2022, 1,…
ViSa
  • 1,563
  • 8
  • 30
2
votes
1 answer

Polars - speedup by using partition_by and collect_all

Example setup Warning: 5gb memory df creation import time import numpy as np import polars as pl rng = np.random.default_rng(1) nrows = 50_000_000 df = pl.DataFrame( dict( id=rng.integers(1, 50, nrows), id2=rng.integers(1,…
lebesgue
  • 837
  • 4
  • 13
2
votes
3 answers

Sum columns of one dataframe based on another dataframe

I have two dataframes that look like these: df1 = pl.DataFrame( { "Name": ["A", "B", "C", "D"], "Year": [2001, 2003, 2003, 2004] } ) df2 = pl.DataFrame( { "Name": ["A", "B", "C", "D"], "2001": [111, 112,…
kodkod
  • 1,556
  • 4
  • 21
  • 43