Questions tagged [python-polars]

Polars is a DataFrame library/in-memory query engine.

The Polars core library is written in Rust and uses Arrow, the native arrow2 Rust implementation, as its foundation. It offers Python and JavaScript bindings, which serve as wrappers for functionality implemented in the core library.

1331 questions
0
votes
3 answers

polars dropna equivalent on list of columns

I'm a new polars user. Pandas has df.dropna. I need to replace this functionality, but I haven't found a dropna in polars. Searching for dropna currently yields no results in the Polars User Guide. My specific problem: convert the following statement…
Callum Rollo
  • 515
  • 3
  • 12
0
votes
0 answers

Lazily reading a parquet file with binary datatype in PyPolars

I hope this is a good question; if I should post this as an issue on the PyPolars GitHub instead, please let me know. I have a quite large parquet file where some columns contain binary data. These columns are not interesting to me right now, so it…
0
votes
1 answer

Principles of immutability and copy-on-write in polars python api

Hi, I'm working on a fan-fiction project: a full feature + syntax translation of pypolars to R called "minipolars". I understand that the pypolars API, e.g. DataFrame, in general exhibits immutable behavior, or roughly the same as 'copy-on-write' behaviour.…
Soren Havelund Welling
  • 1,823
  • 1
  • 16
  • 23
0
votes
1 answer

Cluster a column

I have a column I want to cluster: df = pl.DataFrame({"values": [0.1, 0.5, 0.7, -0.2, 0.4, -0.7, 0.05]}) shape: (7, 1) ┌────────┐ │ values │ │ --- │ │ f64 │ ╞════════╡ │ 0.1 │ ├╌╌╌╌╌╌╌╌┤ │ 0.5 │ ├╌╌╌╌╌╌╌╌┤ │ 0.7 │ ├╌╌╌╌╌╌╌╌┤ │ -0.2 …
Sigi
  • 53
  • 8
0
votes
0 answers

Polars-python. Is it possible to read multiple files with globbing patterns using adlfs as storage_options?

Loading multiple files using glob patterns works if we run it on a local filesystem, as written in the documentation. However, if I try to load several files at once from Azure Data Lake Gen2, it only loads into the DataFrame the first file that…
Javi Hernandez
  • 314
  • 8
  • 17
0
votes
1 answer

Use f-string in polars dataframe with a loop

I am trying to create a list of new columns based on the latest column. I can achieve this by using with_columns() and simple multiplication. Since I want a long list of new columns, I am thinking of using a loop with an f-string to do it. However, I…
codedancer
  • 1,504
  • 9
  • 20
0
votes
1 answer

Polars - how to parallelize lambda that uses only Polars expressions?

This runs on a single core, despite (seemingly) not using any non-Polars code. What am I doing wrong? (The goal is to convert the list in the doc_ids field of every row into its string representation, s.t. [1, 2, 3] (list[int]) -> '[1, 2, 3]'…
Tim
  • 236
  • 2
  • 8
0
votes
1 answer

How to check if dataframe columns contain any information except NULL/EMPTY and show them in a new column in python polars?

I have a dataframe as follows: pl.DataFrame({'last_name':['Unknown','Mallesham',np.nan,'Bhavik','Unknown'], 'first_name_or_initial':['U',np.nan,'TRUE','yamulla',np.nan], …
myamulla_ciencia
  • 1,282
  • 1
  • 8
  • 30
0
votes
1 answer

Broadcast a single cell value to a column

In pandas it is possible to broadcast a single value to an entire column or even a slice: frame.loc[start_index:stop_index, 'a'] = frame.loc[some_row_index, 'a'] that is, a single value being broadcast to a Series. I tried something similar with…
sobek
  • 1,386
  • 10
  • 28
0
votes
1 answer

Overwrite a slice of a timeseries with a value

I have some timeseries data in the form of a pl.DataFrame object with a datetime col and a data col. I would like to correct an error in the data that occurs during a distinct time range by overwriting it with a value. Now in pandas, one would use…
sobek
  • 1,386
  • 10
  • 28
0
votes
0 answers

Connectorx Server requested a connection to an alternative address in azure pipeline

I am connecting to SQL Server using connectorx and polars. Everything works correctly locally, without any errors. However, when using Azure Pipelines to run the code, I get the following error: "result = _read_sql(RuntimeError: Server requested a…
tommyt
  • 309
  • 5
  • 15
0
votes
1 answer

Does `pl.concat([lazyframe1, lazyframe2])` strictly preserve the order of the input dataframes?

Suppose I create a polars LazyFrame from a list of csv files using pl.concat(): df = pl.concat([pl.scan_csv(file) for file in ['file1.csv', 'file2.csv']]) Is the data in the resulting dataframe guaranteed to have the exact order of the input files,…
DataWiz
  • 401
  • 6
  • 14
0
votes
1 answer

Specify string format for numeric during conversion to pl.Utf8

Is there any way to specify a format specifier if, for example, casting a pl.Float32, without resorting to complex searches for the period character? As in something like: s = pl.Series([1.2345, 2.3456, 3.4567]) s.cast(pl.Utf8, fmt="%0.2f") # fmt…
NedDasty
  • 192
  • 1
  • 8
0
votes
1 answer

Is it semantically possible to optimize LazyFrame -> Fill Null -> Cast to Categorical?

Here is a trivial benchmark based on a real-life workload. import gc import time import numpy as np import polars as pl df = ( # I have a dataframe like this from reading a csv. pl.Series( name="x", values=np.random.choice( …
0
votes
1 answer

LazyFrame memory usage (polars.scan_csv vs polars.read_csv, single threaded)

I have some sample csv files and two programs to read/filter/concat the csvs. Here is the LazyFrame version of the code: import os os.environ["POLARS_MAX_THREADS"] = "1" import polars as pl df = pl.concat( [ …