
I am trying to loop through a Polars recordset using the following code:


import polars as pl

mydf = pl.DataFrame(
    {"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
     "Name": ["John", "Joe", "James"]})

print(mydf)

│start_date  ┆ Name  │
│ ---        ┆ ---   │
│ str        ┆ str   │
╞════════════╪═══════╡
│ 2020-01-02 ┆ John  │
│ 2020-01-03 ┆ Joe   │
│ 2020-01-04 ┆ James │

for row in mydf.rows():
    print(row)

('2020-01-02', 'John')
('2020-01-03', 'Joe')
('2020-01-04', 'James')

Is there a way to specifically reference 'Name' using the named column as opposed to the index? In Pandas this would look something like:

import pandas as pd

mydf = pd.DataFrame(
    {"start_date": ["2020-01-02", "2020-01-03", "2020-01-04"],
     "Name": ["John", "Joe", "James"]})

for index, row in mydf.iterrows():
    mydf['Name'][index]

'John'
'Joe'
'James'
kristianp
John Smith

2 Answers


You can ask for the rows to be returned with column names:

for row in mydf.rows(named=True):
    print(row)

It will give you a dict:

{'start_date': '2020-01-02', 'Name': 'John'}
{'start_date': '2020-01-03', 'Name': 'Joe'}
{'start_date': '2020-01-04', 'Name': 'James'}

You can then look up the value with row['Name'].

Note that:

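  • Previous versions returned a namedtuple instead of a dict, in which case you would use row.Name.
  • iter_rows is less memory-intensive than rows, because it yields one row at a time instead of building the full list up front.
  • Overall, iterating through the data row by row is not recommended. The documentation says: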

Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.

0x26res
  • Hi @0x26res. Thank you for taking the time to explain this to me. If I use mydf.iterrows(), as in `for row in mydf.iterrows(named=True): row['Name']`, I get the error `Traceback (most recent call last): File "", line 2, in TypeError: tuple indices must be integers or slices, not str` – John Smith Feb 02 '23 at 13:36
  • I am also happy to achieve the same result via a columnar method but have not seen much documentation around this particular type of iteration in Polars. Thank you again for your help – John Smith Feb 02 '23 at 13:37
  • As mentioned, previous versions returned a namedtuple instead of a dict. Try `row.Name` – 0x26res Feb 02 '23 at 14:10
  • @JohnSmith iterating through rows isn't documented well in Polars because it's highly discouraged: it basically circumvents all the optimization. It's like buying a Ferrari and then asking how to drive it really slowly and quietly. What result are you ultimately after? – Dean MacGregor Feb 02 '23 at 22:26
  • Hi @DeanMacGregor, I have a table which has names and dates. Each row has variables that I "inject" into the SQL, which in the case above would generate 3 SQL statements and 3 reports. I don't have access to create stored procedures, and I can't run everything in one go due to spool space issues. Thank you for your question – John Smith Feb 03 '23 at 08:06

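You would use select for that. Note that iterating over a DataFrame directly yields its columns, not its rows, so index the column before looping:

names = mydf.select(["Name"])
for name in names["Name"]:
    print(name)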
Kien Truong
  • Hi @Kien Truong. Thank you very much for your quick reply. I used Name only as an example. I actually want both the date and the name, as these are then passed to an SQL statement in the real code. Each row in this table would generate a separate SQL statement – John Smith Feb 02 '23 at 13:27