How is Python Polars treating the index?

Question

I want to try out Polars in Python by concatenating several DataFrames that are read from JSON files. When I set the index to date (via pandas) and look at lala1.head(), the date column is gone, so I effectively lose the index. Is there a better solution, or do I need to sort by date, which does essentially the same thing as setting the index to date?

import polars as pl

quarterly_balance_df = pl.read_json('../AAPL/single_statements/1985-09-30-quarterly_balance.json')


q1 = quarterly_balance_df.lazy().with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))
quarterly_balance_df = q1.collect()
q2 = quarterly_balance_df.lazy().with_columns(pl.col("fillingDate").str.strptime(pl.Date, "%Y-%m-%d"))
quarterly_balance_df = q2.collect()
q3 = quarterly_balance_df.lazy().with_columns(pl.col("acceptedDate").str.strptime(pl.Date, "%Y-%m-%d"))
quarterly_balance_df = q3.collect()

quarterly_balance_df2 = pl.read_json('../AAPL/single_statements/1986-09-30-quarterly_balance.json')

q1 = quarterly_balance_df2.lazy().with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))
quarterly_balance_df2 = q1.collect()
q2 = quarterly_balance_df2.lazy().with_columns(pl.col("fillingDate").str.strptime(pl.Date, "%Y-%m-%d"))
quarterly_balance_df2 = q2.collect()
q3 = quarterly_balance_df2.lazy().with_columns(pl.col("acceptedDate").str.strptime(pl.Date, "%Y-%m-%d"))
quarterly_balance_df2 = q3.collect()

lala1 = pl.from_pandas(quarterly_balance_df.to_pandas().set_index('date'))
lala2 = pl.from_pandas(quarterly_balance_df2.to_pandas().set_index('date'))

test = pl.concat([lala1,lala2])

score 12 · Accepted Answer · answered Apr 05 '22 at 23:08

Polars intentionally eliminates the concept of an index. Indeed, the Polars "Cookbook" goes so far as to state this about indexes:

They are not needed. Not having them makes things easier. Convince me otherwise

Indeed, the from_pandas method ignores any index. For example, if we start with this data:

import polars as pl

df = pl.DataFrame(
    {
        "key": [1, 2],
        "var1": ["a", "b"],
        "var2": ["r", "s"],
    }
)
print(df)

shape: (2, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ str  │
╞═════╪══════╪══════╡
│ 1   ┆ a    ┆ r    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ b    ┆ s    │
└─────┴──────┴──────┘

Now, if we export this Polars DataFrame to pandas, set key as the index in pandas, and then re-import it into Polars, the 'key' column disappears.

pl.from_pandas(df.to_pandas().set_index("key"))

shape: (2, 2)
┌──────┬──────┐
│ var1 ┆ var2 │
│ ---  ┆ ---  │
│ str  ┆ str  │
╞══════╪══════╡
│ a    ┆ r    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b    ┆ s    │
└──────┴──────┘

This is why your date column disappeared.

In Polars, you can sort, summarize, or join by any set of columns in a DataFrame. No need to declare an index.

I recommend looking through the Polars Cookbook. It's a great place to start. And there's a section for those coming from Pandas.

Hi, thanks for your answer. I went through the cookbook but must have overlooked that exact part. — daeda, Apr 06 '22 at 19:09
