-1

Thanks for the prompt responses. Based on the responses, I have modified the question and also provided numeric code example.

I am from Market Research industry. We analyse survey databases. One of the requirements of the survey tables is that blank rows & columns should not get suppressed. Blank rows and / or columns may result when we are generating table on filtered database.

To avoid this zero suppression, we create a blank table with all rows / columns, then create actual table using Pandas and update the blank table with the actual table numbers using Pandas pd.update function. This way, we retain rows / columns with zero estimates. My sincere apologies for not pasting code as this is my first question on Stack Overflow.

Here's the example dataframe:

dict = { 'state': 
    ['state 1', 'state 2', 'state 3', 'state 4', 'state 5', 'state 6', 'state 7', 'state 8', 'state 9', 'state 10'],
     'development': ['Low', 'Medium', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Medium', 'Low', 'Medium'],
     'investment': ['50-500MN', '<50MN', '<50MN', '<50MN', '500MN+', '50-500MN', '<50MN', '50-500MN', '<50MN', '<50MN'],
     'population': [22, 19, 25, 24, 19, 21, 33, 36, 22, 36],
     'gdp': [18, 19, 29, 23, 22, 19, 35, 18, 26, 27]
    }

I convert it into a dataframe:

df = pl.DataFrame(dict)

I filter it using a criteria:

df2 = df.filter(pl.col('development') != 'High')

And then generate a pivot table

df2.pivot(index='development', columns='investment', values='gdp')

The resulting table has one row suppressed ('High' development) and one column suppressed ('>500MN' investment).

The solution I am looking for is to update the blank table with all rows and columns with the pivot table generated. Wherever there are no values, they would be replaced with a zero.

  • Based on your updates: Is the result you're after `df.pivot(index="development", columns="investment", values="gdp")` but with the `High` row set to `null`? – jqurious Jan 27 '23 at 16:31
  • Please help me in rolling back the question. I can't see an option to roll it back. Thanks. – Suresh Nimbalkar Jan 28 '23 at 02:34
  • yes @jqurious. But I would not like to do it manually as I may not know which columns are selected. Hence, I don't want to hard-code the columns. I would also like to retain blank column of '500MN+' which gets suppressed in Pivot. – Suresh Nimbalkar Feb 02 '23 at 14:10
  • I don't think you need to hard-code any columns? Perhaps I'm not understanding correctly. I've added an answer showing what I mean. – jqurious Feb 02 '23 at 14:24

2 Answers2

1

You haven't written any code, so I won't either, but you can do what's suggested in https://github.com/pola-rs/polars/issues/6211

ignoring_gravity
  • 6,677
  • 4
  • 32
  • 65
  • This link is helpful, but is specific to the example. I would not know in advance what variables / tables are required by the library user. In such a case, my solution has to be generic, like Pandas .update, which works like a charm. – Suresh Nimbalkar Jan 27 '23 at 17:20
1

What you want is a left join.

Let's say you have:

studentsdf=pl.DataFrame({'Name':students})
datadf=pl.DataFrame({'name':[x[0] for x in data], 'age':[x[1] for x in data]})

Then you'd do:

studentsdf.join(datadf, on='name', how='left')

shape: (4, 2)
┌────────┬──────┐
│ name   ┆ age  │
│ ---    ┆ ---  │
│ str    ┆ i64  │
╞════════╪══════╡
│ Alex   ┆ 10   │
│ Bob    ┆ 12   │
│ Clarke ┆ null │
│ Darren ┆ 13   │
└────────┴──────┘

If you want to "update" the studentsdf with that new info you'd just assign it like this:

studentsdf=studentsdf.join(datadf, on='name', how='left')

Even though that implies you're making a copy, under the hood, polars is just moving memory pointers around not copying all the underlying data.

Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
  • Thanks @Dean. This is the approach I am adopting to design a function to update blank data table. I need to find a way to address the missing columns. – Suresh Nimbalkar Jan 27 '23 at 17:24