How to get an index of maximum count of a required string in a list column of polars data frame?

Question

I have a polars dataframe as

pl.DataFrame({'doc_id':[
    ['83;45;32;65;13','7;8;9'],
    ['9;4;5','4;2;7;3;5;8;10;11'],
    ['1000;2000','76;34;100001;7474;2924'],
    ['100;200','200;100'],
    ['3;4;6;7;10;11','1;2;3;4;5']
]})

each list consist of document id's separated with semicolon, if any of list element has got higher semicolon its index needs to be found and create a new column as len_idx_at and fill in with the index number.

For example:

['83;45;32;65;13','7;8;9']

This list is having two elements, in a first element there are about 4 semicolon hence its has 5 documents, similarly in a second element there are about 2 semicolons and it means it has 3 documents.

Here we should consider an index of a highest document counts element in the above case - it will be 0 index because it has 4 semicolons'.

the expected output as:

shape: (5, 2)
┌─────────────────────────────────────┬────────────┐
│ doc_id                              ┆ len_idx_at │
│ ---                                 ┆ ---        │
│ list[str]                           ┆ i64        │
╞═════════════════════════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"]         ┆ 0          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"]      ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"]              ┆ 0          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"]      ┆ 0          │
└─────────────────────────────────────┴────────────┘

In case of all elements in a list has equal semicolon counts, zero index would be preferred as showed in above output

score 1 · Accepted Answer · answered Oct 26 '22 at 18:59

df.with_columns(
    [
        # Get first and second list of documents as string element.
        pl.col("doc_id").arr.get(0).alias("doc_list1"),
        pl.col("doc_id").arr.get(1).alias("doc_list2"),
    ]
)
.with_columns(
    [
        # Split each doc list element on ";" and count number of splits.
pl.col("doc_list1").str.split(";").arr.lengths().alias("doc_list1_count"),
        pl.col("doc_list2").str.split(";").arr.lengths().alias("doc_list2_count")
    ]
)
.with_column(
    # Get the wanted index based on which list is longer.
    pl.when(
        pl.col("doc_list1_count") >= pl.col("doc_list2_count")
    )
    .then(0)
    .otherwise(1)
    .alias("len_idx_at")
)

Out[11]: 
shape: (5, 6)
┌─────────────────────────────────────┬────────────────┬────────────────────────┬─────────────────┬─────────────────┬────────────┐
│ doc_id                              ┆ doc_list1      ┆ doc_list2              ┆ doc_list1_count ┆ doc_list2_count ┆ len_idx_at │
│ ---                                 ┆ ---            ┆ ---                    ┆ ---             ┆ ---             ┆ ---        │
│ list[str]                           ┆ str            ┆ str                    ┆ u32             ┆ u32             ┆ i64        │
╞═════════════════════════════════════╪════════════════╪════════════════════════╪═════════════════╪═════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"]         ┆ 83;45;32;65;13 ┆ 7;8;9                  ┆ 5               ┆ 3               ┆ 0          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"]      ┆ 9;4;5          ┆ 4;2;7;3;5;8;10;11      ┆ 3               ┆ 8               ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1000;2000      ┆ 76;34;100001;7474;2924 ┆ 2               ┆ 5               ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"]              ┆ 100;200        ┆ 200;100                ┆ 2               ┆ 2               ┆ 0          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"]      ┆ 3;4;6;7;10;11  ┆ 1;2;3;4;5              ┆ 6               ┆ 5               ┆ 0          │
└─────────────────────────────────────┴────────────────┴────────────────────────┴─────────────────┴─────────────────┴────────────┘

```pl.col("doc_id").arr.get(0).alias("doc_list1")``` here how should i specify to retrive all the elements from list — myamulla_ciencia, Oct 27 '22 at 00:38
Here the aim is to get the highest length element from a list — myamulla_ciencia, Oct 27 '22 at 01:01

How to get an index of maximum count of a required string in a list column of polars data frame?

1 Answers1