I have a polars dataframe as
pl.DataFrame({'doc_id':[
['83;45;32;65;13','7;8;9'],
['9;4;5','4;2;7;3;5;8;10;11'],
['1000;2000','76;34;100001;7474;2924'],
['100;200','200;100'],
['3;4;6;7;10;11','1;2;3;4;5']
]})
each list consist of document id's separated with semicolon, if any of list element has got higher semicolon its index needs to be found and create a new column as len_idx_at and fill in with the index number.
For example:
['83;45;32;65;13','7;8;9']
This list is having two elements, in a first element there are about 4 semicolon hence its has 5 documents, similarly in a second element there are about 2 semicolons and it means it has 3 documents.
Here we should consider an index of a highest document counts element in the above case - it will be 0 index because it has 4 semicolons'.
the expected output as:
shape: (5, 2)
┌─────────────────────────────────────┬────────────┐
│ doc_id ┆ len_idx_at │
│ --- ┆ --- │
│ list[str] ┆ i64 │
╞═════════════════════════════════════╪════════════╡
│ ["83;45;32;65;13", "7;8;9"] ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["9;4;5", "4;2;7;3;5;8;10;11"] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["1000;2000", "76;34;100001;7474... ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["100;200", "200;100"] ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["3;4;6;7;10;11", "1;2;3;4;5"] ┆ 0 │
└─────────────────────────────────────┴────────────┘
In case of all elements in a list has equal semicolon counts, zero index would be preferred as showed in above output