3

I have following dataframe (called items) for example:

| index | itemID | maintopic | subtopics          |
|:----- |:------:|:---------:| ------------------:|
| 1     | 235    | FBR       | [FZ, 1RH, FL]      |
| 2     | 1787   | NaN       | [1RH, YRS, FZ, FL] |
| 3     | 2454   | NaN       | [FZX, 1RH, FZL]    |
| 4     | 3165   | NaN       | [YHS]              |

I would like to fill the NaN-Values in the maintopic-column with the first element of the subtopics list which starts with a letter. Does someone has an idea? (Question No 1)

I tried this, but it didn´t work:

import pandas as pd
import string
alphabet = list(string.ascii_lowercase)
    
items['maintopic'] = items['maintopic'].apply(lambda x : items['maintopic'].fillna(items['subtopics'][x][0]) if items['subtopics'][x][0].lower().startswith(tuple(alphabet)) else x)

Advanced (Question No 2): Even better would be to have a look at all elements of the subtopics list and if there are more elements which have the first letter or even the first and the second letter in common, then I would like to take this. For example in line 2 there is FZ and FL, so i would like to fill the maintopic in this row with an F. And in line 3 there is FZX and FZL, then I would like to fill the maintopic with FZ. But if this is way too complicated then I would be also very happy with an answer to Question No 1.

I appreciate any help!

leolumpy
  • 63
  • 6

3 Answers3

1

Try:

from itertools import chain, combinations


def commonprefix(m):
    "Given a list of pathnames, returns the longest common leading component"
    if not m:
        return ""
    s1 = min(m)
    s2 = max(m)
    for i, c in enumerate(s1):
        if c != s2[i]:
            return s1[:i]
    return s1


def powerset(iterable, n=0):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(n, len(s) + 1))


def choose(x):
    if not isinstance(x, list):
        return x

    if len(x) == 1:
        return x[0]

    filtered = [v for v in x if not v[0].isdigit()]
    if not filtered:
        return np.nan

    longest = ""
    for s in powerset(filtered, 2):
        pref = commonprefix(s)
        if len(pref) > len(longest):
            longest = pref

    return filtered[0] if longest == "" else longest


m = df["maintopic"].isna()
df.loc[m, "maintopic"] = df.loc[m, "subtopics"].apply(choose)
print(df)

Prints:

   index  itemID maintopic           subtopics
0      1     235       FBR       [FZ, 1RH, FL]
1      2    1787         F  [1RH, YRS, FZ, FL]
2      3    2454        FZ     [FZX, 1RH, FZL]
3      4    3165       YHS               [YHS]

EDIT: Added checks for a list/float.

Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • Hi, thanks a lot for your answer, I appreciate it! Unfortunately it gives me an error back if I use it on my whole dataframe: "TypeError: 'float' object is not iterable". Do you have an idea what the problem could be in my dataframe? – leolumpy Apr 28 '21 at 08:02
  • @leolumpy Which line gives you the error? Is every row in "subtopics" a list? Is every element in this list a string? – Andrej Kesely Apr 28 '21 at 08:04
  • This line here: items.loc[m, "maintopic"] = items.loc[m, "subtopics"].apply(choose) Actually yes, but I am not for 100 % sure. The dataframe is really big with 78000 rows. Can I check it somehow? – leolumpy Apr 28 '21 at 08:10
  • @leolumpy Can you paste the full stack trace? That't only the highest level of it. – Andrej Kesely Apr 28 '21 at 08:11
  • I am trying, but the formatting is really bad. Sorry for that, I am pretty new, it´s my first post on stackoverflow. – leolumpy Apr 28 '21 at 08:26
  • ---> 39 items.loc[m, "maintopic"] = items.loc[m, "subtopics"].apply(choose) -> 4135 mapped = lib.map_infer(values, f, convert=convert_dtype) ---> 23 filtered = [v for v in x if not v[0].isdigit()] TypeError: 'float' object is not iterable. The whole stack is unfortunately too long ... – leolumpy Apr 28 '21 at 08:31
  • @leolumpy I updated my code. Added some checks. – Andrej Kesely Apr 28 '21 at 08:48
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231689/discussion-between-leolumpy-and-andrej-kesely). – leolumpy Apr 28 '21 at 09:01
1

Try this for the first question:

import pandas as pd
import numpy as np


def fill_value(sub):
    for i in sub:
        if i[0].isalpha():
            return i
    return sub[0]


data = {
    'maintopic': ['FBR', np.nan, np.nan, np.nan],
    'subtopic': [['FZ', '1RH', 'FL'] , ['1RH', 'YRS', 'FZ', 'FL'], ['FZX', '1RH', 'FZL'], ['YHS']]
}

df = pd.DataFrame(data)
print('Before\n', df)
df['maintopic'] = df.apply(
    lambda row: fill_value(row['subtopic']) if pd.isnull(row['maintopic']) else row['maintopic'],
    axis=1
)
print('\nAfter\n', df)

Output:

Before
   maintopic            subtopic
0       FBR       [FZ, 1RH, FL]
1       NaN  [1RH, YRS, FZ, FL]
2       NaN     [FZX, 1RH, FZL]
3       NaN               [YHS]

After
   maintopic            subtopic
0       FBR       [FZ, 1RH, FL]
1       YRS  [1RH, YRS, FZ, FL]
2       FZX     [FZX, 1RH, FZL]
3       YHS               [YHS]

You can change the fill_value function to return the desired values to fill NaN values. For now, I have returned the first value of the subtopic that starts with an alphabet.

Dharman
  • 30,962
  • 25
  • 85
  • 135
Ishwar
  • 338
  • 2
  • 11
  • Hi, thanks a lot for your answer! Your solution looks very nice and if I copy paste, then it works absolutely fine! But if I use it on my whole dataframe, it says "TypeError: 'float' object is not iterable". :( Do you have an idea? – leolumpy Apr 28 '21 at 08:00
1

You can do like this : get all substrings that start with first letter in each value in the list of subtopics columns and build a counter and then sort the items in the counter based on their frequency. If the items' frequency is same consider the longest string.

from collections import Counter
from functools import cmp_to_key
def get_main_topic_modified(m, l):
    if m is not np.nan:
       return m
    if len(l) == 1:
       return l[0]
    res = []
    for s in l:
        il = [s[:i+1] for i in range(len(s)-1)]
        res.append(il)
    res = [item for s in res for item in s]
    c = Counter(res)
    d = dict(c)
    l = list(d.items())
    
    l.sort(key=cmp_to_key(lambda x, y: len(y[0])-len(x[0]) if x[1] == y[1] else y[1] - x[1]))
    
    return l[0][0]

df['maintopic'] = df[['maintopic', 'subtopics']].apply(
                       lambda x : get_main_topic_modified(*x), axis = 1)

Output:

  index itemID  maintopic            subtopics
0     1    235        FBR        [FZ, 1RH, FL]
1     2   1787          F   [1RH, YRS, FZ, FL]
2     3   2454         FZ      [FZX, 1RH, FZL]
3     4   3165        YHS                [YHS]
SomeDude
  • 13,876
  • 5
  • 21
  • 44