
I have a DataFrame, and one column is "lang" for "language."

Two different values in this column are "en" for "English" and "en-gb" for "British English."

There are numerous other values in this column, including "es" for "Spanish," "fr" for "French," and so on.

So it looks something like this:

user        lang       id

joe         en         77788
jim         en-gb      23323
pedro       es         12134
tom         en         53892
juan        es         24434
phillippe   fr         04211
george      en-gb      99999

For the purposes of my analysis, I want to count the 'en' and 'en-gb' values together as the same "en" or "English" value. Perhaps I could put just this column into a Series and then count them as one, or I could replace the "en-gb" values with "en."

TJE

3 Answers


If you only need the first two letters, you can use string slicing, i.e. .str[:2], so that regional variants of a language are treated as one.

df['lang'].str[:2]
0    en
1    en
2    es
3    en
4    es
5    fr
6    en
Name: lang, dtype: object

Now that you have the Series, store it in a column, for example:

df['new'] = df['lang'].str[:2]

You can then merge or group using new as the key. Hope it helps.
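As a minimal sketch of the steps above, assuming the sample data from the question is rebuilt by hand (the id values are kept as strings here so the leading zero survives, which is an assumption about how the data was loaded), the new column can be counted with value_counts:

```python
import pandas as pd

# Sample data copied from the question; ids are stored as strings (assumption).
df = pd.DataFrame({
    "user": ["joe", "jim", "pedro", "tom", "juan", "phillippe", "george"],
    "lang": ["en", "en-gb", "es", "en", "es", "fr", "en-gb"],
    "id": ["77788", "23323", "12134", "53892", "24434", "04211", "99999"],
})

# Keep the original column and store the two-letter code in a new one.
df["new"] = df["lang"].str[:2]

# Count users per merged language code.
counts = df["new"].value_counts()
print(counts)
```

The original lang column still holds "en-gb", so the regional information is not lost.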

Bharath M Shetty

You can change the column using .str[:2] as Bharath suggested. If you want to keep the column unchanged, you can use groupby on the sliced values directly. Say you want to find the count of users for each language:

df_new = df.groupby(df.lang.str[:2]).user.count()

Or

df_new = df.groupby(df.lang.str.split('-').str[0]).user.count()

will return

lang
en    4
es    2
fr    1

Your original data is unaffected.
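The two groupby keys above agree on the question's data, but they can differ when a language code is longer than two letters. A small sketch, using hypothetical values that are not in the original data ("fil-ph" and "fi"):

```python
import pandas as pd

# Hypothetical values for illustration only; "fil-ph" and "fi" are not in the question.
lang = pd.Series(["en", "en-gb", "fil-ph", "fi"], name="lang")

# Slicing keeps the first two characters, so "fil-ph" collapses to "fi".
by_slice = lang.str[:2]

# Splitting on the hyphen keeps the whole primary code, so "fil" stays distinct.
by_split = lang.str.split("-").str[0]

print(by_slice.tolist())
print(by_split.tolist())
```

If every code in the column is two letters plus an optional region suffix, as in the question, either key works.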

Vaishali

Using replace:

df=df.replace({'en-gb':'en'})
df
Out[358]: 
        user lang     id
0        joe   en  77788
1        jim   en  23323
2      pedro   es  12134
3        tom   en  53892
4       juan   es  24434
5  phillippe   fr   4211
6     george   en  99999
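Called on the whole DataFrame, replace looks for 'en-gb' in every column. A minimal sketch, assuming you only want to touch the lang column, applies replace to that Series instead (the sample frame is rebuilt by hand here):

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "user": ["joe", "jim", "pedro", "tom", "juan", "phillippe", "george"],
    "lang": ["en", "en-gb", "es", "en", "es", "fr", "en-gb"],
})

# Replace only within the lang column; other columns are left alone.
df["lang"] = df["lang"].replace({"en-gb": "en"})

print(df["lang"].value_counts())
```

Unlike the slicing approach, this overwrites the original values, so copy the column first if you still need "en-gb" later.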
BENY