Tokenizing the text and count in a dataframe based on other column

Question

I need to tokenize the data, but it seems really confusing. I have data like this:

TEXT               Author               Date
This is a Cat       Jane                 1.01.1997
This is a Dog       Sara                 1.02.2009
I have a cat        Lesner               5.07.2001

I need output like this:

Date:
1.01.1997    This
1.01.1997    is
1.01.1997     a
1.01.1997    cat
.
.
.
.

Is there any way to achieve output like this?

jezrael · Accepted Answer · 2020-02-12T11:25:24.293

Use Series.str.split together with Series.explode (available for Series in pandas 0.25+):

# Index by Date, split each TEXT on whitespace, then emit one row per token
s = df.set_index('Date')['TEXT'].str.split().explode()
print(s)
Date
1.01.1997    This
1.01.1997      is
1.01.1997       a
1.01.1997     Cat
1.02.2009    This
1.02.2009      is
1.02.2009       a
1.02.2009     Dog
5.07.2001       I
5.07.2001    have
5.07.2001       a
5.07.2001     cat
Name: TEXT, dtype: object

If you want a two-column DataFrame, add Series.reset_index:

# Move Date out of the index into a column and name the token column 'text'
df = s.reset_index(name='text')
print(df)
         Date  text
0   1.01.1997  This
1   1.01.1997    is
2   1.01.1997     a
3   1.01.1997   Cat
4   1.02.2009  This
5   1.02.2009    is
6   1.02.2009     a
7   1.02.2009   Dog
8   5.07.2001     I
9   5.07.2001  have
10  5.07.2001     a
11  5.07.2001   cat
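
For reference, the two steps above can be put together in a self-contained sketch. The sample rows are copied from the question. The names `tokens` and `counts` are illustrative, and the final `groupby` step is only one possible reading of the "count" in the title (occurrences of each token per date), which the question itself does not spell out.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "TEXT": ["This is a Cat", "This is a Dog", "I have a cat"],
    "Author": ["Jane", "Sara", "Lesner"],
    "Date": ["1.01.1997", "1.02.2009", "5.07.2001"],
})

# Split each TEXT on whitespace and emit one row per token, keyed by Date
s = df.set_index("Date")["TEXT"].str.split().explode()

# Turn the Series into a two-column DataFrame (Date, text)
tokens = s.reset_index(name="text")

# Assumption: "count" means how often each token occurs on each date
counts = tokens.groupby(["Date", "text"]).size().reset_index(name="count")

print(tokens)
print(counts)
```

Note that the split is case-sensitive, so `Cat` and `cat` stay distinct tokens unless the text is lower-cased first (for example with `.str.lower()` before `.str.split()`).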

edited Feb 12 '20 at 11:25

answered Feb 12 '20 at 11:13


Looks like: `df.set_index('Date')['TEXT'].str.split().explode()` might match the output better... but it's not clear whether the OP wants a DF from it or whether a Series will suffice. – Jon Clements Feb 12 '20 at 11:24
@jezrael can you also explain how we can remove the duplicates? For example `2011-03-17 [', Hinterher, ist, man, immer, schlauer:, Hät, man, ist']` – s_khan92 Feb 12 '20 at 14:03
@s_khan92 - Add `df = df.drop_duplicates()` after my solution – jezrael Feb 12 '20 at 14:04
Actually, I don't want to delete all the duplicate items. I just want to delete duplicates based on the date. – s_khan92 Feb 12 '20 at 14:05
@s_khan92 - Yes, it deletes by date: when no column names are specified, rows are compared on both columns, so it is the same as `df = df.drop_duplicates(['Date','text'])` – jezrael Feb 12 '20 at 14:06
I tried that already but got this error: `SystemError: returned a result with an error set` – s_khan92 Feb 12 '20 at 14:09
@s_khan92 - Maybe `df = df.drop_duplicates(subset=['Date','text'])`, check [this](https://stackoverflow.com/questions/48131812/get-unique-values-of-multiple-columns-as-a-new-dataframe-in-pandas/48131825#48131825) – jezrael Feb 12 '20 at 14:10
`TypeError: drop_duplicates() got an unexpected keyword argument 'subset'` – s_khan92 Feb 12 '20 at 14:12
@s_khan92 - So your solution is `s = df.set_index('Date')['TEXT'].str.split().explode()`, then `df = s.reset_index(name='text')` and `print(df)`, and last `df = df.drop_duplicates()`? – jezrael Feb 12 '20 at 14:13
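
A short sketch of the de-duplication discussed in the comments. One plausible cause of the `TypeError` above, which the thread does not confirm, is that `drop_duplicates` was called on the exploded Series rather than on the DataFrame: `Series.drop_duplicates` has no `subset` parameter, while `DataFrame.drop_duplicates` does. The two sample rows and the names `tokens`, `deduped` and `series_error` are hypothetical.

```python
import pandas as pd

# Hypothetical rows containing a repeated token within one date
df = pd.DataFrame({
    "TEXT": ["ist man ist", "ist man"],
    "Date": ["2011-03-17", "2011-03-18"],
})

s = df.set_index("Date")["TEXT"].str.split().explode()

# Series.drop_duplicates accepts no `subset` argument, so this raises TypeError
try:
    s.drop_duplicates(subset=["Date", "text"])
    series_error = None
except TypeError as exc:
    series_error = exc

# On the two-column DataFrame, `subset` is valid and removes tokens
# repeated within the same date while keeping them across dates
tokens = s.reset_index(name="text")
deduped = tokens.drop_duplicates(subset=["Date", "text"])

print(series_error)
print(deduped)
```

Because both columns take part in the comparison, `ist` is dropped only where it repeats under the same date; the same word on a different date is kept.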
