I need to tokenize the data, but it seems really confusing. I have data like this:

TEXT               Author               Date
This is a Cat       Jane                 1.01.1997
This is a Dog       Sara                 1.02.2009
I have a cat        Lesner               5.07.2001

I need output like this:

Date:
1.01.1997    This
1.01.1997    is
1.01.1997    a
1.01.1997    cat
...

Is there any way to achieve output like this?

s_khan92

1 Answer

Use Series.str.split with Series.explode (Series.explode is available in pandas 0.25+):

s = df.set_index('Date')['TEXT'].str.split().explode()
print(s)
Date
1.01.1997    This
1.01.1997      is
1.01.1997       a
1.01.1997     Cat
1.02.2009    This
1.02.2009      is
1.02.2009       a
1.02.2009     Dog
5.07.2001       I
5.07.2001    have
5.07.2001       a
5.07.2001     cat
Name: TEXT, dtype: object
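
For anyone who wants to run this end to end, here is a minimal, self-contained sketch. The sample frame is rebuilt from the table in the question, so its column names and values are an assumption about the real data:

```python
import pandas as pd

# Sample frame rebuilt from the question's table (an assumption about the real data).
df = pd.DataFrame({
    "TEXT": ["This is a Cat", "This is a Dog", "I have a cat"],
    "Author": ["Jane", "Sara", "Lesner"],
    "Date": ["1.01.1997", "1.02.2009", "5.07.2001"],
})

# Move Date into the index, split each text on whitespace into a list of tokens,
# then explode so every token gets its own row under the same Date.
s = df.set_index("Date")["TEXT"].str.split().explode()
print(s)
```

Calling str.split() with no arguments splits on any run of whitespace, so repeated spaces in the text do not produce empty tokens.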

If you want a two-column DataFrame, add Series.reset_index:

df = s.reset_index(name='text')
print(df)
         Date  text
0   1.01.1997  This
1   1.01.1997    is
2   1.01.1997     a
3   1.01.1997   Cat
4   1.02.2009  This
5   1.02.2009    is
6   1.02.2009     a
7   1.02.2009   Dog
8   5.07.2001     I
9   5.07.2001  have
10  5.07.2001     a
11  5.07.2001   cat
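
As an alternative route to the same two-column result, DataFrame.explode (also added in pandas 0.25) can be applied to a column of token lists. This is a sketch of a different technique than the one above; the sample frame and the name `tokens` are assumptions for illustration:

```python
import pandas as pd

# Sample frame rebuilt from the question's table (an assumption about the real data).
df = pd.DataFrame({
    "TEXT": ["This is a Cat", "This is a Dog", "I have a cat"],
    "Author": ["Jane", "Sara", "Lesner"],
    "Date": ["1.01.1997", "1.02.2009", "5.07.2001"],
})

# Add a column holding the list of tokens, explode it into one row per token,
# keep only the two columns of interest, and renumber the rows from 0.
tokens = (
    df.assign(text=df["TEXT"].str.split())
      .explode("text")[["Date", "text"]]
      .reset_index(drop=True)
)
print(tokens)
```

This keeps Date as an ordinary column throughout, which may be convenient if other columns such as Author should be carried along as well.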
jezrael
  • Looks like: `df.set_index('Date')['TEXT'].str.split().explode()` might match the output better... but it's not clear whether the OP wants a DF from it or whether a Series will suffice. – Jon Clements Feb 12 '20 at 11:24
  • @jezrael can you also explain how we can remove the duplicates? For example `2011-03-17 [', Hinterher, ist, man, immer, schlauer:, Hät, man, ist']` – s_khan92 Feb 12 '20 at 14:03
  • @s_khan92 - Add `df = df.drop_duplicates()` after my solution – jezrael Feb 12 '20 at 14:04
  • Actually I don't want to delete all the duplicate items... I just want to delete based on the date. – s_khan92 Feb 12 '20 at 14:05
  • @s_khan92 - yes, it deletes by date, because when no column names are specified it deduplicates on both columns; it is the same as `df = df.drop_duplicates(['Date','text'])` – jezrael Feb 12 '20 at 14:06
  • I tried that already but got this error: `SystemError: returned a result with an error set` – s_khan92 Feb 12 '20 at 14:09
  • @s_khan92 - Maybe `df = df.drop_duplicates(subset=['Date','text'])`, check [this](https://stackoverflow.com/questions/48131812/get-unique-values-of-multiple-columns-as-a-new-dataframe-in-pandas/48131825#48131825) – jezrael Feb 12 '20 at 14:10
  • `TypeError: drop_duplicates() got an unexpected keyword argument 'subset'` – s_khan92 Feb 12 '20 at 14:12
  • @s_khan92 - So your solution is `s = df.set_index('Date')['TEXT'].str.split().explode()`, then `df = s.reset_index(name='text')` and `print(df)`, and last `df = df.drop_duplicates()`? – jezrael Feb 12 '20 at 14:13
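
The per-date deduplication discussed in the comments can be sketched as below. The sample sentences are hypothetical, loosely modelled on the German example in the comment. One possible explanation for the `subset` TypeError is that drop_duplicates was called on the exploded Series rather than on the DataFrame: Series.drop_duplicates does not accept a `subset` argument, while DataFrame.drop_duplicates does. That is a guess, since the thread does not show the failing code.

```python
import pandas as pd

# Hypothetical sample: words repeat inside the first date and across dates.
df = pd.DataFrame({
    "TEXT": ["Hinterher ist man immer schlauer man ist", "man ist"],
    "Date": ["2011-03-17", "2011-03-18"],
})

# Tokenize as in the answer, then turn the Series into a two-column DataFrame.
s = df.set_index("Date")["TEXT"].str.split().explode()
tokens = s.reset_index(name="text")

# Drop repeated (Date, text) pairs: a word is kept once per date,
# and the same word on a different date is not affected.
deduped = tokens.drop_duplicates(subset=["Date", "text"])
print(deduped)
```

Because the frame has only the Date and text columns, `tokens.drop_duplicates()` with no arguments gives the same result here; naming the subset just makes the intent explicit.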