
First of all, I am very new to pandas and am trying to learn, so thorough answers will be appreciated.

I want to generate a pandas DataFrame representing a map of twitter tag subtoken -> poster, where tag subtoken means anything in the set {hashtagA} U {i | i in split('_', hashtagA)}, starting from a table matching poster -> tweet.

For example:

In [1]: df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])

In [2]: df
Out[2]: 
      0                                     1
0   jim           i was like #yolo_omg to her
1  jack  You are so #yes_omg #best_place_ever
2  neil                     Yo #rofl_so_funny

And from that I want to get something like

      0          1
0   jim          yolo_omg
1   jim          yolo
2   jim          omg
3  jack          yes_omg
4  jack          yes
5  jack          omg
6  jack          best_place_ever
7  jack          best
8  jack          place
9  jack          ever
10 neil          rofl_so_funny
11 neil          rofl
12 neil          so
13 neil          funny
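In other words, the rule I want to apply per hashtag can be written as a small helper (`subtokens` is just a throwaway name here):

```python
def subtokens(hashtag):
    # {hashtagA} U {i | i in split('_', hashtagA)}
    # tags without '_' yield only themselves
    return [hashtag] + (hashtag.split('_') if '_' in hashtag else [])

# subtokens('yolo_omg') -> ['yolo_omg', 'yolo', 'omg']
```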

I managed to construct this monstrosity that actually does the job:

In [143]: df[1].str.findall(r'#([^\s]+)') \
    .apply(pd.Series).stack() \
    .apply(lambda s: [s] + s.split('_') if '_' in s else [s]) \
    .apply(pd.Series).stack().to_frame().reset_index(level=0) \
    .join(df, on='level_0', how='right', lsuffix='_l')[['0','0_l']]

Out[143]: 
        0              0_l
0 0   jim         yolo_omg
  1   jim             yolo
  2   jim              omg
  0  jack          yes_omg
  1  jack              yes
  2  jack              omg
1 0  jack  best_place_ever
  1  jack             best
  2  jack            place
  3  jack             ever
0 0  neil    rofl_so_funny
  1  neil             rofl
  2  neil               so
  3  neil            funny

But I have a very strong feeling that there are much better ways of doing this, especially given that the real dataset is huge.
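For comparison, here is one possible cleaner sketch using `Series.str.extractall` and `Series.explode` (the latter needs pandas >= 0.25; the column names `poster`/`tag` are just placeholders I picked):

```python
import pandas as pd

df = pd.DataFrame([["jim", "i was like #yolo_omg to her"],
                   ["jack", "You are so #yes_omg #best_place_ever"],
                   ["neil", "Yo #rofl_so_funny"]])

# extractall already gives one row per hashtag, keyed by the original row index
tags = df[1].str.extractall(r'#(?P<tag>\S+)')['tag']

# for each tag, build [tag] + subtokens, then expand to one row per item
expanded = tags.apply(lambda s: [s] + s.split('_') if '_' in s else [s]).explode()

# drop the 'match' level and join back to the poster names via the row index
out = (expanded.reset_index(level='match', drop=True)
               .to_frame('tag')
               .join(df[0].rename('poster'))
               .reset_index(drop=True)[['poster', 'tag']])
```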

fakedrake
  • Seems like a reasonable question and I'm surprised no one has answered yet. You might want to edit to break up the lines into smaller pieces to make it more readable. – JohnE Aug 18 '14 at 14:36
  • One initial thought is that you are interspersing string methods with other data munging. I'd have to wonder if you don't just want to do all the string operations in one place with regular python and then read into a dataframe? Not sure if it would be faster but would almost certainly be simpler. – JohnE Aug 18 '14 at 14:56
  • Maybe I should have said that in but I read my data from an sql database with `frame_query` so I have my data in a dataframe from the get-go. As I said I have no strong opinions on which is the best practice. Would it be a good idea to process the data with regular python anyway? I am using that `lambda` anyway... – fakedrake Aug 18 '14 at 15:01
  • More experienced pandas users could probably give you more elegant code, but my suspicion is that you are using the incorrect tool for the job. Your single line of python might work at some stage, but is not going to be maintainable. A couple more lines will give you code with standard python structures that you can then put back in a dataframe if you need it in that format. – Joop Aug 19 '14 at 09:44

2 Answers


pandas indeed has a function for doing this natively: Series.str.findall(). It applies a regex and captures the group(s) you specify in it.

So if I had your dataframe:

df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]])

What I would do first is set the names of your columns, like this:

df.columns = ['user', 'tweet']

Or do it on creation of the dataframe:

df = pd.DataFrame([["jim", "i was like #yolo_omg to her"], ["jack", "You are so #yes_omg #best_place_ever"], ["neil", "Yo #rofl_so_funny"]], columns=['user', 'tweet'])

Then I would simply apply the extract function with a regex:

df['tag'] = df["tweet"].str.findall("(#[^ ]*)")

And I would use a negated character class instead of a positive one, since it is more likely to survive special cases.
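To get all the way to the long poster -> subtoken table the question asks for, one possible continuation is to explode the lists (assuming pandas >= 0.25 for `DataFrame.explode`; note I capture without the leading `#` here, matching the desired output rather than the regex above):

```python
import pandas as pd

df = pd.DataFrame([["jim", "i was like #yolo_omg to her"],
                   ["jack", "You are so #yes_omg #best_place_ever"],
                   ["neil", "Yo #rofl_so_funny"]],
                  columns=['user', 'tweet'])

# one list of tags per tweet
df['tag'] = df['tweet'].str.findall(r'#([^ ]*)')

# one row per tag, then one row per tag-or-subtoken
long = df[['user', 'tag']].explode('tag')
long['tag'] = long['tag'].apply(lambda s: [s] + s.split('_') if '_' in s else [s])
long = long.explode('tag').reset_index(drop=True)
```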

firelynx

How about using list comprehensions in plain python and then reverting back to pandas? It requires a few more lines of code but is perhaps more readable.

import re

# get the hashtags
tags = [re.findall(r'#([^\s]+)', t) for t in df[1]]

# make lists of the tags with subtokens for each user
st = [[t] + [s.split('_') for s in t] for t in tags]
subtokens = [[i for s in poster for i in s] for poster in st]

# put back into a DataFrame with poster names
df2 = pd.DataFrame(subtokens, index=df[0]).stack()

In [250]: df2
Out[250]: 
jim   0           yolo_omg
      1               yolo
      2                omg
jack  0            yes_omg
      1    best_place_ever
      2                yes
      3                omg
      4               best
      5              place
      6               ever
neil  0      rofl_so_funny
      1               rofl
      2                 so
      3              funny
dtype: object
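For completeness, here is a self-contained version of the above that also flattens the stacked Series back into the two-column table from the question (`poster`/`tag` are just names I picked):

```python
import re
import pandas as pd

df = pd.DataFrame([["jim", "i was like #yolo_omg to her"],
                   ["jack", "You are so #yes_omg #best_place_ever"],
                   ["neil", "Yo #rofl_so_funny"]])

tags = [re.findall(r'#([^\s]+)', t) for t in df[1]]
st = [[t] + [s.split('_') for s in t] for t in tags]
subtokens = [[i for s in poster for i in s] for poster in st]
df2 = pd.DataFrame(subtokens, index=df[0]).stack()

# drop the per-poster counter level and flatten the MultiIndex Series
# back into a plain poster/tag DataFrame
result = (df2.reset_index(level=1, drop=True)
             .rename('tag').rename_axis('poster').reset_index())
```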
Leaf