0

Situation

I've run Google's NLP sentiment analysis, which returned the column 'sentiment' with key value pairs for magnitude and score, per below:

Sentiment Analysis Results

This is my results, in the sentiment column, for dataframe df03.

index text02 sentiment
01 Max Muncy is great! magnitude: 0.8999999761581421\nscore: 0.8999999761581421
02 The worst Dodger is Max muncy. magnitude: 0.800000011920929\nscore: -0.800000011920929
03 Max Muncy was great, but not so much now. magnitude: 0.4000000059604645\nscore: -0.4000000059604645
04 What a fantastic guy, that Max muncy. magnitude: 0.8999999761581421\nscore: 0.8999999761581421

Goal

I want to split the sentiment column into two columns, titled sentiment - magnitude and sentiment - score, with the column values listed accordingly.

The data format is newline delimited:

magnitude: 0.8999999761581421\nscore: 0.899999…

So I'm trying the Series.str.split method, like this:

df03['sentiment'].str.split(pat="\n", expand=True)

I'm not very familiar with ReGex, but did note that \n represents line feed, so figured that would be the right value to insert for the pat parameter.

The result is that all values are returned NaN.

index 0
01 NaN
02 NaN
03 NaN
04 NaN

I've tried a couple of different things, but none worked. df03['sentiment'].str.split(r"\n", expand=True) df03['sentiment'].str.split(pat=r"\n", expand=True)

I presume the problem is that \ is creating some kind of regex escape that's nullifying the n, but I'm not seeing anything on regexr.com to confirm that.

There's also the issue of splitting out the terms magnitude and score and placing them in the column headers, which I don't know if expand=True would include or not.

Greatly appreciate any input on what I'm doing wrong and where to focus troubleshooting.

doug

APPENDED

Original created dataframe:

index text02
01 Max Muncy is great!
02 The worst Dodger is Max muncy.
03 Max Muncy was great, but not so much now.
04 What a fantastic guy, that Max muncy.

df03['sentiment']

01    magnitude: 0.8999999761581421\nscore: 0.899999...
02    magnitude: 0.800000011920929\nscore: -0.800000...
03    magnitude: 0.4000000059604645\nscore: -0.40000...
04    magnitude: 0.8999999761581421\nscore: 0.899999...
Name: sentiment, dtype: object

APPENDED 02

Ran this

df03['sentiment'].astype(str).str.split(pat=r"\\n| ", expand=True)

Returned this (not sure how to format like tables above)

|index|0|1|2|
|---|---|---|---|
|01|magnitude:|0\.8999999761581421
score:|0\.8999999761581421
|
|02|magnitude:|0\.800000011920929
score:|-0\.800000011920929
|
|03|magnitude:|0\.4000000059604645
score:|-0\.4000000059604645
|
|04|magnitude:|0\.8999999761581421
score:|0\.8999999761581421
|
dsx
  • 167
  • 1
  • 12
  • Re: the table formatting, when I was finally able to edit your original post yesterday, I saw what the issues were that made them not render properly. The table syntax does not work if it is placed inside a code block, which is either inside three sets of backticks (```), or indented. If you remove those, it will start trying to format things like tables. However, with the last one you have added, you will also have the problem of it not liking the line break. I had to alter your previous one like that to just contain `\n` for the table to render with text in the correct cells. – baileythegreen Mar 31 '22 at 10:08

1 Answers1

3

You need to specify the regular expression like this (with two slashes, and as a raw string):

df['sentiment'].str.split(pat=r"\\n", expand=True)

Here df and df['sentiment'] evaluate to:

df
index text02 sentiment
1 Max Muncy is great! magnitude: 0.8999999761581421\nscore: 0.89999...
2 The worst Dodger is Max muncy. magnitude: 0.800000011920929\nscore: -0.80000...
3 Max Muncy was great, but not so much now. magnitude: 0.4000000059604645\nscore: -0.4000...
4 What a fantastic guy, that Max muncy. magnitude: 0.8999999761581421\nscore: 0.89999...
df['sentiment']
index             
1    magnitude: 0\.8999999761581421\nscore: 0.89999...
2    magnitude: 0\.800000011920929\nscore: -0.80000...
3    magnitude: 0\.4000000059604645\nscore: -0.4000...
4    magnitude: 0\.8999999761581421\nscore: 0.89999...
Name: sentiment, dtype: object

(which I think is your df03).

With those inputs, df['sentiment'].str.split(pat=r"\\n", expand=True) gives:

index 0 1
1 magnitude: 0.8999999761581421 score: 0.8999999761581421
2 magnitude: 0.800000011920929 score: -0.800000011920929
3 magnitude: 0.4000000059604645 score: -0.4000000059604645
4 magnitude: 0.8999999761581421 score: 0.8999999761581421

To rename the columns to Magnitude and Score, and remove those strings from the dataframe, you can modify the regular expression to split on either a line feed, or a space, then rename the columns. Then, selecting only the ones you want to keep, gives:

new = df['sentiment'].str.split(pat=r"\\n| ", expand=True)
new.columns = ["", "Magnitude", "", "Score"]
new[["Magnitude", "Score"]]
index Magnitude Score
1 0.8999999761581421 0.8999999761581421
2 0.800000011920929 -0.800000011920929
3 0.4000000059604645 -0.4000000059604645
4 0.8999999761581421 0.8999999761581421

Addendum

OP had to make some additional tweaks to achieve the result I got. These were using astype(str) to explicitly cast the value to a string, and removing the regex entirely, once they did that:

new = df['sentiment'].astype(str).str.split(expand=True)

By default, Series.str.split() splits on any whitespace, which sounds like the actual inputs being used have some unusual formatting where the last cell contains a line break, but is not expressing it as \n; without actually seeing the original, it's still a bit unclear.

baileythegreen
  • 1,126
  • 3
  • 16
  • Thanks @baileythegreen. What I'm trying to achieve is having only the `magnitude` and `score` numerical values in the columns, with the terms `magnitude` and `score` as the column headers or labels. Is that possible with `Series.str.split`? – dsx Mar 29 '22 at 19:56
  • Hi @dsx, I've updated my answer to show how you could do this. I don't know of a way to do it with just `Series.str.split` as that does not seem to have a way to name columns, and even if it did, also obtaining the new name from within the column would probably not be simple. – baileythegreen Mar 29 '22 at 20:28
  • Thanks @baileythegreen I'm running into something, as the following line is returning a single column with four rows of `NaN`. `df03['sentiment'].str.split(pat=r"\\n| ", expand=True)` I also tried `df03['sentiment'].str.split(pat=r"\\n", expand=True)` with the same result. If the split and RegEx were returning multiple values, I'd think `expand=True` would work, so am guessing it's a problem with the RegEx not identifying a place to split the cell content? Could that be due to `\n` matching a line feed, and not being identified as a text element? – dsx Mar 29 '22 at 22:20
  • @dsx, is there anything in the output column other than the `NaN` values? If I change my test input so that the `score` values are actually physically on a new line, rather than putting `\n` into the file, I end up with columns with `NaN` as every other value. I'm not able to reproduce what you're seeing, I think, which probably means there is a difference in the input I'm using. Have you verified that `df03['sentiment']` returns the values you would expect given your `df03`? Or, does that return `NaN` values, as well? – baileythegreen Mar 29 '22 at 23:20
  • Thx @bailythegreen. There's nothing else in the output other than the `NaN` values. I tested `df03['sentiment']`, and it produces a column with 4 rows, as expected, in newline delimited format, with `\n` separating the magnitude and score key/value pairs. I built the `df03` dataframe from scratch with pandas, adding the text strings and data index. – dsx Mar 30 '22 at 18:20
  • If you are building it from scratch, then can you copy the code for some number of rows into the question? If it doesn't need to pull data from anywhere else, then I can try my suggestion with the same input you have, which will hopefully mean I can reproduce what happens for you. – baileythegreen Mar 30 '22 at 18:44
  • Yes, thanks @baileythegreen. Under the 'APPENDED' section in the original post, I've added the initial `df03` dataframe (the one i originally created), and then the output of `df03['sentiment']` after running the sentiment analysis. – dsx Mar 30 '22 at 21:29
  • 1
    @dsx, from what I've been able to find, `NaN` values are expected when: the value of a cell is `NaN` to start with; the delimiter character doesn't exist in the cell value; or when the cell contents are something other than a string. I wouldn't expect any of these to be true here. That said, I still can't produce the output you're getting, no matter what I try. Can you see if explicitly converting the contents of df['sentiment'] to a string helps? Like this -> `df['sentiment'].astype(str).str.split(pat=r"\\n| ", expand=True)` – baileythegreen Mar 31 '22 at 01:17
  • Yes, that produced an interesting split. I've added it above, in APPENDED 02. Then I tried it with just `df03['sentiment'].astype(str).str.split(expand=True)` and it split them into 4 columns. The RegEx was blocking one of the splits, I guess? I only needed the values, but I can definitely work with those results. Any further issues and I'll post a new question. Greatly appreciate you guiding me through this. – dsx Mar 31 '22 at 02:42
  • 1
    Interesting. The regex is the only way I can get the 4 columns, which sounds like there's some invisible difference between how your input looks, and what it contains; or maybe there is different behaviour between versions of `pandas`, as happens sometime (not that I've found anything about that). The four column result was the intermediate I used in my original reply, then naming the columns and just keeping the ones you wanted. Of course, there are many other ways to eliminate the unwanted text. Glad something finally worked for you! – baileythegreen Mar 31 '22 at 10:14