0

I have a csv file(or dataframe) like below :

Text    Location    State
A   Florida, USA    Florida
B   NY              New York
C       
D   abc 

And a dictionary with key value pair as :

stat_map = {
        'FL': 'Florida',
        'NY': 'NewYork',
        'AR': 'Arkansas',
}

How may I delete row 3rd and 4th i.e. row with Text C & D so that my dataframe contains only those rows for which i have value in dictionary. The final output should look like :

Text    Location    State
    A   Florida, USA    Florida
    B   NY              New York

Please help.

Alex
  • 81
  • 4
  • 10
  • Hi (again) @Ritika, you should always add some sample data to illustrate your question (like data you [mentioned previously](https://stackoverflow.com/questions/46748369/how-to-split-dataframe-column-into-two-parts-and-replace-column-with-splitted-va/46749556?noredirect=1#comment80464195_46749556)). – FabienP Oct 15 '17 at 20:30
  • Hi Fabien, I am new to Python and stack overflow, sorry for these problems. I updated my post. What I am actually doing is that i am downloading some tweets in a json file, extracting text and location from the json file into a dataframe. Now I further need to aggregate all tweets location wise and perform sentiment analysis and topic modelling location wise. The dataframe location column has values like 'CA', 'USA', ' XYZ, CA', 'CA, USA', 'California'. Some are also blank and some are random values. Could you please help me with the sentiment analysis and topic modelling state wise. – Alex Oct 15 '17 at 20:45
  • The code i posted above is to replace different formats of location to one particular location format for all tweets. But this gives error. Further i need to do sentiment analysis and topic modelling on those tweets. – Alex Oct 15 '17 at 20:54
  • You just replaced this question with a new one after it had been answered. Moreover it is duplicate of the [new answer](https://stackoverflow.com/questions/46784707/python-deleting-rows-from-dataframe-for-which-value-does-not-exist-in-dictionar) you opened. I don't understand why you did that, it is very bad practice, the answer has no context anymore and won't serve to anyone. – FabienP Oct 17 '17 at 17:55

1 Answers1

2

What you're looking for is pandas.Series.map(), which replaces a value by that provided in a mapper, here states_map.

I will reuse data from you previous question for illustration

import pandas as pd

states_map = {
        'AK': 'Alaska',
        'AL': 'Alabama',
        'AR': 'Arkansas',
        'CA': 'California',  # Enrich the dict for the current example
        'NY': 'New York'     # Same as above
}

>>> df
Out[]:
               State
0    California, USA
1  Beverly Hills, CA
2         California
3                 CA
4            NY, USA
5                USA

Using the discussed method with map will give

states = df['State'].str.split(', ').str[0]

>>> states
Out[]:
0       California
1    Beverly Hills
2       California
3               CA
4               NY
5              USA
Name: State, dtype: object

>>> states.map(states_map)
Out[]:
0           NaN
1           NaN
2           NaN
3    California
4      New York
5           NaN
Name: State, dtype: object

But this is not optimal, as you loose information from row 1 with the split and from rows 0 and 2 with the map.

I think it can be done better like this:

Get all terms from split using expand=True

df_parts = df.State.str.split(', ', expand=True)

>>> df_parts
Out[]:
               0     1
0     California   USA
1  Beverly Hills    CA
2     California  None
3             CA  None
4             NY   USA
5            USA  None

Get places where the state is correct

mask = df_parts.isin(states_map.values())

>>> df_parts[mask]
Out[]:
            0    1
0  California  NaN
1         NaN  NaN
2  California  NaN
3         NaN  NaN
4         NaN  NaN
5         NaN  NaN

Using ~ (bitwise NOT) gives us the inverse of the mask.

df_unknown = df_parts[~mask]

>>> df_unknown
Out[]:
               0     1
0            NaN   USA
1  Beverly Hills    CA
2            NaN  None
3             CA  None
4             NY   USA
5            USA  None

Use map where state is not known

>>> df_unknown.apply(lambda col: col.map(states_map))
Out[]:
            0           1
0         NaN         NaN
1         NaN  California
2         NaN         NaN
3  California         NaN
4    New York         NaN
5         NaN         NaN

And set these values in masked df_parts

df_parts[~mask] = df_unknown.apply(lambda col: col.map(states_map))

>>> df_parts
Out[]:
            0           1
0  California         NaN
1         NaN  California
2  California         NaN
3  California         NaN
4    New York         NaN
5         NaN         NaN

Reunify values

>>> df_parts[0].fillna(df_parts[1])  # Fill blanks in col 1 with values in col 2
Out[]:
0    California
1    California
2    California
3    California
4      New York
5           NaN
Name: 0, dtype: object

Replace curated values in original dataframe

df['State_new'] = df_parts[0].fillna(df_parts[1])

>>> df
Out[]:
               State   State_new
0    California, USA  California
1  Beverly Hills, CA  California
2         California  California
3                 CA  California
4            NY, USA    New York
5                USA         NaN

It may not be a perfect approach, but hope it will help.

FabienP
  • 3,018
  • 1
  • 20
  • 25
  • Hey Fabien, Thanks for your reply dear. But mask = df_parts.isin(states_map.values()) – Alex Oct 15 '17 at 22:27
  • I manually removed errors and it worked. But how can I use strip or any other function in above code so that it works fine. – Alex Oct 15 '17 at 23:09
  • The state column values have spaces as i extracted them from a json file and saved in dataframe. So wherever spaces are there, the logic shared by you does not work, If there are no spaces, it works perfectly. Could you please advise how can i remove spaces. Tried using strip function while storing values in State as df['Location'] = list(map(lambda t:t['user']['location'].strip(), data)) but get error that 'NoneType' object has no attribute 'strip' – Alex Oct 15 '17 at 23:22
  • Hey Fabien, it worked. Really appreciate your help. Thank you so so much. I got the dataframe now with the text and state column. Would you mind helping me further that how can I do sentiment analysis and topic modelling state wise now. I dont know how to structure the loop further now to do these state wise. – Alex Oct 15 '17 at 23:47
  • @Ritika, you're welcome, glad it worked. You should try a new question if you still need it. – FabienP Oct 16 '17 at 05:24
  • Hi Fabien, Need one more help please related to this issue. As I stored the vales in State column of dataframe from a json file, I have a long list of some state values are very random like 'XYZ' or blank which does not even exist. How should I remove such values further from my dataframe. I am updating this post for this problem where I am stuck now so that you could understand the issue and possibly help me one more time – Alex Oct 17 '17 at 07:14
  • @Ritika I suggest you address this issue in a new question, with a sample of your df, an your desired output as an example. Then post the link here in comment so I will see it for sure. I will have a look. – FabienP Oct 17 '17 at 07:41