want to group categorical values in a column

Question

I am trying to group the values of a column 'neighborhood' and assign a numeric value to each group. The column has values like: #Queens#Jackson Heights#, #Manhattan#Upper East Side#Sutton Place#, #Brooklyn#Williamsburg#, #Bronx#East Bronx#Throgs Neck#. (Values have 2 or 3, sometimes 4 or 5, hashtags.) I used a plain if/else chain inside a for loop, which worked fine for the first 3 values, as shown in the attached image [1], but I am not sure it is working correctly. Please help me group these values and assign a number to each group. The loop I used is below:

# Create a list to store the data
grades = []
# For each row in the column,
for row in new_train1['neighborhood']:
    # if more than a value,
    if row > '#Queens#':
        # Append a num grade
        grades.append('1')
    # else, if more than a value,
    elif row > '#Manhattan#':
        # Append a letter grade
        grades.append('2')
    # else, if more than a value,
    elif row > '#Bronx#':
        # Append a letter grade
        grades.append('3')
    # else, if more than a value,
    elif row > '#Brooklyn#':
        # Append a letter grade
        grades.append('4')
    # else, if more than a value,
    else:
        # Append a failing grade
        grades.append('0')

[1]: https://i.stack.imgur.com/iQ3E8.png

Your question is unclear. What are your inputs and expected output? Please supply a [mcve]. — jpp, Jun 18 '18 at 17:36
Further, what do you mean by row > '#Manhattan'? I'm not sure how you can write a condition that way. — skrubber, Jun 18 '18 at 17:48
You want to assign a code for the set of Queens, Manhattan, Bronx, and Brooklyn? Are they guaranteed to always be first? — Kyle, Jun 18 '18 at 17:53
@Rucha: If I understood correctly, first you have to split `str` by `#` — ramesh, Jun 18 '18 at 18:27
@Kyle Yes, the values always appear as given: #Queens.., #Manhattan.., #Bronx.. The input is the column neighborhood, which has the values mentioned above, and I want to assign numeric values to them (for example, all the areas that start with #Queens should have value "1", all starting with #Manhattan.. should have "2", #Bronx.. = 4) — Rucha, Jun 19 '18 at 18:12
@ramesh I see what you mean, but is the if/else loop in my main question incorrect? Why does it work for only the first 2 values and then give numeric value "3" for values starting with both "#Bronx..." and "#Brooklyn.."? (Please see the image attached to the main question for reference.) Thanks — Rucha, Jun 19 '18 at 18:18
@jpp I want to assign numeric values like this: values starting with "#Queens#.." should have numeric value "1", values starting with "#Manhattan" = 2, "#Bronx#.." = 3, "#Brooklyn#.." = 4 (please see the image for reference) — Rucha, Jun 19 '18 at 18:27
@skrubber Even I am not sure whether I can write the condition the way I have in the if/else loop. But it works for the first two values (see the image attached to the main question), and it assigns numeric value 3 to values starting with both "#Bronx.." and "#Brooklyn.."! I want to know why it works for Queens and Manhattan but not for the other two neighborhoods, and how to actually do this. — Rucha, Jun 19 '18 at 18:33
A couple of things: never paste images, since they are hard to replicate; post the data instead. Help is here: [cvm](https://stackoverflow.com/help/mcve). And then, what is the data in the series (new_train1['neighborhood'])? — skrubber, Jun 19 '18 at 18:40
@skrubber Noted. new_train1 is the dataset (csv format) and ['neighborhood'] is the name of the column — Rucha, Jun 19 '18 at 19:36
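
On why the loop in the question gives "3" to both the Bronx and the Brooklyn rows: Python's `>` on strings is a character-by-character (lexicographic) comparison, so `row > '#Queens#'` asks whether the row sorts after that string, not whether it starts with it. The sketch below reproduces the chain from the question as a function and contrasts it with a prefix check using `str.startswith`. The function names `grade_by_comparison` and `grade_by_prefix` are made up for this illustration, and the codes are the ones given in the comments (Queens 1, Manhattan 2, Bronx 3, Brooklyn 4).

```python
def grade_by_comparison(row):
    # Same chain as in the question: '>' compares strings lexicographically.
    if row > '#Queens#':
        return '1'
    elif row > '#Manhattan#':
        return '2'
    elif row > '#Bronx#':
        return '3'
    elif row > '#Brooklyn#':
        return '4'
    return '0'


def grade_by_prefix(row):
    # Prefix check: tests what the row starts with, not where it sorts.
    prefixes = {'#Queens#': '1', '#Manhattan#': '2',
                '#Bronx#': '3', '#Brooklyn#': '4'}
    for prefix, code in prefixes.items():
        if row.startswith(prefix):
            return code
    return '0'


rows = ['#Queens#Jackson Heights#',
        '#Manhattan#Upper East Side#Sutton Place#',
        '#Bronx#East Bronx#Throgs Neck#',
        '#Brooklyn#Williamsburg#']
print([grade_by_comparison(r) for r in rows])  # ['1', '2', '3', '3']
print([grade_by_prefix(r) for r in rows])      # ['1', '2', '3', '4']
```

"#Brooklyn#..." sorts after "#Bronx#" because "o" comes after "n" in the fourth letter, so the Bronx branch catches it and the Brooklyn branch is never reached.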

ramesh · Answer 1 · 2018-06-19T22:36:08.780

0

Please avoid pasting images and making people test their typing skills. If I understood your problem correctly, I'd do something like this:

import pandas as pd

# Create a sample data frame
df = pd.DataFrame({"A":[1,2,3,4,5], "B":["#Queens#Jackson Heights#", "Manhattan#Upper East Side#Sutton Place#", "Bronx#West East Side#", "Manhattan#Upper East Side#", "#Manhattan#Downtown#Chelsea"]})
# Create the replacement dictionary
replace_dic = {"Queens":1, "Jackson Heights":2, "Manhattan":3, "Upper East Side":4, "Sutton Place":5,
              "Bronx":6, "West East Side":7, "Downtown":8, 'Chelsea':9}
# Split on "#" and replace every non-empty piece with its code
df['C'] = df['B'].str.split("#").apply(lambda x: [replace_dic[i] for i in x if i != ''])
# Result
    A   B   C
0   1   #Queens#Jackson Heights#    [1, 2]
1   2   Manhattan#Upper East Side#Sutton Place#     [3, 4, 5]
2   3   Bronx#West East Side#   [6, 7]
3   4   Manhattan#Upper East Side#  [3, 4]
4   5   #Manhattan#Downtown#Chelsea     [3, 8, 9]

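Based on your comments, I think you are looking for something like this:

def replacefunc(x):
    # Drop the empty strings that the split leaves around the "#" separators
    x = [i for i in x if i != '']
    # Look up the code of the first name only
    return replace_dic[x[0]]
df['D'] = df['B'].str.split("#").apply(replacefunc)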
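
The same first-name lookup can also be written without `apply`, using only pandas string methods. This is a minimal sketch and not part of the answer above: the column name `borough_code` and the "#Staten Island#Tottenville#" row are made up for illustration, and the codes follow the mapping requested in the comments, with 0 as an assumed fallback for names that are not in the dictionary.

```python
import pandas as pd

# Assumed codes, taken from the asker's comments; 0 is the fallback.
borough_codes = {'Queens': 1, 'Manhattan': 2, 'Bronx': 3, 'Brooklyn': 4}

df = pd.DataFrame({'neighborhood': [
    '#Queens#Jackson Heights#',
    '#Manhattan#Upper East Side#Sutton Place#',
    '#Brooklyn#Williamsburg#',
    '#Bronx#East Bronx#Throgs Neck#',
    '#Staten Island#Tottenville#',  # hypothetical value missing from the mapping
]})

# Strip the outer '#', split on the inner ones, keep the first piece.
borough = df['neighborhood'].str.strip('#').str.split('#').str[0]

# Unmapped or missing names become NaN, which is replaced by the fallback 0.
df['borough_code'] = borough.map(borough_codes).fillna(0).astype(int)
print(df['borough_code'].tolist())  # [1, 2, 4, 3, 0]
```

Using `map` with a fallback avoids the `KeyError` that a plain dictionary lookup raises for a name that is not in the dictionary.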

edited Jun 19 '18 at 22:36

answered Jun 19 '18 at 22:08

ramesh


Thank you so much. I now understand how to create a dictionary. – Rucha Jun 21 '18 at 21:50
It depends! If you have many values and are fine with assigning arbitrary integers, you can simply iterate over the unique values of the column – ramesh Jun 22 '18 at 01:56
Sure. Thank you so much @ramesh – Rucha Jun 25 '18 at 15:37
By any chance, would you be able to help with another question that I posted regarding the same problem? Question topic: Error: IndexError: list index out of range – Rucha Jun 25 '18 at 15:48
It would be difficult for me to search for your question. Share the link here. Perhaps there are other people here who can help you out. – ramesh Jun 25 '18 at 21:34
https://stackoverflow.com/questions/50977466/error-indexerror-list-index-out-of-range?noredirect=1#comment88951066_50977466 – Rucha Jun 26 '18 at 16:16
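
The suggestion above, assigning an arbitrary integer to each unique value, does not need a hand-written loop: `pandas.factorize` numbers the distinct values in order of first appearance. The following is a sketch under assumptions. The sample values and the variable names are made up for illustration, and the `+ 1` only makes the codes start at 1 instead of 0.

```python
import pandas as pd

# Hypothetical sample of the 'neighborhood' column.
neighborhood = pd.Series([
    '#Queens#Jackson Heights#',
    '#Manhattan#Upper East Side#',
    '#Queens#Astoria#',
    '#Brooklyn#Williamsburg#',
])

# First name in each value.
first = neighborhood.str.strip('#').str.split('#').str[0]

# factorize returns integer codes (starting at 0) and the unique values.
codes, uniques = pd.factorize(first)
borough_code = pd.Series(codes + 1, index=neighborhood.index)

print(list(uniques))          # ['Queens', 'Manhattan', 'Brooklyn']
print(borough_code.tolist())  # [1, 2, 1, 3]
```

The resulting numbers depend on the order in which the values appear, so this only fits when any consistent integer per group is acceptable.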

score 0 · Answer 2 · answered Jun 21 '18 at 21:48

Thank you all for your help and input. I removed the hashtags with a simple split and then used a for loop to check just the first word in each row. It gives me the expected output but also raises an index out of range error, which I'm still working on. The code is below:

train = pd.DataFrame(train, columns = ['id','listing_type','floor','latitude','longitude','price','beds','baths','total_rooms','square_feet','pet_details','neighborhood'])
# Create a list to store the codes
grades = []

# For each row, split on '#'; row[0] is the empty string before the leading '#'
for row in train['neighborhood'].str.split('#'):
    # Queens -> '1'
    if row[1] == 'Queens':
        grades.append('1')
    # Manhattan -> '2'
    elif row[1] == 'Manhattan':
        grades.append('2')
    # Bronx -> '3'
    elif row[1] == 'Bronx':
        grades.append('3')
    # Brooklyn -> '4'
    elif row[1] == 'Brooklyn':
        grades.append('4')
    # Anything else -> '0'
    else:
        grades.append('0')
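
One plausible cause of the index out of range error (an assumption, since the data is not shown) is a value with no "#" in it, or an empty string: splitting it yields a list with a single element, so `row[1]` does not exist. A missing value would fail as well, because it is not a list at all. The sketch below shows a defensive way to pick the first name; the helper `first_name`, the `codes` dictionary, and the sample values are made up for illustration.

```python
def first_name(value):
    # Missing values (NaN is a float) and other non-strings have no name.
    if not isinstance(value, str):
        return None
    # Keep only non-empty pieces, so a leading '#' does not matter.
    pieces = [p for p in value.split('#') if p]
    return pieces[0] if pieces else None


codes = {'Queens': '1', 'Manhattan': '2', 'Bronx': '3', 'Brooklyn': '4'}

# Hypothetical values, including the problematic shapes.
values = ['#Queens#Jackson Heights#', 'Manhattan#Upper East Side#', '', float('nan')]

# Unknown, empty, and missing values fall back to '0'.
grades = [codes.get(first_name(v), '0') for v in values]
print(grades)  # ['1', '2', '0', '0']
```

Looking the name up with `codes.get(..., '0')` replaces the whole if/elif chain and keeps the '0' fallback of the loop above.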
