-1

Having a dataFrame column hosting a compound name (school name), I am interested to go over each word of the school name and find its position with respect all the dataFrame column. I used the following approach:

Create another column which is a column and iterate for each word using startswith(word) and endswith(word) for each DataFrame row. This approach works and provides the correct result in small datasets:

Data=[[0,'ABC SCHOOL BOARDING',['ABC','SCHOOL','BOARDING']],
      [1,'UNIVERSITY BOARDING INSTITUTE',['UNIVERSITY','BOARDING','INSTITUTE']],
      [2,'MARIE INSTITUTE SCHOOL',['MARIE', 'INSTITUTE','SCHOOL']],
      [3,'RALPH ELEMENTARY SCHOOL',['RALPH','ELEMENTARY','SCHOOL']],
      [4,'BOARDING SCHOOL',['BOARDING','SCHOOL']]]

df=pd.DataFrame(Data, columns=['id','name', 'name_list'])
df_allWords = pd.DataFrame(columns=['word','classification','counter'])

df2 = pd.DataFrame(columns=['word','classification'])

for index, row in df.iterrows():
    for word in row.name_list:
        df_start=df[(df['name'].str.startswith(word))]
        start_length=len(df_start)

        df_end=df[(df['name'].str.endswith(word))]
        end_length=len(df_end)

        if (start_length>0):
            df2=df2.append(pd.DataFrame({'word':[word], 'classification':['start']}))
        if (end_length>0):
            df2=df2.append(pd.DataFrame({'word':[word], 'classification':['end']}))
        if (end_length<1 and start_length<1):
            df2=df2.append(pd.DataFrame({'word':[word], 'classification':['none']}))

print(df)
print("----")

print(df2)
print("----")

count_series=df2.groupby(['word','classification']).size()
df_count = count_series.to_frame(name = 'counter').reset_index()
df2 = df_count

df2

I need to go over each word for row in the DataFrame to find the position of that word with respect to ALL rows in the DataFrame. Then, after running the above snippet, you can get the following:

    word    classification  counter
0   ABC start   1
1   BOARDING    end 3
2   BOARDING    start   3
3   ELEMENTARY  none    1
4   INSTITUTE   end 2
5   MARIE   start   1
6   RALPH   start   1
7   SCHOOL  end 4
8   UNIVERSITY  start   1

However, if I apply the previous approach in half million dataframe rows, it delays forever. I have not been able to finish the process after more than 8 hours.

What should be the best approach to avoid the iteration to go over row.name_list? Is there another approach I should look at?


Update

Second Approach: Added apply to run a function for each row

Consider the same dataframe:

Data=[[0,'ABC SCHOOL BOARDING',['ABC','SCHOOL','BOARDING']],
      [1,'UNIVERSITY BOARDING INSTITUTE',['UNIVERSITY','BOARDING','INSTITUTE']],
      [2,'MARIE INSTITUTE SCHOOL',['MARIE', 'INSTITUTE','SCHOOL']],
      [3,'RALPH ELEMENTARY SCHOOL',['RALPH','ELEMENTARY','SCHOOL']],
      [4,'BOARDING SCHOOL',['BOARDING','SCHOOL']]]

df=pd.DataFrame(Data, columns=['id','name', 'name_list'])
df_class = pd.DataFrame(columns=['word','classification'])

I attempted to solved the stated problem by having already two dataframes, one for the starting words and the other for end words removing duplicates. Then, in a function I iterate over the column holding the list for the name and compare to both dataframes (start, end):

def classify(row, df_start, df_end):
    df = pd.DataFrame(columns=['word','classification'])
    for word in row.name_list:
        flag=False
        if word in df_start.values:
            df=df.append(pd.DataFrame({'word':[word], 'classification':['start']}))
            flag=True
        if word in df_end.values:
            df=df.append(pd.DataFrame({'word':[word], 'classification':['end']}))
            flag=True
        if (not flag):
            df=df.append(pd.DataFrame({'word':[word], 'classification':['none']}))
    return df

Then return a dataframe including the word and classification.

This function is called in the following manner:

df_start=pd.DataFrame(columns=['name'])
df_end=pd.DataFrame(columns=['name'])
df_start= df.name.str.split().str.get(0).drop_duplicates(keep="last")
df_end= df.name.str.split().str.get(-1).drop_duplicates(keep="last")

df_class = df.apply(classify, args=[df_start, df_end],axis=1)

However, the dataframe made of two columns (word, classification) is received as series instead of a dataframe. I was wondering what is the correct way to receive the dataframe generated for each row and add them into a single Dataframe when using apply.

After this, I will group by the Dataframe by word and classification to add a counter.

Thanks for your invaluable contribution :)

John Barton
  • 1,581
  • 4
  • 25
  • 51
  • It looks like `name_list` is just list of words in `name`. Is it intentional or just coincident? – Quang Hoang Dec 12 '19 at 19:08
  • `name_list` is a list of words made of `name`. The original column is just `name`, but I created the list to iterate over each word of the `name` field – John Barton Dec 12 '19 at 19:09
  • So in other words, you just want to find the start and end words in `'name'`? – Quang Hoang Dec 12 '19 at 19:10
  • The problem is that I need to compare startswith()/endswith() for each word in `name` in each row of the DataFrame. – John Barton Dec 12 '19 at 19:11
  • Oh, so the 2nd `SCHOOL` is `end` because it is end of some other rows? Also, if it's end of some row and is start of some other rows, you expect 2 rows in `df2` with `start` and `end` in each? – Quang Hoang Dec 12 '19 at 19:14
  • Exactly, we can have `GEORGE ELEMENTARY SCHOOL` as input. Then, `GEORGE` is checked if startswith('GEORGE) in all dataframe inputs, also checked if endswith('GEORGE') in all dataframe rows. In this case, for `ELEMENTARY` you can find that this word is getting matches with endswith('ELEMENTARY') because there are rows in Dataframe ending with `ELEMENTARY` word. – John Barton Dec 12 '19 at 19:18
  • We need you to edit material clarifications into the main question body, including illustrative examples. – Prune Dec 12 '19 at 19:22
  • Sure, I am working in a better example – John Barton Dec 12 '19 at 19:28
  • I added the full problem with additional code to be clearer. – John Barton Dec 12 '19 at 19:40

1 Answers1

0

I gather that the ultimate purpose of this is to put the first and last word of each school name into a new DF. If so, your name_list is redundant. Also, your logic is doing a lot of df overhead that should be done with simple string operations.

Re-cast your problem: for each string in your list of schools, you want the first and last words, tagged with the school index.

Data=[[0,'ABC SCHOOL BOARDING',['ABC','SCHOOL','BOARDING']],
      [1,'UNIVERSITY CALIFORNIA INSTITUTE',['UNIVERSITY','CALIFORNIA','INSTITUTE']],
      [2,'MARIE JOSEPH HIGH SCHOOL',['MARIE', 'JOSEPH','HIGH','SCHOOL']],
      [3,'RALPH ELEMENTARY SCHOOL DISTRICT',['RALPH','ELEMENTARY','SCHOOL','DISTRICT']],
      [4,'UNIVERSAL SCHOOL ADMINISTRATION',['UNIVERSAL','SCHOOL','ADMINISTRATION']]]

# extract just the school names; I expect that this is the original list
school_list = [school[1] for school in Data]

# Split the name, and extract the first and last words.
# Label with the school's index in the list.
# Yes, this has a redundant "split"; you can make this a loop or vectorized operation.
word_list = [((idx, school_name.split()[0]), (idx, school_name.split()[-1]))
              for idx, school_name in enumerate(school_list)]

print(word_list)

Result so far:

[((0, 'ABC'), (0, 'BOARDING')), ((1, 'UNIVERSITY'), (1, 'INSTITUTE')), ((2, 'MARIE'), (2, 'SCHOOL')), ((3, 'RALPH'), (3, 'DISTRICT')), ((4, 'UNIVERSAL'), (4, 'ADMINISTRATION'))]

You now have a list of pairs, one pair per school. You can flatten this list -- or whatever other organization you truly need -- and then make a data frame.

If it works better for you, perhaps put the first-last portion into a generator, so you build only one item of 500k records.

Prune
  • 76,765
  • 14
  • 60
  • 81