Having a dataFrame column hosting a compound name (school name), I am interested to go over each word of the school name and find its position with respect all the dataFrame column. I used the following approach:
Create another column which is a column and iterate for each word using startswith(word) and endswith(word) for each DataFrame row. This approach works and provides the correct result in small datasets:
Data=[[0,'ABC SCHOOL BOARDING',['ABC','SCHOOL','BOARDING']],
[1,'UNIVERSITY BOARDING INSTITUTE',['UNIVERSITY','BOARDING','INSTITUTE']],
[2,'MARIE INSTITUTE SCHOOL',['MARIE', 'INSTITUTE','SCHOOL']],
[3,'RALPH ELEMENTARY SCHOOL',['RALPH','ELEMENTARY','SCHOOL']],
[4,'BOARDING SCHOOL',['BOARDING','SCHOOL']]]
df=pd.DataFrame(Data, columns=['id','name', 'name_list'])
df_allWords = pd.DataFrame(columns=['word','classification','counter'])
df2 = pd.DataFrame(columns=['word','classification'])
for index, row in df.iterrows():
for word in row.name_list:
df_start=df[(df['name'].str.startswith(word))]
start_length=len(df_start)
df_end=df[(df['name'].str.endswith(word))]
end_length=len(df_end)
if (start_length>0):
df2=df2.append(pd.DataFrame({'word':[word], 'classification':['start']}))
if (end_length>0):
df2=df2.append(pd.DataFrame({'word':[word], 'classification':['end']}))
if (end_length<1 and start_length<1):
df2=df2.append(pd.DataFrame({'word':[word], 'classification':['none']}))
print(df)
print("----")
print(df2)
print("----")
count_series=df2.groupby(['word','classification']).size()
df_count = count_series.to_frame(name = 'counter').reset_index()
df2 = df_count
df2
I need to go over each word for row in the DataFrame to find the position of that word with respect to ALL rows in the DataFrame. Then, after running the above snippet, you can get the following:
word classification counter
0 ABC start 1
1 BOARDING end 3
2 BOARDING start 3
3 ELEMENTARY none 1
4 INSTITUTE end 2
5 MARIE start 1
6 RALPH start 1
7 SCHOOL end 4
8 UNIVERSITY start 1
However, if I apply the previous approach in half million dataframe rows, it delays forever. I have not been able to finish the process after more than 8 hours.
What should be the best approach to avoid the iteration to go over row.name_list
? Is there another approach I should look at?
Update
Second Approach: Added apply to run a function for each row
Consider the same dataframe:
Data=[[0,'ABC SCHOOL BOARDING',['ABC','SCHOOL','BOARDING']],
[1,'UNIVERSITY BOARDING INSTITUTE',['UNIVERSITY','BOARDING','INSTITUTE']],
[2,'MARIE INSTITUTE SCHOOL',['MARIE', 'INSTITUTE','SCHOOL']],
[3,'RALPH ELEMENTARY SCHOOL',['RALPH','ELEMENTARY','SCHOOL']],
[4,'BOARDING SCHOOL',['BOARDING','SCHOOL']]]
df=pd.DataFrame(Data, columns=['id','name', 'name_list'])
df_class = pd.DataFrame(columns=['word','classification'])
I attempted to solved the stated problem by having already two dataframes, one for the starting words and the other for end words removing duplicates. Then, in a function I iterate over the column holding the list for the name and compare to both dataframes (start, end):
def classify(row, df_start, df_end):
df = pd.DataFrame(columns=['word','classification'])
for word in row.name_list:
flag=False
if word in df_start.values:
df=df.append(pd.DataFrame({'word':[word], 'classification':['start']}))
flag=True
if word in df_end.values:
df=df.append(pd.DataFrame({'word':[word], 'classification':['end']}))
flag=True
if (not flag):
df=df.append(pd.DataFrame({'word':[word], 'classification':['none']}))
return df
Then return a dataframe including the word and classification.
This function is called in the following manner:
df_start=pd.DataFrame(columns=['name'])
df_end=pd.DataFrame(columns=['name'])
df_start= df.name.str.split().str.get(0).drop_duplicates(keep="last")
df_end= df.name.str.split().str.get(-1).drop_duplicates(keep="last")
df_class = df.apply(classify, args=[df_start, df_end],axis=1)
However, the dataframe made of two columns (word, classification) is received as series instead of a dataframe. I was wondering what is the correct way to receive the dataframe generated for each row and add them into a single Dataframe when using apply.
After this, I will group by the Dataframe by word and classification to add a counter.
Thanks for your invaluable contribution :)