Basically, the function checks whether any keyword from a set of lists appears in a string taken from the `name` column of a dataframe, and returns the label of the first list that matches. The first loop computes a text distance against every entry in `top_1k`, which makes the process run for hours. Is there any way to make this run faster, for example through vectorization?
import re
import textdistance

def check_segment(name):
    # Replace each run of non-word characters (including '*') with one space
    name = re.sub(r'\W+', ' ', name)
    # No-op after the substitution above; kept from the original
    name = re.sub(r'\*', '', name)
    # Fuzzy match against every top-1k entry (the slow part)
    for i in top_1k:
        x = textdistance.jaro_winkler(i, name)
        if x > 0.8:
            return 'Top 1K'
    # Plain substring checks; the first list that matches wins
    for i in school:
        if i in name:
            return 'School'
    for i in hospital:
        if i in name:
            return 'Hospital'
    for i in govt:
        if i in name:
            return 'Government'
    for i in coop:
        if i in name:
            return 'Cooperative'
    for i in banks:
        if i in name:
            return 'Bank'
    for i in sarisari:
        if i in name:
            return 'Sari-Sari Store'
    for i in malls:
        if i in name:
            return 'Malls'
    for i in remittance_center:
        if i in name:
            return 'Remittance Center'
    for i in hotel:
        if i in name:
            return 'Hotels'
    for i in foundation:
        if i in name:
            return 'Foundation'
    for i in embassy:
        if i in name:
            return 'Embassy'
    # Nothing matched
    return 'SME'
Lists are created such that if a keyword from a list appears in the string, the string is given that list's label:
school = ["UNIVERSITY","ACADEMY","COLLEGE","ACADEME","SCHOOL","MONTESSORI","ELEMENTARY","HIGH SCHOOL","COLLEGIO","INSTITUTE"]
hospital = ["HOSPITAL","LABORATORY","CLINIC","MEDICAL","DIAGNOSTIC","HEALTH","DOCTOR", "HEALTHCARE"]
govt = ["DEPARTMENT OF","CITY GOVERNMENT","OFFICE OF THE","PROVINCE OF","PROVINCIAL","CITY TREASURER","REGISTRY OF","REGISTER OF",
        "BUREAU OF","MUNICIPAL","COMMISSION","PEZA","HDMF","WATER DISTRICT","HOME DEVELOPMENT MUTUAL FUND","CLERK OF COURT",
        "CITY OF","BARANGAY", "GOVERNMENT"]
coop = ["COOP", "COOPERATIVE"]
hotel = ["HOTEL","RESORT", "CONDOTEL", "HOTELIERS", "INN"]
foundation = ["FOUNDATION"]
embassy = ["EMBASSY"]
df['segment'] = df['name'].apply(check_segment)
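For the keyword part of the function, one possible sketch is to collapse each list into a single compiled regular expression, so a name is scanned once per category instead of once per keyword. This is only an illustration under some assumptions: `keyword_segment`, `CATEGORIES`, and `PATTERNS` are hypothetical names, only the lists shown above are included (`banks`, `sarisari`, `malls`, and `remittance_center` are not listed in the post), and the `top_1k` fuzzy step is left out entirely.

```python
import re

# Keyword lists copied from the post (only the ones shown there).
school = ["UNIVERSITY", "ACADEMY", "COLLEGE", "ACADEME", "SCHOOL", "MONTESSORI",
          "ELEMENTARY", "HIGH SCHOOL", "COLLEGIO", "INSTITUTE"]
hospital = ["HOSPITAL", "LABORATORY", "CLINIC", "MEDICAL", "DIAGNOSTIC",
            "HEALTH", "DOCTOR", "HEALTHCARE"]
govt = ["DEPARTMENT OF", "CITY GOVERNMENT", "OFFICE OF THE", "PROVINCE OF",
        "PROVINCIAL", "CITY TREASURER", "REGISTRY OF", "REGISTER OF",
        "BUREAU OF", "MUNICIPAL", "COMMISSION", "PEZA", "HDMF", "WATER DISTRICT",
        "HOME DEVELOPMENT MUTUAL FUND", "CLERK OF COURT", "CITY OF",
        "BARANGAY", "GOVERNMENT"]
coop = ["COOP", "COOPERATIVE"]
hotel = ["HOTEL", "RESORT", "CONDOTEL", "HOTELIERS", "INN"]
foundation = ["FOUNDATION"]
embassy = ["EMBASSY"]

# (label, keywords) pairs in the same priority order as check_segment.
CATEGORIES = [
    ("School", school),
    ("Hospital", hospital),
    ("Government", govt),
    ("Cooperative", coop),
    ("Hotels", hotel),
    ("Foundation", foundation),
    ("Embassy", embassy),
]

# One compiled alternation per category. re.escape keeps every keyword
# literal, so a search behaves like the `i in name` substring test.
PATTERNS = [
    (label, re.compile("|".join(re.escape(k) for k in keywords)))
    for label, keywords in CATEGORIES
]

def keyword_segment(name):
    """Hypothetical helper: keyword checks only, without the Top 1K step."""
    name = re.sub(r"\W+", " ", name)
    for label, pattern in PATTERNS:
        if pattern.search(name):
            return label
    return "SME"
```

It could be applied the same way as the original, e.g. `df['name'].apply(keyword_segment)`. The same patterns should also work with `Series.str.contains` if a column-wise version is preferred, though that is not shown here.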
The input dataframe is:
Name |
---|
WORLD FOUNDATION |
SUNNY RESORT |
COOPERATIVE SOCIETY |
CITY GOVERNMENT OF PLAZA |
COLLEGE OF MUSIC |
After applying the function, the output dataframe is:
Name | Segment |
---|---|
WORLD FOUNDATION | Foundation |
SUNNY RESORT | Hotels |
COOPERATIVE SOCIETY | Cooperative |
CITY GOVERNMENT OF PLAZA | Government |
COLLEGE OF MUSIC | School |
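Since the Jaro-Winkler loop runs once per row, one cheap thing to try is caching results by name, so each distinct string is classified only once. The sketch below assumes the `name` column contains repeated values; if every name is unique it will not help. `slow_segment` is a hypothetical stand-in for `check_segment`, used only so the example runs on its own.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function really runs

def slow_segment(name):
    """Hypothetical stand-in for check_segment, just to make the sketch runnable."""
    global calls
    calls += 1
    return "School" if "SCHOOL" in name else "SME"

# Cache results by name so each distinct string is classified once.
cached_segment = lru_cache(maxsize=None)(slow_segment)

names = ["ABC SCHOOL", "XYZ TRADING", "ABC SCHOOL", "ABC SCHOOL"]
segments = [cached_segment(n) for n in names]
```

Here four lookups trigger only two real calls. This does not reduce the cost of a single comparison against `top_1k`; it only avoids repeating that work for duplicate names.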