I have two dataframes with two columns each
- event: column names --> "District" "City"
- villages: column names --> "District" "Village"
"event" dataframe has district names and events happening in its district villages. The "City" column is a sentance data which has village names of corresponding district hidden between sentances.
data in event:
District City
Dst1 Dance program occured in village1 near highway
Dst1 Regional gatherting atvillage2 --> note: "at" and "village2" typo error partial search required !
Dst2 Music showsin village3with famous songs --> note: i am not concerned about "showsin" but "village1with" has village3 name
Dst3 Sunset is pretty nice in these area
.
.
.
.
.
"villages" dataframe has district names and complete villages list under each district
data in villages:
District Village
Dst1 Village1
Dst1 Village2
Dst2 Village1
.
.
.
.
.
Requirements:
- I must extract village names from event$City column and check in "villages" dataframe whether the village name exist. Example: first 3 rows in event$City has village names hidden between statements and i must extract them and cross check in "villages" dataframe
- If no village names found then use the district name in "event" dataframe
- Also, under same district, mulltiple village names start with same name.. partial search requried
- Finally, I will have to find out lat-long geocodes for those villages
Final desired output ("event" must be added with one more column "Village")
event:
District City Village
--------------------------------------------------------------------------------------------
Dst1 Dance program occured in village1 near highway Village1 --> present in df2
Dst1 Regional gatherting atvillage2 Village2 --> present in df2
Dst2 Music showsin village3with famous songs Village3 --> present in df2
Dst3 Sunset is pretty nice in these area Dst3 --> no village name mentioned and not found in df2
I tried below R code..
I tried with grepl and pmatch.. not satisfying result.. can anyone help me to tune my code and get more accurate result
for (i in (1:length(event$City))){
found<-FALSE
if ((!Trim(event$District[i]) == "")&(!Trim(event$District[i]) == "fatehgarh")){
# Pick all the villages from a district
village<-filter(villages, DISTRICT.NAME == Trim(event$District[i]))
# Village column has district name as well. just dropping
data1<-as.data.frame(sapply(village$Area.Name, function(x)
gsub(paste(Trim(event$District[i]), collapse = '|'), '', x)))
data1=data1[(!data1=="")]
# Remove any unwanted special chars from the City sentances
text<-str_replace_all(event$City[i], "[[:punct:]]", "")
len=length(data1[[1]])
print (paste0("i is : ", i))
x=strsplit(text, " ")[[1]]
# Loop through each village
for (j in (1:len)){
y<-paste0(strsplit(as.character(data1[j])," ")[[1]], collapse = " ")
if (!isTRUE(all.equal(y, character(0)))){
bool<-grepl(paste0(x, collapse = "|"), y)
# match<-pmatch(strsplit(y, " ")[[1]], x)
# bool=all(is.na(match))
if (!bool){
event$Village[i]=y
found<-TRUE
count1 = count1 +1
print(paste0("Village - ", y))
break()
}
}
}
if (!found){
event$Village[i]=event$District[i]
count2 = count2 +1
}
}
}
Sample Dataframe:
city=c("Regional priests vising **basavapattana**in karnataka",
"Showman of bollywood visitedhonnali lastweek",
"Reliance has launched jio tower in chennagiri",
"Railway track damaged near sargara",
"Electioncommittee passed strict order to SorabaMLA",
"Rural villages are now provided with WiFi facility"
)
district = c("Davanagere", "Shimoga", "Shimoga", "Shimoga", "Shimoga", "Dharwad")
event = data.frame(city, district)
district=c("Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Davanagere", "Shimoga", "Shimoga", "Shimoga", "Shimoga", "Shimoga")
villages=c("Arasapura", "Attigere", "Avaragolla", "Bada", "Balluru", "Basavanahalu", "**Basavapatna**", "Batlekatte", "Bavihalu", "Belavanur", "Hosanagara", "Sagar", "Shikarpur", "Sorab", "Tirthahalli")
villages = data.frame(district, villages)
Also, if you look closely, the Bold or double star are same, just its a spelling mistake.. but it must chose Basavapatna...
Also, the data size is pretty huge.. need some help in optimization side as well..