0

I have two dataframes with two columns each

  1. event: column names --> "District" "City"
  2. villages: column names --> "District" "Village"

"event" dataframe has district names and events happening in its district villages. The "City" column is a sentance data which has village names of corresponding district hidden between sentances.

data in event:
    District    City
    Dst1        Dance program occured in village1 near highway
    Dst1        Regional gatherting atvillage2  --> note: "at" and "village2" typo error partial search required !
    Dst2        Music showsin village3with famous songs --> note: i am not concerned about "showsin" but "village1with" has village3 name
    Dst3        Sunset is pretty nice in these area
    .
    .
    .
    .
    .

"villages" dataframe has district names and complete villages list under each district

data in villages:
    District    Village
    Dst1        Village1
    Dst1        Village2
    Dst2        Village1
    .
    .
    .
    .
    .

Requirements:

  1. I must extract village names from event$City column and check in "villages" dataframe whether the village name exist. Example: first 3 rows in event$City has village names hidden between statements and i must extract them and cross check in "villages" dataframe
  2. If no village names found then use the district name in "event" dataframe
  3. Also, under same district, mulltiple village names start with same name.. partial search requried
  4. Finally, I will have to find out lat-long geocodes for those villages

Final desired output ("event" must be added with one more column "Village")

event:
    District    City                            Village
--------------------------------------------------------------------------------------------
    Dst1        Dance program occured in village1 near highway      Village1 --> present in df2
    Dst1        Regional gatherting atvillage2              Village2 --> present in df2
    Dst2        Music showsin village3with famous songs         Village3  --> present in df2
    Dst3        Sunset is pretty nice in these area         Dst3 --> no village name mentioned and not found in df2

I tried below R code..

I tried with grepl and pmatch.. not satisfying result.. can anyone help me to tune my code and get more accurate result

for (i in (1:length(event$City))){
  found<-FALSE
  if ((!Trim(event$District[i]) == "")&(!Trim(event$District[i]) == "fatehgarh")){
      # Pick all the villages from a district
      village<-filter(villages, DISTRICT.NAME == Trim(event$District[i]))
      # Village column has district name as well. just dropping
      data1<-as.data.frame(sapply(village$Area.Name, function(x)
                           gsub(paste(Trim(event$District[i]), collapse = '|'), '', x)))
      data1=data1[(!data1=="")]

      # Remove any unwanted special chars from the City sentances
      text<-str_replace_all(event$City[i], "[[:punct:]]", "")
      len=length(data1[[1]])
      print (paste0("i is : ", i))
      x=strsplit(text, " ")[[1]]

      # Loop through each village  
      for (j in (1:len)){
        y<-paste0(strsplit(as.character(data1[j])," ")[[1]], collapse = " ")
        if (!isTRUE(all.equal(y, character(0)))){
         bool<-grepl(paste0(x, collapse = "|"), y)
          # match<-pmatch(strsplit(y, " ")[[1]], x)
          # bool=all(is.na(match))
            if (!bool){
              event$Village[i]=y
              found<-TRUE
              count1 = count1 +1
              print(paste0("Village - ", y))
              break()
          }
        }  
      }
      if (!found){
        event$Village[i]=event$District[i]
        count2 = count2 +1
      }
  }
}  

Sample Dataframe:

city=c("Regional priests vising **basavapattana**in karnataka",
       "Showman of bollywood visitedhonnali lastweek",
       "Reliance has launched jio tower in chennagiri",
       "Railway track damaged near sargara",
       "Electioncommittee passed strict order to SorabaMLA",
       "Rural villages are now provided with WiFi facility"
       )

district = c("Davanagere",  "Shimoga",  "Shimoga",  "Shimoga",  "Shimoga",  "Dharwad")
event = data.frame(city, district)

district=c("Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Davanagere",  "Shimoga",  "Shimoga",  "Shimoga",  "Shimoga",  "Shimoga")
villages=c("Arasapura",  "Attigere",  "Avaragolla",  "Bada",  "Balluru",  "Basavanahalu",  "**Basavapatna**",  "Batlekatte",  "Bavihalu",  "Belavanur",  "Hosanagara",  "Sagar",  "Shikarpur",  "Sorab",  "Tirthahalli")

villages = data.frame(district, villages)

Also, if you look closely, the Bold or double star are same, just its a spelling mistake.. but it must chose Basavapatna...

Also, the data size is pretty huge.. need some help in optimization side as well..

MBaas
  • 7,248
  • 6
  • 44
  • 61
Adarsha Murthy
  • 145
  • 3
  • 13

0 Answers0