
I am trying to match strings in a dataset using the Jaro distance. The problem is that rows where both strings consist only of white space are returned as matches. Here is the data:

df1 <- data.frame(ID1=c("london.inc","USA","UK","ball"," "),ID2=c("london.in","US","UKS","bull"," "), x=c(1:5))
library(stringdist)
df1$jwdist<-stringdist(df1$ID1,df1$ID2,method='jw',useBytes=TRUE,p=0)
y <- subset(df1,df1$jwdist<.2)

         ID1       ID2 x     jwdist
1 london.inc london.in 1 0.03333333
2        USA        US 2 0.11111111
3         UK       UKS 3 0.11111111
4       ball      bull 4 0.16666667
5                      5 0.00000000

Is there a way to exclude the matches that consist only of white space? I am expecting output like this:

         ID1       ID2 x     jwdist
1 london.inc london.in 1 0.03333333
2        USA        US 2 0.11111111
3         UK       UKS 3 0.11111111
4       ball      bull 4 0.16666667
Steven Beaupré
user3570187
  • Can't you just remove these rows before running `stringdist`? For example, `df1 <- subset(df1, rowSums(df1 == " ") < (ncol(df1) - 1L))` – David Arenburg May 27 '15 at 19:28
  • I am merging two different data sets, both of which have many variables (more than 10) containing strings and numbers. I tried running the command, but it picks up all the white spaces in the data frame. I just want to strip white space from one column. – user3570187 May 27 '15 at 19:37
  • So strip the white spaces from that column? – David Arenburg May 27 '15 at 19:38
  • If I strip the white space, the JW distance will still be 0. I need to match the columns to cross-check the data for inconsistencies. Thanks! – user3570187 May 27 '15 at 19:46
  • I don't understand you. Your desired output strips the rows where both values are white space completely. Can you provide an example where my suggestion doesn't work? BTW, you can modify my code to just `df1 <- subset(df1, rowSums(df1[1:2] == " ") < 2)`, for example. – David Arenburg May 27 '15 at 19:53
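
The row filter suggested in the comments above can be sketched as follows, restricted to the two ID columns so that white space in the other variables is ignored. This is only a sketch under a few assumptions: the IDs are stored as character vectors, "white space" means strings that are empty after trimming (so `"   "` or `""` is caught as well as `" "`), and the names `both_blank` and `df1_clean` are illustrative, not from the original post.

```r
# Sample data from the question; stringsAsFactors = FALSE keeps the IDs as character
df1 <- data.frame(
  ID1 = c("london.inc", "USA", "UK", "ball", " "),
  ID2 = c("london.in", "US", "UKS", "bull", " "),
  x   = 1:5,
  stringsAsFactors = FALSE
)

# TRUE where both ID columns are empty once surrounding white space is trimmed
both_blank <- !nzchar(trimws(df1$ID1)) & !nzchar(trimws(df1$ID2))

# Drop only those rows; no other column of the data frame is inspected
df1_clean <- df1[!both_blank, ]
```

The `stringdist()` call and the `subset()` on `jwdist` from the question should then be applicable to `df1_clean` unchanged, so the blank pair never reaches the distance calculation.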
