I am trying to match fields in two separate datasets. Both are address fields. One dataset may contain something like "532 Sheffield Dr" while the other contains only "Sheffield Dr". Other examples are "US21 Ramp and Hays RD" with "US 21", and "N 25th St and Danville RD" with "25th St". Basically, all the text and numbers in a value from the second dataset should appear in the matching value from the first dataset, even though the first dataset's value might contain extra text or numbers. I have been trying to use RegEx but haven't been able to figure out the appropriate code. How do I go about this?

UninformedUser
Cyclops
  • Welcome to StackOverflow! Please read [How do I ask a good question](https://stackoverflow.com/help/how-to-ask) – Pedro Lobito Apr 20 '17 at 15:52
  • Can you please provide some extra detail? Are you using python lists? numpy arrays? Pandas DataFrames? – billett Apr 20 '17 at 15:53
  • @billett Unfortunately I am very new to coding. I have been searching for solutions to this and I came across 'pyparsing'. I have been trying to use https://regex101.com/ to come up with relevant code. – Cyclops Apr 20 '17 at 15:58
  • @roganjosh Seems like I am doing something similar to what you are doing. Can you elaborate a little more on how you are moving forward? – Cyclops Apr 20 '17 at 15:59
  • Ehm, why is it tagged with "sparql"?! – UninformedUser Apr 20 '17 at 17:54

1 Answer

Based on your examples and my understanding of the question, the easiest way is something like:

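# Full addresses (first dataset) and the partial addresses to look for (second dataset).
s1 = ["532 Sheffield Dr", "US21 Ramp and Hays RD", "N 25th St and Danville RD"]
s2 = ["Sheffield Dr", "US 21", "25th St"]

for item2 in s2:
    for item1 in s1:
        # Match the text as written, or with its spaces removed ("US 21" -> "US21").
        if item2 in item1 or item2.replace(' ', '') in item1:
            print('%s in %s' % (item2, item1))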
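
The check above is case-sensitive and only handles spaces that are missing on the longer side. If your real data is less tidy than the examples, one possible variation is to normalize both strings before comparing. This is only a sketch: the helper names `normalize` and `find_matches` are made up for illustration, and it assumes that ignoring case, spaces, and punctuation is acceptable for your data.

```python
import re


def normalize(address):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r'[^a-z0-9]', '', address.lower())


def find_matches(full_addresses, partial_addresses):
    """Return (partial, full) pairs where the normalized partial address
    is a substring of the normalized full address."""
    matches = []
    for partial in partial_addresses:
        key = normalize(partial)
        if not key:
            # Skip empty values; an empty string would match everything.
            continue
        for full in full_addresses:
            if key in normalize(full):
                matches.append((partial, full))
    return matches


s1 = ["532 Sheffield Dr", "US21 Ramp and Hays RD", "N 25th St and Danville RD"]
s2 = ["sheffield dr", "US 21", "25th St"]

for partial, full in find_matches(s1, s2):
    print('%s in %s' % (partial, full))
```

Keep in mind that plain substring matching, with or without normalization, can give false positives: "1 Main St" would also be found inside "21 Main St". If that matters for your data, you would need a stricter comparison, for example one that works on whole words.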
TitanFighter