
I'm attempting to scrape and clean Wikipedia data. One data field contains dimensions in a variety of formats, as shown below.

["112 x 76 yards (102.4m x 69.4m)", "104.5 x 70.3 m", "107m x 72m", 
 "109×73 yds / 100×67 m", "{{convert|105|x|68|m|yd|1}}", "100 metres by 70 metres"]

Extracting the numbers is easy enough, but extracting the unit is difficult given how many variations of entries there are. What is the best way to approach this?

I have started with this pattern:

"(\d+\.?\d*)"

This should extract all the numbers. I was then going to keep only the first two numerical matches and the first match of a unit ('m', 'metre', 'metres', 'y', 'yard', 'yds', 'yd', 'ft', ...), so that I can convert everything to metres later.

I am just unsure how to capture the first unit match.
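
For illustration, the plan described above could be sketched roughly as follows. This is only a sketch under assumptions: the names `NUMBER`, `UNITS`, `UNIT`, `TO_METRES` and `parse_dimensions` are hypothetical, the unit list is incomplete, and the first unit found is assumed to apply to the first two numbers (which holds for the sample strings, including the `{{convert|...}}` template, but may not hold for every entry).

```python
import re

# Integers and decimals such as 112, 104.5, 69.4
NUMBER = re.compile(r"\d+(?:\.\d+)?")

# Unit spellings mapped to a canonical name (assumed, incomplete list).
UNITS = {
    "metres": "m", "meters": "m", "metre": "m", "meter": "m", "m": "m",
    "yards": "yd", "yard": "yd", "yds": "yd", "yd": "yd", "y": "yd",
    "feet": "ft", "foot": "ft", "ft": "ft",
}

# Longest spellings are tried first. The lookbehind and lookahead stop a unit
# from matching inside another word, e.g. the "y" in "by" or the "m" in "metres".
UNIT = re.compile(
    r"(?<![A-Za-z])("
    + "|".join(sorted(UNITS, key=len, reverse=True))
    + r")(?![A-Za-z])",
    re.IGNORECASE,
)

# Metres per unit (the yard and foot values are exact by definition).
TO_METRES = {"m": 1.0, "yd": 0.9144, "ft": 0.3048}


def parse_dimensions(text):
    """Return the first two numbers converted to metres, or None if unparseable."""
    numbers = NUMBER.findall(text)
    unit = UNIT.search(text)  # search() returns only the first unit match
    if len(numbers) < 2 or unit is None:
        return None
    factor = TO_METRES[UNITS[unit.group(1).lower()]]
    return tuple(round(float(n) * factor, 1) for n in numbers[:2])


samples = [
    "112 x 76 yards (102.4m x 69.4m)", "104.5 x 70.3 m", "107m x 72m",
    "109×73 yds / 100×67 m", "{{convert|105|x|68|m|yd|1}}",
    "100 metres by 70 metres",
]
for s in samples:
    print(s, "->", parse_dimensions(s))
```

The key point is that `re.search` stops at the first match, so there is no need to collect every unit and discard the rest. Entries with no recognisable unit come back as `None` and can be inspected by hand.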

MC101
  • 1
    You could supply an array of the unit strings you want to match and match against it, then hash by index with the dimensions. – Lane Terry Jun 10 '18 at 20:11
  • 1
    possibly helpful or related [Regular expression extracting number dimension](https://stackoverflow.com/questions/44555769/regular-expression-extracting-number-dimension) – chickity china chinese chicken Jun 10 '18 at 20:22
  • [Related](https://pint.readthedocs.io/en/latest/index.html). Also, if parsing is ambiguous, you can use the implied conversion rate to hopefully rule out some options. – hilberts_drinking_problem Jun 10 '18 at 20:40
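
The implied-conversion-rate idea from the last comment could look something like the sketch below. It assumes an entry contains two pairs of numbers (as in "112 x 76 yards (102.4m x 69.4m)") and compares the ratio of the first number in each pair against known factors. The names `infer_pair_units` and `METRES_PER` and the 2% tolerance are hypothetical choices, not part of any library.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

# Metres per unit (exact by definition).
METRES_PER = {"yd": 0.9144, "ft": 0.3048}


def infer_pair_units(text, tolerance=0.02):
    """Guess (first_unit, second_unit) from the ratio between two number pairs.

    Returns None when there are fewer than four numbers or no factor fits.
    """
    nums = [float(n) for n in NUMBER.findall(text)]
    if len(nums) < 4 or nums[0] == 0:
        return None
    ratio = nums[2] / nums[0]  # second pair's length over first pair's length
    for unit, factor in METRES_PER.items():
        if abs(ratio - factor) / factor < tolerance:
            return (unit, "m")   # imperial first, metres second
        if abs(ratio - 1 / factor) * factor < tolerance:
            return ("m", unit)   # metres first, imperial second
    return None


print(infer_pair_units("112 x 76 yards (102.4m x 69.4m)"))
print(infer_pair_units("109×73 yds / 100×67 m"))
```

This only helps for entries that state the size twice, so it is better suited as a cross-check on the unit found by text matching than as a replacement for it.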
