4

I have a dataframe consisting of a column of strings. I want to extract the numerical numbers of these strings. However, some of the values are in metres, and some in kilometres. How do i detect that there is a "m" or "km" beside the number, standardize the units then extract the numbers to a new column?

details                 numbers
Distance                350m
Longest straight        860m
Top speed               305km
Full throttle           61 per cent

Desired output:

details                 numbers
Distance                350
Longest straight        860
Top speed               305000
Full throttle           61
doyz
  • 887
  • 2
  • 18
  • 43

1 Answers1

9

Use:

m = df['numbers'].str.contains('\d+km')
df['numbers'] = df['numbers'].str.extract('(\d+)', expand=False).astype(int)
df.loc[m, 'numbers'] *= 1000 

print (df)
            details  numbers
0          Distance      350
1  Longest straight      860
2         Top speed   305000
3     Full throttle       61

Explanation:

  1. Get mask for km values by contains
  2. Extract integer values and cast to int by extract
  3. Correct km values by multiple

EDIT: For extract floats values change regex in extract by this solution, also last cast to floats:

print (df)
            details      numbers
0          Distance        1.7km
1  Longest straight       860.8m
2         Top speed        305km
3     Full throttle  61 per cent

m =  df['numbers'].str.contains('\d+km')
df['numbers'] = df['numbers'].str.extract('(\d*\.\d+|\d+)', expand=False).astype(float)
df.loc[m, 'numbers'] *= 1000 
print (df)
            details   numbers
0          Distance    1700.0
1  Longest straight     860.8
2         Top speed  305000.0
3     Full throttle      61.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hey @jezrael, thanks! What if the km value contains a decimal? eg(1.7 km)? How do i change this regex expression: '\d+km'? To "\d+(?:\.\d+)+km"? – doyz Apr 07 '18 at 05:58
  • @doyz - Glad can help! Be free upvote solution too. Thanks. :) – jezrael Apr 07 '18 at 06:27
  • This did not work for my columns which are strings that look like: ` Life-Stage Group Arsenic Boron (mg/d) Calcium (mg/d) Chromium Copper (μg/d) \ 0 <= 3.0 y nan g 3 mg 2500 mg nan g 1000 μg 1 <= 8.0 y nan g 6 mg 2500 mg nan g 3000 μg – mLstudent33 Oct 16 '20 at 00:22