
The data set I am working on (4k people) has body heights entered like 1.70, 170 and 1,70. How can I make them uniform so that I can calculate body mass index?

  • Assuming you have data like `heights = ['1.70', '170', '1,70']` and just want a uniform notation (e.g. in cm); you could just do `uniform_heights = [x.replace(',', '').replace('.', '') for x in heights]`. This will result in `uniform_heights = ['170', '170', '170']`. – trotta Jul 12 '19 at 15:38
  • Please first make clear if these are strings or numbers in an excel file. – ilias iliadis Jul 12 '19 at 16:01

1 Answer


Data cleaning is an art.

Of course, how to do it depends on what format your data is in.

If you have the data in Python already, say as one record (or an array of records you can iterate over), where each record is an array or dict of fields, then pick out the field you want to fix, fix it, and write the data back out in whatever format it originally came from.

I'm guessing from your example that you want to interpret "1.70" as meters, "170" as centimeters, and "1,70" as a typo that you'll take as 1.70 meters.

I'm also assuming that you want to make them all uniform as meters (you could just as well use any other unit, of course).

Let's say you've got the "height" field in a variable named `h`. Presumably it's a string, or else you couldn't have "1,70". You could do:

import re

h = re.sub(",", ".", h)  # You could do other fixes, too.
h = float(h)
if h > 100:  # Very few people are over 100m tall...
    h = h / 100.0

Of course, if these are building heights instead of person heights, you might want to adjust...
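Putting the steps above together, here is a minimal sketch. The function name `clean_height_m`, the sample weight, and the sample data are my own illustrative assumptions, not from the question:

```python
import re

def clean_height_m(h):
    """Normalize a height string to meters (hypothetical helper)."""
    h = re.sub(",", ".", h)   # treat a comma as a decimal point
    h = float(h)
    if h > 100:               # values over 100 are assumed to be centimeters
        h = h / 100.0
    return h

# The three formats from the question all normalize to the same value:
heights = ["1.70", "170", "1,70"]
print([clean_height_m(h) for h in heights])  # [1.7, 1.7, 1.7]

# BMI = weight (kg) / height (m) squared
weight_kg = 68.0
bmi = weight_kg / clean_height_m("170") ** 2
print(round(bmi, 1))  # 23.5
```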

A very important step if you care much about the data, is to really scan it with your own eyes to look for any other anomalies. There almost always are many more than you expect. Generating lists of all the unique values for each field, and how often each occurs, is a good way to do this quickly, because errors will usually be toward the low-frequency end of the list.
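One quick way to build such a frequency list is `collections.Counter`. The raw values below, including the deliberate letter-O typo, are made up for illustration:

```python
from collections import Counter

# Hypothetical raw column of height strings, with one anomaly slipped in:
raw_heights = ["1.70", "170", "1,70", "170", "1.65", "17O"]  # "17O" has a letter O

counts = Counter(raw_heights)

# Least-common values first -- data-entry errors tend to cluster here:
for value, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{value!r}: {n}")
```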

TextGeek