I'm currently doing a university project in which I need to evaluate a dataset from Kaggle.

My problem is pretty simple, but I couldn't figure it out by researching: how can I check in Python whether the salary is higher or lower than 50K? The problem is in the line with the if statement. It always gives me this error: IndexError: string index out of range

Thank you very much for helping me out!

import csv

with open('C:/Users/jkhjkh/Google Drive/Big data/adult.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')

y = 0
z = 0

ages = []
maritalstatuss = []
races = []
sexes = []
hoursperweeks = []
incomes = []

for row in readCSV:         # 4th row extracts
    age = row[0]            # '54'
    maritalstatus = row[5]  # 'Divorced'
    race = row[8]           # 'White'
    sex = row[9]            # 'Female'
    hoursperweek = row[12]  # '40'
    income = row[14]        # '<=50K'

    ages.append(age)
    maritalstatuss.append(maritalstatus)
    races.append(race)
    sexes.append(sex)
    hoursperweeks.append(hoursperweek)
    incomes.append(hoursperweek)

print(len(ages))

for x in range(1,len(ages)): 
    if ages[x] > '40' and ages[x] < '66' and income[x] < '50K':
        y = y + 1

print(y)
cchamberlain
Andy89

2 Answers

I believe the mistake is that you are comparing strings, although you intend to compare age and income as numbers.

if (ages[x] > 40 and ages[x] < 66) and income[x] < 50000:

Make sure the values you compare numerically are numbers rather than strings; convert them (for example with int()) when you read the rows. Let me know if this works.
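A minimal sketch of that conversion, using hypothetical sample values in place of the CSV columns. It assumes, as the row comments in the question suggest, that the income column holds labels such as '<=50K' rather than numbers, so only the age is converted and the label is compared as text:

```python
# Hypothetical sample values standing in for columns read from the CSV
ages = ['54', '38', '9', '41']
incomes = ['<=50K', '>50K', '<=50K', '<=50K']

# Convert the age strings to integers before comparing them
ages_as_int = [int(a) for a in ages]

count = 0
for age, income in zip(ages_as_int, incomes):
    # Numeric range check on age, exact text match on the income label
    if 40 < age < 66 and income.strip() == '<=50K':
        count += 1

print(count)  # 2
```

Iterating with zip avoids manual indexing, which also sidesteps mixing up `income` and `incomes`.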

vsr
  • Thank you very much for your answer. I found the problem by coincidence: the variable should have been "incomes[x]" instead of "income[x]". Then it works :) – Andy89 Nov 13 '16 at 10:09
  • OK. Missed that. However, does the string comparison in the if statement work correctly after changing it to "incomes[x]"? – vsr Nov 13 '16 at 10:14

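Lists in Python are 0-indexed, so the valid indices of the ages list run from 0 to len(ages) - 1, and range(1, len(ages)) stays within them. The IndexError comes from income[x]: income is the single string left over from the last row (such as '<=50K'), so indexing it with x fails as soon as x reaches the length of that string. Index the incomes list instead, and note that incomes.append(hoursperweek) stores the wrong column.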
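The message "string index out of range" from the question is what Python raises when a str, not a list, is indexed past its end. A small sketch with hypothetical values shows the difference between indexing the leftover string `income` and the list `incomes`:

```python
income = '<=50K'            # a single string, 5 characters long
incomes = ['<=50K'] * 10    # hypothetical list with one label per row

try:
    income[7]               # index 7 does not exist in a 5-character string
except IndexError as exc:
    message = str(exc)

print(message)              # string index out of range
print(incomes[7])           # <=50K
```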

import csv

y = 0

ages = []
maritalstatuss = []
races = []
sexes = []
hoursperweeks = []
incomes = []

with open('C:/Users/jkhjkh/Google Drive/Big data/adult.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    next(readCSV)  # skip the header row (assumed present, since the question's loop starts at index 1)

    # Read every row while the file is still open
    for row in readCSV:
        age = int(row[0])             # numeric
        maritalstatus = row[5]        # text such as 'Divorced', so no int()
        race = row[8]
        sex = row[9]
        hoursperweek = int(row[12])   # numeric
        income = row[14].strip()      # label such as '<=50K'

        ages.append(age)
        maritalstatuss.append(maritalstatus)
        races.append(race)
        sexes.append(sex)
        hoursperweeks.append(hoursperweek)
        incomes.append(income)        # store the income, not hoursperweek

print(len(ages))

# The header was skipped above, so every index from 0 is a data row
for x in range(len(ages)):
    if 40 < ages[x] < 66 and incomes[x] == '<=50K':
        y = y + 1

print(y)

Besides adjusting the loop index range, numeric columns such as age are now read as int. Comparing numbers as str gives different results from comparing them as int (e.g. '3' < '10' is False, but 3 < 10 is True).
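The difference can be checked directly. In this short sketch the sample values are hypothetical; strings are compared character by character, so '3' sorts after '10':

```python
values = ['3', '10', '25']   # hypothetical numbers stored as text

text_result = '3' < '10'     # compares '3' with '1' first
int_result = 3 < 10

as_text = sorted(values)
as_numbers = sorted(int(v) for v in values)

print(text_result)   # False
print(int_result)    # True
print(as_text)       # ['10', '25', '3']
print(as_numbers)    # [3, 10, 25]
```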

cocoatomo