I have a large CSV file, and I need to take one row of data at a time and score it against a model. I have tried the code below but get the error "X has 120839 features per sample; expecting 30". I can run the model against the entire dataset and it makes predictions on each row, but I need to do it one line at a time. Thank you.

loaded_model = joblib.load('LR_model.sav')
with open(r'fordTestA.csv', "r") as f:

for line in f:
    line = f.readlines()[1:]  ##minus headers
    result = loaded_model.predict(line)

In this scenario, it doesn't seem to split the lines, as there is a \n after each row. I tried to add

line = line.rstrip('\n')

This gives the error " 'list' object has no attribute 'rstrip'". Thanks in advance for any feedback.

Bhanuchander Udhayakumar
user3046660
  • Use pandas to read the file and then run the model on the dataframe. – yasin mohammed Dec 08 '17 at 12:08
  • You are doing `for line in f` so `line` is one line from your file on each iteration of the loop. But in the first line of the loop you do `line = f.readlines()[1:]` which results in `line` being a list of all lines except for the first one. I am guessing you wanted to do something like: `for line in f.readlines()[1:]`? Also everything after the `with` line must be indented – FlyingTeller Dec 08 '17 at 12:10
  • Does the argument in `predict()` need to be a string separated by `','`? Or can it be a list? – pstatix Dec 08 '17 at 12:11
  • with open('filename.csv', 'r') as f: for row in f: column1, column2, etc.. = row.split(',') – Forcetti Dec 08 '17 at 12:14
  • Hi, thanks for the comments. @yasin, I have also tried `DATA_SET_PATH = (r'fordTestA.csv')`, `dataset = pd.read_csv(DATA_SET_PATH)`, `for line in datset:`, but this gives an error too. FlyingTeller, thank you, that does the trick, but the model is treating this as one feature. Forcetti, thank you for the comment. Each row contains 30 features, which the model is expecting. – user3046660 Dec 08 '17 at 12:21
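
To make the point in the comments above concrete, here is a small sketch of what the loop in the question actually does, next to a version that splits each row into numbers. It uses an in-memory three-column sample instead of fordTestA.csv, and the names `SAMPLE`, `original_pattern` and `split_rows` are made up for illustration only.

```python
import io

# Hypothetical stand-in for the CSV file: a header plus three data rows.
SAMPLE = "f1,f2,f3\n1,2,3\n4,5,6\n7,8,9\n"


def original_pattern(f):
    """Mimic the loop from the question and record what `line` ends up as."""
    seen = []
    for line in f:
        # The first iteration reads the header, then readlines() consumes the
        # rest of the file; [1:] additionally drops the first data row.
        line = f.readlines()[1:]
        seen.append(line)
    return seen


def split_rows(f):
    """Skip the header, then turn each remaining line into a list of floats."""
    next(f)  # skip the header line
    return [[float(v) for v in line.rstrip("\n").split(",")] for line in f]


broken = original_pattern(io.StringIO(SAMPLE))
fixed = split_rows(io.StringIO(SAMPLE))
print(broken)  # one iteration only, holding a list of raw text lines
print(fixed)   # one list of floats per data row
```

The loop body runs only once, and `line` becomes a list of raw strings, which is why `rstrip` is not available on it and why the model sees far more "features" than 30.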

1 Answer

I'm not familiar with joblib or predict(), but:

import csv

# other code

with open(r'fordTestA.csv', 'r', newline='') as f:
    rows = csv.reader(f, delimiter=',')
    _ = next(rows) # skip headers
    for row in rows:
        line = list(map(float, row)) # convert row of str to row of float
        # wrap the single row in a list so predict() receives a list of samples
        results = loaded_model.predict([line])
        # or if you need a ',' delimited string
        line = ','.join(row)
        results = loaded_model.predict(line)
pstatix
  • 3,611
  • 4
  • 18
  • 40
  • Hi, thanks for feedback, that splits it ok but i get an error in relation to the predict function here, and the shape of the array – user3046660 Dec 08 '17 at 12:33
  • @user3046660 Again, I am not familiar with `predict()`. Can you please edit your question to show me the error? – pstatix Dec 08 '17 at 12:40
  • Hi, yea, here are the errors. "builtins.TypeError: Cannot cast array data from dtype('float64') to dtype(' – user3046660 Dec 08 '17 at 12:53
  • hi @pstatix. When i print row, each feature is in quotes. is it possible to delete the quotes as this may be the issue – user3046660 Dec 08 '17 at 13:52
  • @user3046660 That's because a CSV reader yields each row as a list of string elements. Again, because I am unfamiliar with the `predict()` method, I can update my post and keep trying to help you. However, what I have shown is how to iterate over a CSV row by row. – pstatix Dec 08 '17 at 13:54
  • @user3046660 Furthermore, my update to convert the row to a row of `float` from `str` assumes each element in the row **can** be converted to `float`. – pstatix Dec 08 '17 at 13:56
  • This is a row from the csv 39.7485,10.7769,1224,49.0196,0.073947,624,96.1538,0,0,0,0,0,0.015875,366,0,0,1,70,2,0.77,-0.455,752,37.4937,0,646,0,0,0,1,11.5886. unfortunately, this is the new error "builtins.TypeError: float() argument must be a string or a number, not 'list' " – user3046660 Dec 08 '17 at 14:00
  • @user3046660 I made an error, please see the updated conversion line. Please let me know if you get the same error. – pstatix Dec 08 '17 at 14:11
  • Thank you for your help. That worked. I got an error about the shape of the array but could fix it by adding line=([line]); for some reason this works. Thanks again – user3046660 Dec 08 '17 at 14:35
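
A likely reason `line=([line])` works: the error message in the question suggests the model expects a two-dimensional input, that is, a list of samples where each sample is a list of features. A single row therefore has to be wrapped in an outer list. The sketch below illustrates this under that assumption. `StandInModel` is a hypothetical class that only mimics the shape check and is not the real loaded model, and `score_rows` and `SAMPLE` are likewise made-up names.

```python
import csv
import io


class StandInModel:
    """Hypothetical stand-in for the loaded model; it only mimics the shape check."""

    def __init__(self, n_features):
        self.n_features = n_features

    def predict(self, X):
        # X must be a list of samples, each sample a list of n_features values.
        for sample in X:
            if not isinstance(sample, (list, tuple)):
                raise ValueError("expected a 2D input: a list of samples")
            if len(sample) != self.n_features:
                raise ValueError(
                    f"X has {len(sample)} features per sample; "
                    f"expecting {self.n_features}"
                )
        # Dummy rule, purely for illustration: one prediction per sample.
        return [1 if sum(sample) > 0 else 0 for sample in X]


def score_rows(f, model):
    """Score a CSV one row at a time, wrapping each row as a one-sample batch."""
    reader = csv.reader(f)
    next(reader)  # skip the header row
    predictions = []
    for row in reader:
        features = [float(value) for value in row]
        # [features] is a batch containing exactly one sample.
        predictions.append(model.predict([features])[0])
    return predictions


SAMPLE = "f1,f2,f3\n1,2,3\n-1,-2,-3\n"
predictions = score_rows(io.StringIO(SAMPLE), StandInModel(3))
print(predictions)
```

With the real model, the same pattern would be `loaded_model.predict([line])`, with the single prediction at index 0 of the result.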