import numpy as np
import csv
filename = "a.csv"

def convert(s): 
    s = s.strip().replace(',', '.')
    return str(s)

salary_data = np.genfromtxt(filename,
                     delimiter= ',',
                     dtype=[('year','i8'),('university','U50'),('school','U250'), 
                     ('degree','U250'),('employement_rate_overall','f8'), 
                     ('basic_monthly_mean','f8'),('gross_monthly_mean','i8'), 
                     ('gross_monthly_median','i8'),('gross_mthly_25_percentile','i8'), 
                     ('gross_mthly_75_percentile','i8')], 
                     encoding= None, #avoid having the deprecated warning
                     skip_header=1,
                     missing_values=['na','-'],filling_values=[0],
                     converters={2: convert} ,
                     comments=None)
print(salary_data)

I am trying to load CSV data, but the data is quite dirty: some of the value fields contain quotation marks and commas, which causes an error.

      Some errors were detected!
      Line #5 (got 13 columns instead of 12) 

I tried to clean up the commas by using converters, but the code doesn't seem to work. I also tried:

      converters={2: lambda s: str(s.replace(',', '.'))}

This does not work for my case either. I would like to know what my mistake is, and thanks for helping! Thank you to those who spotted my mistake. Even when I tried to replace the quotation marks, the code still did not work. The text below is the CSV file that I am loading.

      year,university,school,degree,employment_rate_overall,employment_rate_ft_perm,basic_monthly_mean,basic_monthly_median,gross_monthly_mean,gross_monthly_median,gross_mthly_25_percentile,gross_mthly_75_percentile
     2013,Nanyang Technological University,College of Business (Nanyang Business School),Accountancy and Business,97.4,96.1,3701,3200,3727,3350,2900,4000
     2013,Nanyang Technological University,College of Business (Nanyang Business 
     School),Accountancy (3-yr direct Honours Programme),97.1,95.7,2850,2700,2938,2700,2700,2900
     2013,Nanyang Technological University,College of Business (Nanyang Business 
     School),Business (3-yr direct Honours Programme),90.9,85.7,3053,3000,3214,3000,2700,3500
     2013,Nanyang Technological University,"College of Humanities, Arts & Social 
     Sciences",Economics,89.9,83.5,3085,3000,3148,3000,2800,3545
     2013,Nanyang Technological University,College of Sciences,Biomedical Sciences 
     **,na,na,na,na,na,na,na,na
     2013,Nanyang Technological University,College of Sciences,Biomedical Sciences 
     (Traditional Chinese Medicine) #,90.7,88.4,2840,2800,2883,2807,2700,3000
     2013,Nanyang Technological University,College of Sciences,Mathematics & Economics 
     **,na,na,na,na,na,na,na,na
     2014,Nanyang Technological University,"College of Humanities, Arts & Social 
     Sciences","Art, Design & Media",80,68,2761,2600,2791,2700,2300,3000
LLynn
  • Is it true that your original .csv file contains those more or less random line breaks, which split actual single items across two lines? For example, this seems to be the case at the end of `2013,Nanyang Technological University,"College of Humanities,` – fischmalte Nov 30 '20 at 16:54
  • Yes, definitely. Is there any way to solve it within the genfromtxt function? – LLynn Nov 30 '20 at 22:35

1 Answer


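I imported your file as a .csv and, as @fischmalte pointed out, there are line breaks inside some fields, for instance in "Nanyang Business School".

However, this is not what causes your error.
The error Line #5 (got 13 columns instead of 12) is caused by the quoted field "College of Humanities, Arts & Social Sciences".

genfromtxt does not treat quotation marks specially, so the comma inside the quotes is counted as a delimiter and the row ends up with one column too many. A converter cannot help here, because it only runs on a field after the line has already been split.
Remove the comma inside the quotes and this error will disappear.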
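
To make the column count concrete, here is a minimal sketch that compares a plain split on commas with Python's standard `csv` module, which does understand quoting. The row is taken from the question, with its line break already joined by hand (an assumption made purely for illustration).

```python
import csv
import io

# One row from the question, with the stray line break joined manually
# (assumption: the two physical lines belong to a single record).
line = (
    '2013,Nanyang Technological University,'
    '"College of Humanities, Arts & Social Sciences",'
    'Economics,89.9,83.5,3085,3000,3148,3000,2800,3545'
)

# Splitting on every comma ignores the quotes, much like a plain delimiter split.
naive_fields = line.split(',')

# csv.reader keeps the quoted field, including its comma, as one value.
quoted_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields))   # 13
print(len(quoted_fields))  # 12
print(quoted_fields[2])    # College of Humanities, Arts & Social Sciences
```

The 13 versus 12 matches the numbers in the error message.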

Also, if you use pandas, the " will be handled automatically:

import pandas as pd
# read_csv parses the file; quoted fields are handled by default
df = pd.read_csv("my_file.csv")

(It will not take care of the stray line breaks, though.)
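
One possible way to deal with the line breaks is to glue physical lines together until a record has the expected number of fields. The sketch below is only that, a sketch: `repair_rows` is a hypothetical helper, the sample lines are copied from the question, and it assumes every record has exactly 12 fields and that a break never falls inside the last field.

```python
import csv

EXPECTED_FIELDS = 12  # number of columns in the header row of the question's file


def repair_rows(lines, expected=EXPECTED_FIELDS):
    """Hypothetical helper: join physical lines until a record has `expected` fields."""
    buffer = ""
    for raw in lines:
        piece = raw.strip()
        if not piece:
            continue  # skip blank lines
        # Re-join the broken pieces with a single space.
        buffer = f"{buffer} {piece}" if buffer else piece
        fields = next(csv.reader([buffer]))
        if len(fields) >= expected:
            yield fields
            buffer = ""


# A few physical lines from the question, including the broken ones.
sample = [
    "year,university,school,degree,employment_rate_overall,"
    "employment_rate_ft_perm,basic_monthly_mean,basic_monthly_median,"
    "gross_monthly_mean,gross_monthly_median,gross_mthly_25_percentile,"
    "gross_mthly_75_percentile",
    "2013,Nanyang Technological University,College of Business (Nanyang Business ",
    "     School),Accountancy (3-yr direct Honours Programme),"
    "97.1,95.7,2850,2700,2938,2700,2700,2900",
    '2013,Nanyang Technological University,"College of Humanities, Arts & Social ',
    '     Sciences",Economics,89.9,83.5,3085,3000,3148,3000,2800,3545',
    "2013,Nanyang Technological University,College of Sciences,Biomedical Sciences ",
    "     **,na,na,na,na,na,na,na,na",
]

rows = list(repair_rows(sample))
for row in rows:
    print(len(row), row[2])
```

The repaired rows could then be written back out with `csv.writer` and loaded with `pd.read_csv`, passing `na_values=["na"]` so the placeholder entries become missing values.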

pguardati