Converting CSV to ARFF

Question

I am working on a school data mining project where we were given CSV data from Kaggle. This is how the data looks (2 lines out of 6970):

4,1970,Female,150,DomesticPartnersKids,Bachelor's Degree,Democrat,,Yes,No,No,No,Yes,Public,No,Yes,No,Yes,No,No,Yes,Science,Study first,Yes,Yes,No,No,Receiving,No,No,Pragmatist,No,No,Cool headed,Standard hours,No,Happy,Yes,Yes,Yes,No,A.M.,No,End,Yes,No,Me,Yes,Yes,No,Yes,No,Mysterious,No,No,,,,,,,,,,Mac,Yes,Cautious,No,Umm...,No,Space,Yes,In-person,No,Yes,Yes,No,Yay people!,Yes,Yes,Yes,Yes,Yes,No,Yes,,,,,,,,,,,,,,,,,No,No,No,Only-child,Yes,No,No
5,1997,Male,75,Single,High School Diploma,Republican,,Yes,Yes,No,,Yes,Private,No,No,No,Yes,No,No,Yes,Science,Study first,,Yes,No,Yes,Receiving,No,Yes,Pragmatist,No,Yes,Cool headed,Odd hours,No,Right,Yes,No,No,Yes,A.M.,Yes,Start,Yes,Yes,Circumstances,No,Yes,No,Yes,Yes,Mysterious,No,No,Tunes,Technology,Yes,Yes,Yes,Yes,No,Supportive,No,PC,No,Cautious,No,Umm...,No,Space,No,In-person,No,No,Yes,Yes,Grrr people,Yes,No,No,No,No,No,No,Yes,No,No,Yes,No,Own,Pessimist,Mom,No,No,No,No,Nope,Yes,No,No,No,Yes,No,Yes,No,Yes,No

We have to get this into .arff format for use in Weka. I manually typed the header (107 attributes):

@ATTRIBUTE  user_id  NUMERIC
@ATTRIBUTE  yob      NUMERIC
@ATTRIBUTE  gender   {Male,Female}
@ATTRIBUTE  income   {150,100,75,50,25,10}
@ATTRIBUTE  householdstatus {MarriedKids,Married,DomesticPartnersKids,DomesticPartners,Single,SingleKids}
@ATTRIBUTE  educationlevel {Bachelor's Degree,High School Diploma,Current K-12,Current Undergraduate,Master's Degree,Associate's Degree,Doctoral Degree}
@ATTRIBUTE  party {Democrat,Republican}
@ATTRIBUTE  Q124742 {Yes,No}
@ATTRIBUTE  Q124122 {Yes,No}

and I get this error:

} expected at end of enumeration read token eol

Then I tried to use the Weka converter, but it gave me this error:

Wrong number of values.Read 2,expected 1,read Token[EOL],line 4 Problem encountered at line:3

What Kaggle project? I'll give it a try if I can get the data file. – zbicyclist, Jun 22 '17 at 00:38
[link](https://inclass.kaggle.com/c/can-we-predict-voting-outcomes) Thank you for your response. – candy, Jun 23 '17 at 12:34

zbicyclist · Answer 1 · 2017-06-24T03:26:56.477

Here's what I did: From Kaggle, I downloaded train.csv (5568 instances, highest ID number 6960).

I didn't use the converter -- just loaded it into the Weka Explorer as a CSV file. Some problems and their solutions:

Line 3: First instance of "Bachelor's Degree". It did NOT like that single quote ("line 3, read 7, expected 108"). Got rid of all single quotes (using a global replace in a text editor). Then I tried to load it into Weka again.
The file doesn't have a CR (the Enter key on the keyboard) at the end of the last line, which caused an error ("null on line 5569"). I added one, again in a text editor. Then I loaded it into Weka, and took a look at the variables.
YOB (Year of Birth) is missing for about 300 instances, with "NA" filled in, so it didn't evaluate as either string or numeric. I edited these to be empty cells instead. Then I loaded it into Weka.
And, of course, moved Party to be the class variable (at the end). I did this in Weka.
Saved this as train.arff
Loaded it back in, and it seems to work OK. I generated 51% accuracy with a OneR classifier, but you wouldn't expect a OneR classifier to work well here. I'm sure you can do better.
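
For anyone who would rather script the text-editor steps above than do them by hand, here is a minimal Python sketch. It is not what I actually ran. The function name clean_csv_text is made up for illustration, and the column names "YOB" and "Party" are assumptions about the CSV header, so check them against your file. It removes single quotes, turns "NA" in the year-of-birth column into an empty cell, moves the class column to the end, and ends every line (including the last) with a newline.

```python
import csv
import io


def clean_csv_text(text, na_column="YOB", class_column="Party"):
    """Return cleaned CSV text (hypothetical helper; column names are assumptions)."""
    # Drop single quotes such as the one in "Bachelor's Degree".
    text = text.replace("'", "")

    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    na_idx = header.index(na_column)
    cls_idx = header.index(class_column)

    def move_class_last(row):
        # Put the class value at the end of the row.
        return row[:cls_idx] + row[cls_idx + 1:] + [row[cls_idx]]

    out = io.StringIO()
    # lineterminator="\n" ends every row, including the last, with a newline.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(move_class_last(header))
    for row in reader:
        if not row:
            continue  # skip blank lines
        if row[na_idx] == "NA":
            row[na_idx] = ""  # empty cell instead of the literal text NA
        writer.writerow(move_class_last(row))
    return out.getvalue()
```

To use it, read train.csv into a string, pass it through clean_csv_text, write the result to a new file, and load that file into the Weka Explorer as before.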

Note I didn't do any manual typing of headers. That must have taken a while!
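
If you do want to build the header yourself, it could be generated from the CSV instead of typed. The sketch below is only an illustration: the function names arff_quote and nominal_attribute_lines are made up, and it assumes (as I understand the ARFF format) that nominal values containing spaces or quotes must be wrapped in single quotes with an inner single quote escaped by a backslash. An unquoted value like Bachelor's Degree in a hand-typed header is a likely cause of the "} expected at end of enumeration" error.

```python
import csv
import io


def arff_quote(value):
    """Quote a nominal value if it contains characters that need quoting (assumed rule)."""
    if any(ch in value for ch in " ,'\"{}%\\"):
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return "'" + escaped + "'"
    return value


def nominal_attribute_lines(csv_text, numeric_columns=()):
    """Build one @ATTRIBUTE line per CSV column (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    seen = [[] for _ in header]  # distinct values per column, in first-seen order
    for row in reader:
        for i, value in enumerate(row[:len(header)]):
            if value != "" and value not in seen[i]:
                seen[i].append(value)

    lines = []
    for name, values in zip(header, seen):
        if name in numeric_columns:
            lines.append("@ATTRIBUTE " + name + " NUMERIC")
        else:
            quoted = ",".join(arff_quote(v) for v in values)
            lines.append("@ATTRIBUTE " + name + " {" + quoted + "}")
    return lines
```

Note that this only covers the header. The data rows would need the same quoting, and empty cells are normally written as ? in ARFF data, which is why letting Weka do the conversion is less work.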

Good luck!

edited Jun 24 '17 at 03:26

answered Jun 24 '17 at 03:20


I still didn't get it to work. I tried your way and got an error. https://drive.google.com/open?id=0B6ozOhSRitenRzZDNElMUVBSeFk (this is the link to what I did so far; I'm getting a "premature end of line" error). Sorry to bother you, but can you look at the file and tell me where I went wrong? – candy Jun 24 '17 at 18:34
When I load the data portion of the arff file into Excel, it goes out to column DD except for a few records. The first one where it doesn't is line 118 -- the error you get (when I repeat it) is in line 119. Is there supposed to be a question mark in that column (and similar columns later in the file)? – zbicyclist Jun 25 '17 at 02:29
I managed to get it right by doing it all again from the beginning. Anyway, thanks a lot for the help, and if you want to see it, this is what I did: https://drive.google.com/open?id=0B6ozOhSRitenZ3VxLWFFcG1IQ1U – candy Jun 25 '17 at 21:40

Happy to help. If you wouldn't mind, "accept" my answer above. This will help StackOverflow clear the question as answered. – zbicyclist Jun 26 '17 at 01:15
