
I am trying to import a CSV file into R to do fraud analysis with linear/logistic regression. What should have been fairly easy is turning out to be complicated... The data set contains 26 variables and more than 2 million rows. I used the following command to import the CSV file:

data <- read.csv('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv',header=TRUE,sep=";")

Nevertheless, R imported the 2.3 million rows as only 1 variable. I attach an image of View(data) obtained after this step for more information. I have tried switching from sep=";" to sep="," using:

datos <- read.csv('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv',header=TRUE,sep=",")

But got this error message:

Error in read.table(file = file, header = header, sep = sep, quote = quote,  : 
  more columns than column names

I have also tried changing read.csv to read.csv2 (2.3 million rows and 1 variable as the result) and using fill=TRUE (same result), but the import is still not correct. I attach another image of the original CSV opened in Excel.

I would appreciate any suggestion or help to fix this.
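In case it helps diagnose the problem, the raw first lines of the file can be inspected directly to see which character actually separates the fields, e.g.:

# peek at the first few raw lines to see which character separates the fields
readLines('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv', n = 3)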

NuValue
  • Check your separator. – Avinash Mar 17 '15 at 13:00
  • Hi @Avinash, thanks for your reply. What do you mean exactly by checking the separator? I have used both sep=";" and sep=",". In the first case R imported it as 1 variable, and in the second it showed me the error "more columns than column names". – NuValue Mar 17 '15 at 13:05
  • 1
    Try disabling quotes with `quote = ""` - check the documentation of `?read.table` and `?scan` for other alternatives. I am assuming the issue lies in embedded quotation marks. See my answer [here](http://stackoverflow.com/questions/27321849/read-table-line-15-does-not-contain-23-elements-r/27322317#27322317) for a similar question and answer. – JasonAizkalns Mar 17 '15 at 13:08
  • OK @JasonAizkalns, I'll try editing the code with your answer to the other post. Thanks for the reply. – NuValue Mar 17 '15 at 13:11
  • After adding the parameter quote="" I got the same result of 1 variable and 2.3 million rows. Thanks anyway. – NuValue Mar 17 '15 at 13:25
  • Share the subset of your CSV file (dropbox?), that can reproduce the problem. – zx8754 Mar 17 '15 at 13:41
  • Did you notice that if the column separator is a comma (,), different rows seem to have different numbers of observations? Could it be that some columns contain a comma-separated list of values and others do not? If the comma is used both inside values ("cells") AND as the column separator, it can lead to this kind of problem. Just by examining the file it might not be possible to tell the different fields apart, and then the file cannot be unambiguously imported into R. How was the file generated? – JTT Mar 17 '15 at 14:33
  • Hi @JTT, thanks for your reply. The file was generated by a third person via SQL extraction from an Oracle DWH, then converted to CSV and sent to me for the data analysis. – NuValue Mar 17 '15 at 14:43
  • 1
    Please don't share your data via anything other than stackoverflow. Remove screenshots and add first three or so rows of your data as a code block in your original post. Also consider using `read.table()`. – Roman Luštrik Mar 17 '15 at 15:52
  • 1
    Thank you and @zx8754 for your replies. The reason why I could not share the file was because of confidentiality of the data. That's why I used the screen shots, even though I know in these cases sharing the file is what it should be done. Thanks. – NuValue Mar 17 '15 at 16:28

2 Answers


Break the problem down into steps that you can check. Initially I'd try something like

file <- 'C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv'
read.csv(file, header=FALSE, skip=1, sep=',', nrows=1)

If this produces a data.frame with 1 row and 26 columns, you're in business; if not, check through the arguments of read.csv again and see whether any of them need changing for your file.
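A quick way to confirm that, assuming the result of the read above is stored in a variable:

step1 <- read.csv(file, header=FALSE, skip=1, sep=',', nrows=1)
dim(step1)   # should be 1 26
str(step1)   # quick look at how the individual fields were parsed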

Now progress to

read.csv(file, header=TRUE, skip=0, sep=',', nrows=1)

This should give you the same one-line data.frame, but with the correct column names. If not, check that the first row of the CSV file has the right number of columns, or carry on skipping the header and assign column names after you've read the data in (see the sketch below).
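A minimal sketch of that fallback, assuming `file` is the path defined above:

# read the header row on its own, then the data without it, then re-attach the names
hdr <- read.csv(file, header=FALSE, nrows=1, sep=',', stringsAsFactors=FALSE)
dat <- read.csv(file, header=FALSE, skip=1, sep=',')
names(dat) <- as.character(unlist(hdr))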

Now increase nrows, initially to 10, then perhaps by a factor of 10 at a time, until you either read in the whole file or hit a problem. Use a binary search to find the exact line that causes the problem: set nrows halfway between a value you know works and one that doesn't, and repeat until you have isolated the problem line.
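As a shortcut to the manual binary search, base R's count.fields() reports how many fields each line contains for a given separator, which points straight at the malformed lines (a sketch, assuming the expected 26 columns):

# count the fields on every line for the chosen separator;
# lines whose count differs from the expected 26 are the suspects
nf <- count.fields(file, sep=',', quote='"')
table(nf)                # distribution of field counts across the file
head(which(nf != 26))    # first few line numbers with the wrong field count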

Refer to the CSV in Excel to see what is peculiar about this line - does it have a strange character, unmatched quotes, fewer entries? This will influence how you fix the problem.

Repeat until your whole file reads in!

Miff
  • Thanks @Miff for your time and reply. Your suggestion also worked for this case. As I told xraynaud, the problem was that the "," separator was being confused with commas inside other variables such as "Address". After changing the separator to the pipe "|", R could finally read the file. Thank you also for your time and attention. – NuValue Mar 17 '15 at 16:31

From the Excel screenshot, the first line of data in your file has 31 columns; the second has 29... My guess is that your CSV file uses a comma both as the column separator and as the decimal separator. You have to re-export the file to CSV with different decimal and column separators.
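For example, if the file can be re-exported with ";" as the column separator and "," as the decimal separator, read.csv2() uses exactly those defaults (a sketch, assuming that re-export):

# read.csv2() defaults to sep=";" and dec="," - the common European CSV convention
datos <- read.csv2('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv', header=TRUE)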

xraynaud
  • Thanks for the reply @xraynaud. The reason the file was not imported correctly was that the "," separator was being confused with commas inside real estate variables such as "Address". After asking for the file to be exported with the pipe "|" as separator, R could finally read it. Thank you for your time and response. – NuValue Mar 17 '15 at 16:25
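Based on those comments, once the file is re-exported with "|" as the field separator, a read along these lines should work (the separator choice comes from the comments above):

# assuming the re-exported file uses "|" as the field separator
datos <- read.csv('C:/Users/amartinezsistac/OneDrive/PROYECTO/decla_cata_filtrados.csv', header=TRUE, sep="|")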