I have a large data set (~20000 x 1). Not all of the fields are filled; in other words, the data has missing values. Each feature is a string.

I have tried the following:

Input:

data <- read.csv("data.csv", header=TRUE, quote = "")
datan <- read.table("data.csv", header = TRUE, fill = TRUE)

Output of the second call (read.table):

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 80 elements

Input:

datar <- read.csv("data.csv", header = TRUE, na.strings = NA)

Output:

Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string

As far as I can see, I run into essentially four problems. Two of them are the messages shown above. The third is that, even when no error is raised, the Global Environment pane shows that not all of my rows are accounted for: about 14,000 samples are missing, although the number of features is correct. The fourth is that, in other cases, not all of the samples are accounted for and the number of features is also wrong.

How can I solve this?

Maurits Evers
  • Are there commas in your data? – Ryan Morton Feb 21 '18 at 22:09
  • Generally this means that you don't fully understand the format that your file is in. Somewhere there's an unusual character, an unmatched quote, a field that contains a comma, etc. But there's no way for _us_ to figure that out, because we don't have your file. – joran Feb 21 '18 at 22:11
  • No, but does it matter if the inputs have periods at the ends? An example of one would be "#DogRules!!! I am feeling happy to see dogs." – Jayganesh Kalla Feb 21 '18 at 22:12
  • It's unstructured data – Jayganesh Kalla Feb 21 '18 at 22:12
  • You could also try using `comment.char = ""`. That should help when you have a pound sign. – desc Feb 21 '18 at 22:26
  • *"It's unstructured data"* ... doesn't that mean not CSV? – r2evans Feb 21 '18 at 22:38
  • Try to disable quoting like `datar <- read.csv("data.csv", quote = "", row.names = NULL, stringsAsFactors = FALSE)` – Aleh Feb 21 '18 at 22:42
  • Best go to the `bash` or `dos` command line for a moment, depending on your OS. Type `head -3 data.csv` and have a look at it. If you are still unsure, then add this example to your question. Otherwise this is a "how long is a piece of string" question. – Stephen Henderson Feb 21 '18 at 23:07

2 Answers

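Try the argument comment.char = "" together with quote = "". With read.table, whose default is comment.char = "#", the hash (#) is read by R as the start of a comment and cuts the line short. (read.csv already defaults to comment.char = "", so there the quote argument is the one that matters.)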
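
As a minimal sketch of that suggestion, the snippet below writes a small made-up file (the temporary file and its contents are assumptions for illustration, not the asker's data) and reads it with both comment handling and quoting switched off, so that a leading hash and an unmatched quote stay inside their rows:

```r
# Hypothetical sample file: a single text column containing a leading '#'
# and an unmatched double quote (illustrative data only).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "text",
  "#DogRules!!! I am feeling happy to see dogs.",
  "He said \"hello and never closed the quote",
  "A plain line"
), tmp)

# Disable comment handling and quoting so every line is read literally.
dat <- read.table(tmp, header = TRUE, sep = ",", quote = "",
                  comment.char = "", stringsAsFactors = FALSE)

nrow(dat)     # number of data rows read
dat$text[1]   # the row that starts with '#'
```

If the real file can contain commas inside the text, disabling quoting will split those fields, so this only suits data like the example where each line is one field.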

David Foster

Can you open the CSV in Notepad++? It lets you see 'invisible' and other non-printable characters. The file may not contain what you think it contains! Once the problem with the source file is resolved, you can choose the CSV file with a file selector.

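# Pick the file interactively
filename <- file.choose()
# Note: skip = 1 drops the first line of the file; omit it if that line is the header
data <- read.csv(filename, skip = 1)
# Keep the file name without its directory
name <- basename(filename)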

Or, hard-code the path, and read the data into R.

# Read CSV into R
MyData <- read.csv(file="c:/your_path_here/Data.csv", header=TRUE, sep=",")
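
The raw file can also be inspected from within R before parsing it. The following is a rough sketch under the assumption of a comma-separated file; the sample file and the names n_fields and bad_lines are made up for illustration. It prints the first raw lines and then uses count.fields to find lines whose number of fields differs from the header's:

```r
# Hypothetical sample file with one malformed row (an extra comma).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,text", "1,fine", "2,has, an extra comma", "3,also fine"), tmp)

# Look at the raw lines first, similar to `head -3 data.csv` in a shell.
head(readLines(tmp), 3)

# Count the fields on every line, with quoting and comments disabled.
n_fields <- count.fields(tmp, sep = ",", quote = "", comment.char = "")
table(n_fields)

# Lines whose field count differs from the header line are worth inspecting.
bad_lines <- which(n_fields != n_fields[1])
bad_lines
```

Running the same check with the default quote setting and comparing the counts is one way to tell whether stray quotes, rather than stray separators, are causing rows to go missing.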
ASH