
Whenever I use read.csv.sql I cannot select from the first column, and any output from the code places unusual characters (starting with `Ã`) at the beginning of the first column's name.

So suppose I create a df.csv file in Excel that looks something like this

df = data.frame(
  a = 1, 
  b = 2, 
  c = 3, 
  d = 4)

Then if I use sqldf to query the csv which is in my working directory I get the following error:

> read.csv.sql("df.csv", sql = "select * from file where a == 1")
Error in result_create(conn@ptr, statement) : no such column: a

If I query a column other than the first, I get a result, but the output contains the unusual characters, as seen below

df <- read.csv.sql("df.csv", sql = "select * from file where b == 2")
View(df)

I get the table, but with the unusual characters prepended to the first column's name.

Any idea how to prevent these characters from being added to the first column name?

Mark Z
  • Check the encoding; see https://stackoverflow.com/questions/22974765/weird-characters-added-to-first-column-name-after-reading-a-toad-exported-csv-fi – Chabo Mar 12 '19 at 21:25
  • If that's the only problem (and you are unable or unwilling to fix the encoding problem, if that is it), then you can always just overwrite the column names with `colnames(x) <- letters[seq_len(ncol(x))]` to get you lower-case letters for all columns. If you have more than 26, you'll need to accommodate ... – r2evans Mar 12 '19 at 21:50
  • 1
    Thanks for the replies. Reading through the link I realized that it was how I was saving the file in excel and by changing from a CSV UTF-8 to just CSV it removed the characters and I could query the first column as usual. – Mark Z Mar 13 '19 at 14:21

2 Answers


The problem is presumably that your file is too large for R to handle, so you want to read only a subset of rows into R, but the condition you would filter by refers to the first column, whose name is mangled, so you can't use it.

Here are two alternative approaches. The first involves a bit more code but has the advantage that it is 100% R. The second is a single statement and also uses R, but additionally makes use of an external utility.

1) skip header Read the file in, skipping over the header. That causes the columns to be labelled V1, V2, etc., so use V1 in the condition.

# write out a test file - BOD is a data frame that comes with R
write.csv(BOD, "BOD.csv", row.names = FALSE, quote = FALSE)

# read file skipping over header
DF <- read.csv.sql("BOD.csv", "select * from file where V1 < 3", 
  skip = 1, header = FALSE)

# read in header, assign it to DF and fix first column
hdr <- read.csv.sql("BOD.csv", "select * from file limit 0")
names(DF) <- names(hdr)
names(DF)[1] <- "TIME"  # suppose we want TIME instead of Time

DF
##   TIME demand
## 1    1    8.3
## 2    2   10.3

2) filter Another way to proceed is to use the filter= argument. Here we assume we know that the column name ends in ime but that there are unknown characters before it. This assumes that sed is available and on your path. If you are on Windows, install Rtools to get sed. The quoting might need to be changed depending on your shell.

When trying this on Windows I noticed that sed from Rtools changed the line endings, so below we specify eol= to ensure correct processing. You may not need that.

DF <- read.csv.sql("BOD.csv", "select * from file where TIME < 3", 
  filter = 'sed -e "1s/.*ime,/TIME,/"' , eol = "\n")
DF
##   TIME demand
## 1    1    8.3
## 2    2   10.3
G. Grothendieck
  • Thanks for the detailed response. I actually figured out how to take the characters away completely, but I was going to use your methods if I was unable to do that. – Mark Z Mar 13 '19 at 14:25

So I figured it out by reading through the above comments.

I'm on a Windows 10 machine using Excel for Office 365. The special characters go away if I save the file as "CSV (Comma delimited)" instead of "CSV UTF-8 (Comma Delimited)".
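For reference, those characters are the UTF-8 byte-order mark (bytes EF BB BF) that Excel's "CSV UTF-8" format writes at the start of the file; read as Latin-1 they display as stray `Ã¯`-like characters glued to the first column name. If re-saving the file isn't an option, the BOM can also be stripped directly. A minimal sketch, assuming GNU sed (which Rtools provides on Windows); df.csv here is a simulated version of the file from the question:

```shell
# simulate Excel's "CSV UTF-8" output: a BOM (octal 357 273 277) before the header
printf '\357\273\277a,b,c,d\n1,2,3,4\n' > df.csv

head -c 3 df.csv | od -An -tx1   # the first three bytes are ef bb bf

# strip the BOM in place (GNU sed understands \xHH escapes and -i)
sed -i '1s/^\xef\xbb\xbf//' df.csv

head -c 1 df.csv                 # the file now starts with the real header, "a"
```

After this, a query such as read.csv.sql("df.csv", sql = "select * from file where a == 1") should find column a as usual.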

Mark Z