0

I have data I need to clean for each row in a file, and I want to insert the cleaned data in a SQLite3 database. I'm using the RSQLite library which requires a dataframe. Here's the code I trying to get working:

# Select feature names for use as column names in X train/test loading
feature_names <- unlist(dbGetQuery(con, "select feature_name from features order by feature_id"), use.names = FALSE);

# Load X training data
X_train_lines <- readLines("data/train/X_train.txt"); # Space delimited with leading and trailing spaces
X_train_values <- vector("list", length(X_train_lines));
names(X_train_values) <- feature_names; # colnames or names?
for (index in 1:length(X_train_lines)) {
  cleaned_line <- gsub("^ *|(?<= ) | *$", "", X_train_lines[index], perl=TRUE); # remove extraneous whitespaces
  X_train_values[index] <- strsplit(cleaned_line, " "); # Wondering if X_train_values[index] is correct? 
}
# Write features data to features table
dbWriteTable(con, "X_train", as.data.frame(X_train_values), row.names = FALSE);

While the code executes without incident, when I attempt to look at the database using DbVisualizer I get an error:

An error occurred while performing the operation:
malformed database schema (X_train) - too many columns on X_train

My only guess is that rows and columns are somehow transposed. My column names should be the value of the feature_names vector.

Also, if anyone has any suggestions for a better approach...

UPDATE

I tried doing a dput, though I had no idea what I was looking at. Here's the top of the summary:

head(summary(X_train_values))

                   Length Class    Mode       
tBodyAcc-mean()-X "561"  "-none-" "character"
tBodyAcc-mean()-Y "561"  "-none-" "character"
tBodyAcc-mean()-Z "561"  "-none-" "character"
tBodyAcc-std()-X  "561"  "-none-" "character"
tBodyAcc-std()-Y  "561"  "-none-" "character"
tBodyAcc-std()-Z  "561"  "-none-" "character"

Again, this leads me to believe the data is all mixed up. It should have 561 columns, some of which are represented above as tBodyAcc-mean()-X etc. The column values should be floating-point numbers which I don't see above.

The table command did not work:

table(X_train_values)
Error in table(X_train_values) : 
  attempt to make a table with >= 2^31 elements

I should have 7,352 rows with 561 columns.

UPDATE 2

I believe my problem is that I'm trying to use a list like an array or arrays. In Ruby, for example, I could do something like this:

x_train_values = []
x_train_lines.each { |line| x_train_values << line.split(' ') }
Jaap
  • 81,064
  • 34
  • 182
  • 193
Edward J. Stembler
  • 1,932
  • 4
  • 30
  • 53
  • even a partial `dput(X_train_Values)` would be rly handy – hrbrmstr Jul 03 '14 at 00:29
  • You don't need a for loop to do a gsub on a vector, simply `gsub("old","new",vector)` will work – JeremyS Jul 03 '14 at 01:04
  • If you make a list and then call a function that coerces to a dataframe, then each list element becomes a column, so it just like the 'transposition' that happens with `apply(dat, 1, ...)`. If you constructed an example using one of the built-in databases in the packages you are using then people might be able to help more. – IRTFM Jul 03 '14 at 04:44

1 Answers1

0

In the following lines

for (index in 1:length(X_train_lines)) {
    cleaned_line <- gsub("^ *|(?<= ) | *$", "", X_train_lines[index], perl=TRUE);
    X_train_values[index] <- strsplit(cleaned_line, " ");
}

you are using single square brackets ([) to access the columns of a data frame, when you should be using double ones ([[). When you use X_train_lines[index], a data frame is returned, which has one column, equal to X_train_lines[index]. When you use X_train_lines[[index]], however, the actual contents of that column are returned (see http://adv-r.had.co.nz/Subsetting.html for more details).

Now, the way gsub works, is that it first converts its argument to a character using as.character and then processes it. In your case, X_train_lines[index] returns a data.frame whose single column is a factor (I guess) and therefore when coerced to a character you will get the factor levels, not the actual contents! So you are actually calling gsub on a string that looks like "1:2:3:...". If you use double brackets instead, then gsub will coerce a factor (instead of a data frame) to a character, which will work as desired.

As an aside, in R you do not need to end your lines with ;. This is needed only to separate multiple statements on the same line.

Finally, it is best to try and avoid for loops because they can be slow and because there are more efficient functions with simpler syntax that can usually accomplish what you need (like lapply, apply, sweep, etc). For column/row/element-wise operations on a data frame/matrix/etc you can use apply, in which case your code will look like

apply(X_train_values, 2, gsub, pattern = "^ *|(?<= ) | *$",
    replacement = "", perl = T)
konvas
  • 14,126
  • 2
  • 40
  • 46
  • Thanks for the clarifications/tips! The reason I'm using a loop is because my X_train data file is space delimited, but also could have extra leading spaces for each column to accommodate a minus sign. Once I've stripped the extra spaces I split the row by spaces to get the correct column values. Then I'm trying to force it to a data.frame so I can save it in the database. Not ideal, but I couldn't think of another way since I'm an R novice. – Edward J. Stembler Jul 04 '14 at 15:19
  • OK I see... I think your `for` loop should work anyway, just replace `[` with `[[` in `X_train_lines[index]`. I was just suggesting `apply` as a more "R-friendly" way of doing it. But you can keep the `for` loop if it works for you. – konvas Jul 04 '14 at 15:22