I'm working with a data frame (I'll call it 'letters') in R that has 15 rows and 2 columns. Each cell in column 2 contains a character string like "A|B|C|D|E". I want to split the string at each place a | appears to get the vector c("A", "B", "C", "D", "E"). Here's my best idea of how to do this:

for(i in 1:nrow(letters)){
  letters[i,2] <- strsplit(letters[i,2], split = "[|]")
}

I get an error similar to the one discussed here ("replacement has [x] rows, data has [y]"), and it seems to be trying to make a separate column for each index of the output vector. I'm sure this is a simple question, but I am new to R and stuck.

2 Answers

Is strsplit(letters[i,2], split = "[|]")[[1]] what you're looking for? You won't be able to put that vector back in letters[i,2], though, as it has length 5 (instead of 1).

KaB
  • Yes, that's what I'm looking for. I didn't realize that you can't store a vector in the cell of a data frame. – gooseshoes May 30 '18 at 19:28
  • What would you suggest doing if what I'm trying to do in the end is ask: which rows contain "A"? Which contain "D"? and so I need a list associated with each row name? – gooseshoes May 30 '18 at 20:01
  • Use `grepl("A", letters[[2]])` to determine if each value contains `"A"`. – Nathan Werth May 30 '18 at 20:51

Your second column is (I think) a character vector. As its documentation (?strsplit) mentions, strsplit returns a list. Before we get into why your specific situation happened, some general advice:

  1. Make a new column instead of replacing an existing one. This has the added benefit of not losing the original values.
  2. Only replace values in a column with new values of the same class (e.g., character for character, integer for integer).

So I suggest adding a new column of split values:

letters[["splits"]] <- strsplit(letters[[2]], split = "|", fixed = TRUE)

You now have a list column, and each row of this column has a vector of the split letters from the original values.
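As a minimal sketch of what that looks like, assuming a made-up three-row data frame standing in for the 15-row one in the question (the `id` and `codes` column names and all values are hypothetical), the list column can then answer questions like "which rows contain A?":

```r
# Hypothetical stand-in for the asker's data. Note that naming it `letters`
# masks R's built-in `letters` vector for the rest of the session.
letters <- data.frame(
  id    = c("r1", "r2", "r3"),
  codes = c("A|B|C|D|E", "B|C", "A|D"),
  stringsAsFactors = FALSE
)

# One character vector per row, stored as a list column.
letters[["splits"]] <- strsplit(letters[[2]], split = "|", fixed = TRUE)

# For each row, test whether the split vector contains an exact match.
has_A <- vapply(letters[["splits"]], function(x) "A" %in% x, logical(1))
has_D <- vapply(letters[["splits"]], function(x) "D" %in% x, logical(1))

letters[["id"]][has_A]  # ids of rows containing "A"
letters[["id"]][has_D]  # ids of rows containing "D"
```

Because %in% compares whole elements, this would not count a hypothetical value like "AB" as containing "A", which a plain substring search on the original string would.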

Why your problem happened

Let's dissect the assignment statement:

letters[i,2] <- strsplit(letters[i,2], split = "[|]")

On the left side of <- is letters[i, 2], a subset of a data.frame. A data.frame stores all of its data in a list of columns. R allows us to use this fact, especially in assignment. We can add or replace columns just like adding or replacing items in a list.

# This...
letters[, "one"] <- 1
letters[, "two"] <- 2
# is effectively the same as this
letters[, c("one", "two")] <- list(1, 2)

To the right of <-, we have a call to strsplit(), which returns a list. As in the example just above, if you assign a list to a subset of a data.frame, it will be coerced into a data.frame itself. Each element of the list will be considered a column. So, the assignment plays out like this:

  1. If letters[i,2] is "A|B|C|D|E", then strsplit(letters[i,2], split = "[|]") is list(c("A", "B", "C", "D", "E")).
  2. The assignment checks both sides, and sees the data.frame as the "higher" type, so it coerces the list to a data.frame. The right side is now effectively data.frame(c("A", "B", "C", "D", "E")).
  3. Now it tries to assign a data.frame with 1 column and 5 rows to a subset with 1 column and 1 row. Those dimensions don't match, so it takes what it can from the right side (just the first row) and warns you about what happened.
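The shape mismatch in these steps can be inspected directly. This is only a sketch of the right-hand side for a single cell; the `rhs` and `rhs_as_df` names and the `value` column name are made up for illustration:

```r
# What the right-hand side looks like for a single cell.
rhs <- strsplit("A|B|C|D|E", split = "[|]")

is.list(rhs)   # strsplit() returns a list ...
length(rhs)    # ... with one element per input string ...
lengths(rhs)   # ... and that element holds the five split pieces

# Treated as a column, that element spans five rows,
# while the target letters[i, 2] is a single cell.
rhs_as_df <- data.frame(value = rhs[[1]])
dim(rhs_as_df)
```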

Why the suggested assignment works

So why isn't there any coercion in this?

letters[["splits"]] <- strsplit(letters[[2]], split = "|", fixed = TRUE)

The left side uses [[ subsetting (treating the data.frame like a list) to add or replace the "splits" column. So no coercion is ever done.

Also, a data.frame can have a list as a column, just like a list can have a list as an element. A data.frame column just has to satisfy two things:

  1. It has to be a vector.
  2. Its length must be equal to the number of rows in the data.frame (recycling is attempted if necessary).

A list is a type of vector. And strsplit() returns a list the same length as its input, so both criteria are met.
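Both criteria can be checked on the result of strsplit() itself. A small sketch, with hypothetical input values:

```r
x <- c("A|B|C|D|E", "B|C", "A|D")  # hypothetical input values
splits <- strsplit(x, split = "|", fixed = TRUE)

is.vector(splits)            # criterion 1: a list counts as a vector
length(splits) == length(x)  # criterion 2: one list element per input string
```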

Nathan Werth