0

I have a csv with 2 columns but should be 7. The first column is a numerical ID. The second column has the other six numerical values. However, there are several different delimiters between them. They all follow the same pattern: numerical value, a dash ("-) OR a colon (":"), eight spaces, and then the next numerical value, until the final numerical value, with nothing after it. It starts with a dash and alternates with a colon. For example:

28.3-        7.1:        62.3-        1.8:        0.5-        196

Some of these cells have missing values denoted by a single period ("."). Example:

24-        .:        58.2-        .:        .-        174

I'm using R but I can't figure out how to accomplish this. I know it probably requires dplyr or tidyverse but I can't find what to do where there are different delimiters and spaces.

So far, I've only successfully loaded the csv and used "str()" to determine that the column with these six values is a factor.

Here is how the data look in the .csv:

enter image description here

Here is how it looks in RStudio after I read it in using read.csv

enter image description here

Here is how it looks in RStudio if I use tab as the delimiter when using read.csv, as suggested in the comments

enter image description here

  • Is this the text as it appears in the file? No headers? Or have you read it in already and now want to split? – camille Feb 12 '20 at 17:22
  • This is how the text appears in the file. The file has headers for the two columns that were included when I read the data in. This column's name is 'panel'. Also, thanks for editing my question so the sample data look correct. – immaprogrammingnoob Feb 12 '20 at 17:30
  • From you example can you not use a tab as a delimiters `read.table(filename, sep="\t", header=TRUE)` – Jim Feb 12 '20 at 17:32
  • I can't tell what the two columns are you're talking about. Can you add a snippet of the file, including the headers? – camille Feb 12 '20 at 17:56
  • I just added a couple of pics to show what's going on. Does that help? I know I can't attach the .csv to the question but perhaps there's something else I can do. – immaprogrammingnoob Feb 12 '20 at 18:09

1 Answers1

0

I would try just to sort out that first column if it is the only one doing the following:

CDC_delim <- read.table('CBC.csv', sep="\t", header=F)

head(CBC_delim)

then to split that first column into two but keep both elements:

CBC_delim <- CBC_delim %>% 
  #
  mutate(column1 = as.character(column1)) %>% # your column names will be different, maybe just V1,
  #
  mutate(col2 = sapply(strsplit(column1,","), `[`, 1),
         col3 = sapply(strsplit(column1,","), `[`, 2))

Should leave you with some basic tidy up such as deleteing the original column1, you can check you column names using colnames(CBC_delim)

But also see:

how-to-read-data-with-different-separators

Jim
  • 558
  • 4
  • 13
  • Thanks for this code. Unfortunately, executing this ends up combining the first column of the original csv file with the first numerical value in the second column. This makes sense to me as my understanding of a .csv file is that the columns are separated by commas but we're telling R to ignore commas and focus on tabs. Is there a way to read in just one column at a time? For example, I read in just the first column of the .csv. Then I read in the second column in another step and that's when I do the tab delimiter. (Hope this makes sense conceptually) – immaprogrammingnoob Feb 12 '20 at 17:49
  • Yep makes sense, sorry that couldn't help! I am a bit of a R novice myself but would suggest there is an answer lurking within `str_split` within the `stringr` package. If the delimiters are consistent you can bring it is as a string then slowly split it into columns. I will have a bash now – Jim Feb 12 '20 at 17:54
  • Can you post it as it appears in the text file at the end of your question? – Jim Feb 12 '20 at 18:00
  • I added some pictures that should help – immaprogrammingnoob Feb 12 '20 at 20:55