
I am trying to convert raw data from a text file into a matrix. I've read the data using readLines(), then split each line into fields with strsplit() (i.e. male;20;30.5 => "male" "20" "30.5"), which gives a list.

The catch is that some values are missing (the sex, age, or weight was not recorded) and some weights use a comma in place of the decimal point. In these cases the data list contains rows that look something like this:

##"male"    "20"   "55.3"
##"male" "45" 

or

##"" "55" "55"

I want to write a function that corrects these cases by filling in an NA, then apply that function with lapply(data.dataList, function). Functions in R are not my strongest point, but here is my first attempt:

# function to correct column order for weight data
f.assignFields <- function(x) {
  # create a blank character vector of length 3
  out <- character(3)
  sex <- grepl("[[:alpha:]]", x)
  out[1] <- x[sex]
  age.num <- which(as.numeric(x) < 0)
  out[2] <- ifelse(length(age.num) > 0, x[age.num], NA)
  weight.num <- which(as.numeric(x) > 0)
  out[3] <- ifelse(length(weight.num) > 0, x[weight.num], NA)
  out
}

data.standardFields <- lapply(data.dataList, f.assignFields)

I know I want to put the string containing a letter in the first column, and put the others in the second and third. Also, should I replace "," with "." in the weights before or after applying lapply()? Just a little nudge in the right direction would be greatly appreciated.

EDIT: The data drawn from the text file is very small: only nine individuals, each recording their sex, age and weight. The point of the exercise was to handle the raw data by modifying and transforming it by hand, to examine the usefulness of doing so rather than using read.table().

male;28;81.3
male;45;
female; 17 ;57,2
female;64;62.8
male;16;55.3
male;;50,1
female;20.4;55
female;;
;55;55

Here's what I did:

#read text file
weight.data <- readLines("text.txt")

#removed white spaces
weight.data <- gsub(" ","",weight.data)
weight.data

[1] "male;28;81.3"     
[2] "male;45;"      
[3] "female;17;57,2"
[4] "female;64;62.8"  
[5] "male;16;55.3"   
[6] "male;;50,1"       
[7] "female;20.4;55"     
[8] "female;;"          
[9] ";55;55" 

#split strings by semicolon
weight.dataList <-strsplit(weight.data, split = ";")
weight.dataList

[[1]]
[1] "male"    "28"   "81.3"

[[2]]
[1] "male" "45"  

[[3]]
[1] "female" "17"     "57,2"  

[[4]]
[1] "female" "64"   "62.8"

[[5]]
[1] "male"  "16"   "55.3"

[[6]]
[1] "male"    ""     "50,1"

[[7]]
[1] "female"    "20.4" "55"  

[[8]]
[1] "female" ""  

[[9]]
[1] ""   "55" "55"

I want to add NAs to the rows with missing values. I am trying to create a function that will correct the row dimensions for each record. For example, the second entry should have an NA for its weight.

# function to correct column order and size for weight data
f.assignFields <- function(x) {
  # create a blank character vector of length 3
  out <- character(3)
  sex <- grepl("[[:alpha:]]", x)
  # puts sex in first column
  out[1] <- x[sex]
  # assigns NA if age missing
  age.num <- which(as.numeric(x) < 0)
  out[2] <- ifelse(length(age.num) > 0, x[age.num], NA)
  # assigns NA if weight missing
  weight.num <- which(as.numeric(x) > 0)
  out[3] <- ifelse(length(weight.num) > 0, x[weight.num], NA)
  out
}

data.standardFields <- lapply(data.dataList, f.assignFields)

In the end I will be using unlist() and matrix() to transform the data to row-column format. I want to replace the missing values with NA, put the fields in the order sex, age, weight, and fix the weights so that a value such as 50,1 is shown as 50.1.

Brian Tompsett - 汤莱恩
  • A reproducible example and a desired output could come in handy – David Arenburg May 03 '14 at 21:04
  • It seems like you could just use read.table(..., sep=";", fill=T) or something. Can you give a larger sample of raw data? Also, where does the data.dataList function come from? What input does it expect? – MrFlick May 03 '14 at 21:05
  • TL;DR - make it more concise and more concrete at the same time. – Honza Zidek May 03 '14 at 21:27

1 Answer

The easiest way is to use read.table, but it seems your professor is trying to torture you. No data set anywhere, ever, would have 20.4 listed as a person's age.
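
For comparison, the read.table route could look roughly like the sketch below. It assumes the nine sample lines from the question are held in a string named txt; the name DF.rt is made up for this illustration. Because the weights mix decimal commas and decimal points, the commas are replaced first instead of relying on the dec argument.

```r
# Sample data from the question, held in a single string
txt <- "male;28;81.3
male;45;
female; 17 ;57,2
female;64;62.8
male;16;55.3
male;;50,1
female;20.4;55
female;;
;55;55"

# Weights mix "57,2" and "81.3", so normalise the decimal mark first
clean <- gsub(",", ".", txt, fixed = TRUE)

# DF.rt is an illustrative name, not one used in the question
DF.rt <- read.table(text = clean, sep = ";",
                    strip.white = TRUE,  # drops the spaces around " 17 "
                    na.strings = "",     # empty fields become NA
                    col.names = c("sex", "age", "weight"),
                    stringsAsFactors = FALSE)
str(DF.rt)
```

With these settings age and weight should come back as numeric columns and the empty fields as NA. The manual approach the exercise asks for follows.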

txt <- "male;28;81.3
male;45;
female; 17 ;57,2
female;64;62.8
male;16;55.3
male;;50,1
female;20.4;55
female;;
;55;55"
# strip all whitespace, then turn decimal commas into decimal points
x <- gsub("\\s+", "", readLines(textConnection(txt)))
rpl.comma <- gsub(",", ".", x, fixed = TRUE)
spl <- strsplit(rpl.comma, ";")
# fill a 9 x 3 character matrix one column at a time
M <- matrix(NA_character_, nrow = length(x), ncol = 3)
for (j in 1:3) {
  M[, j] <- sapply(spl, function(s) {
    # s[j] is NA when strsplit() dropped a trailing empty field
    if (is.na(s[j]) || s[j] == "") NA_character_ else s[j]
  })
}
DF <- data.frame(M, stringsAsFactors = FALSE)
names(DF) <- c("sex", "age", "weight")
DF
##      sex  age weight
## 1   male   28   81.3
## 2   male   45   <NA>
## 3 female   17   57.2
## 4 female   64   62.8
## 5   male   16   55.3
## 6   male <NA>   50.1
## 7 female 20.4     55
## 8 female <NA>   <NA>
## 9   <NA>   55     55
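
The question also asks for a function to use with lapply(), followed by unlist() and matrix(). A minimal sketch of that route is below. The helper pad_fields() and the names raw, tidy, fields, padded and M2 are made up for this illustration. Since gsub() is vectorised, the decimal commas are fixed on the whole character vector before splitting, not inside the function.

```r
# The nine raw lines, as readLines() would return them
raw <- c("male;28;81.3", "male;45;", "female; 17 ;57,2", "female;64;62.8",
         "male;16;55.3", "male;;50,1", "female;20.4;55", "female;;", ";55;55")

# Remove whitespace and fix decimal commas before splitting
tidy <- gsub(",", ".", gsub("\\s+", "", raw), fixed = TRUE)
fields <- strsplit(tidy, ";", fixed = TRUE)

# pad_fields() is a hypothetical helper, not part of base R
pad_fields <- function(x, n = 3) {
  out <- x[seq_len(n)]                # indexing past the end yields NA
  out[!is.na(out) & out == ""] <- NA  # empty strings also count as missing
  out
}

padded <- lapply(fields, pad_fields)

# One record per row, in the order sex, age, weight
M2 <- matrix(unlist(padded), ncol = 3, byrow = TRUE,
             dimnames = list(NULL, c("sex", "age", "weight")))
M2
```

This relies on the fields already being in the order sex, age, weight, which holds for the sample because strsplit() only drops a trailing empty field and keeps leading and middle ones as "".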
Rich Scriven