I am trying to convert raw data from a text file into a matrix. I've read the data using readLines()
, then separated the data with grepl()
(i.e. male;20;30.5 => "male" "20" "30.5") into a list.
The only thing is that the data is missing some values where the sex, age, or weight was not recorded or commas took the place of decimal points. In these cases the data list contains rows that look something like this:
##"male" "20" "55.3"
##"male" "45"
or
##"" "55" "55"
I want to apply a function to correct these instances by appending an NA
. Then apply that function to lapply(data.dataList, function)
. Functions in R are not my strongest point but here is my first attempt:
# function to correct column order for weight data
f.assignFields <- function(x) {
# create a blank character vector of length 3
out <- character(3)
sex <- grepl("[[:alpha:]]",x)
out[1] <- x[sex]
age.num <- which(as.numeric(x) <0)
out[2] <- ifelse(length(length(age.num) > 0, x[age.num], NA)
weight.num <- which(as.numeric(x) > 0)
out[3] <- ifelse(length(weight.num) > 0, x[weight.num], NA)
out
}
data.standardFields <- lapply(data.dataList, fassignFields)
I know I want to put the string with a letter to the first column, and put the others in the second and fourth. Also should I replace "," with "." weights before or after apply lapply()
? Just a little nudge in the right direction would be greatly be appreciated.
EDIT:
The data is drawn from the text file is very small. Only nine individuals recording their sex, age and weight. The point of the exercise was to handle the raw data by modifying and transforming the data to examine the usefulness of modifying it yourself, rather than using read.table()
.
male;28;81.3
male;45;
female; 17 ;57,2
female;64;62.8
male;16;55.3
male;;50,1
female;20.4;55
female;;
;55;55
Here's what I did:
#read text file
weight.data <- readLines(text.txt)
#removed white spaces
weight.data <- gsub(" ","",weight.data)
weight.data
[1] "male;28;81.3"
[2] "male;45;"
[3] "female;17;57,2"
[4] "female;64;62.8"
[5] "male;16;55.3"
[6] "male;;50,1"
[7] "female;20.4;55"
[8] "female;;"
[9] ";55;55"
#split strings by semicolon
weight.dataList <-strsplit(weight.data, split = ";")
weight.dataList
[[1]]
[1] "male" "28" "81.3"
[[2]]
[1] "male" "45"
[[3]]
[1] "female" "17" "57,2"
[[4]]
[1] "female" "64" "62.8"
[[5]]
[1] "male" "16" "55.3"
[[6]]
[1] "male" "" "50,1"
[[7]]
[1] "female" "20.4" "55"
[[8]]
[1] "female" ""
[[9]]
[1] "" "55" "55"
I want to add NA's to the missing rows. I am trying to create a function that will correct the row dimnensions for the field. For example, the second entry should be have an NA for it's weight.
# function to correct column order and size for weight data
f.assignFields <- function(x) {
# create a blank character vector of length 3
out <- character(3)
sex <- grepl("[[:alpha:]]",x)
# puts sex in first column
out[1] <- x[sex]
# assigns NA if age missing
age.num <- which(as.numeric(x) <0)
out[2] <- ifelse(length(length(age.num) > 0, x[age.num], NA)
# assigns NA if weight missing
weight.num <- which(as.numeric(x) > 0)
out[3] <- ifelse(length(weight.num) > 0, x[weight.num], NA)
out
}
data.standardFields <- lapply(data.dataList, fassignFields)
In the end I will be using unlist()
and matrix()
to transform the data to row-column format. I want to replace the data's missing values with NA, put the data in the following order "Sex, age, weight" and fix the weights so that the 55,1 is shown as 55.1.