1

Hi I have the below dataframe. Since the column contains NA's the datatype of the column is character. Now, I need to get the column name and index which only contains String value.

In the example below, I want to get the column name and column index of Zo-A and Zo-B:

 ZONE-1        Zo-A         Zone-3        Zo-B
 58            On             75          NA
 60            NA             NA          High
 NA            Off            68          Low
 70            On             NA          NA

So far I tried to first convert all of them to numeric, which created NA's for Zo-A and Zo-B column. And if I use the below code for column index, I'm getting NA's as a result

a <- which(colnames(df)=="Zo-A" )
integer(0)

match_col <- match(c("Zo-A","Zo-B")names(df))
NA NA

I need to perform below operations:

  1. I need to first get the column names which consists of String values
  2. I need the column index for the same
aynber
  • 22,380
  • 8
  • 50
  • 63
Anagha
  • 3,073
  • 8
  • 25
  • 43
  • The dash is not allowed in colnames so it is converted to `.`, at least when I tried it), try: `which(colnames(df)=="Zo.A")` – timfaber Sep 06 '17 at 07:33
  • By default it takes dash. And my actual column name is pretty big, with lots of dashes. And the columns I have in my dataset is around 100. So converting them all to ` .` would be difficult. – Anagha Sep 06 '17 at 07:40
  • No than that's the issue only by reading the data myself, can you share some input (dput)? – timfaber Sep 06 '17 at 07:43
  • `a <- which( colnames(JCI_S2_Reshape)=="MUS2-T-H.ZONE-1-T.MUS2-T-H.ZONE-1-T.Present Value (Trend1)" )` this worked though. But first I need to find out those column names with string value in it. – Anagha Sep 06 '17 at 07:46
  • What do you mean by *string-value*? Do you need something like this `grep('\\D+', colnames(d), value = TRUE)` ? – Jaap Sep 06 '17 at 07:54
  • Works with me, I cannot reproduce your error. The column index returned by `which` is `2`, as expected. – Rui Barradas Sep 06 '17 at 07:54
  • @timfaber when reading the data in with `read.table`, use `check.names = FALSE`, and the column names will have the dashes. – Rui Barradas Sep 06 '17 at 07:54
  • yes, worked for me as well using dots hence I figured it was symbol-related, thanks! – timfaber Sep 06 '17 at 07:56
  • @Jaap : As I have mentioned in my example, "Zo-A" and "Zo-B" columns consists of values like "On", "Off" etc. where as the other 2 columns have numeric value. Hence I need to figure out those columns with "string values" – Anagha Sep 06 '17 at 08:00
  • @RuiBarradas I'm loading the file using "read.csv" `df<- read.csv("Mus.csv",header = T, stringsAsFactors=FALSE)` – Anagha Sep 06 '17 at 08:02
  • When columns have string values they will be automatically forced to *character*-class (when you use `stringsAsFactors=FALSE` in `read.csv`). Hence, `sapply(d, is.character)` will give you a logical vector indicating which columns contain *'string'*-values. To get the name, you can use: `colnames(d)[sapply(d, is.character)]`. – Jaap Sep 06 '17 at 08:26
  • @Jaap thanks, this lists out all the column names, since it contains "NA" which consider that column as character. I need those columns, whose values have "On", "Off" etc. As I have explained above in the example as well. In the above example, i need to get the colnames of Z0-A and Z0-B. In my actual dataset, I have more columns around 100, and need to figure out those colnames similarly – Anagha Sep 06 '17 at 08:41
  • 2
    I wonder whether you tried my code, because it gives me exact the columnnames you need. – Jaap Sep 06 '17 at 08:54
  • Thanks. Again it gives all the character columns. I need only those character columns which has "String value" like on,off, high , low and **not** numerical value like 58,70. I have clearly mentioned above with an example. – Anagha Sep 06 '17 at 09:10
  • Reading a `data.frame` from csv with columns containing `NA`, is not affecting the `class` for me. I get the columns 'Zone-1' and 'Zone-3' as `numeric` only. If your data itself contains `NA` as a string "NA" then you can specify that in the read.csv setting this parameter `na.strings = c("NA")`. – tushaR Sep 06 '17 at 10:00

3 Answers3

4

For what I understand of your question, what you want or need is really, really simple.

First, read the data in.

df <- read.table(text = "
ZONE-1        Zo-A         Zone-3        Zo-B
 58            On             75          NA
 60            NA             NA          High
 NA            Off            68          Low
 70            On             NA          NA
", header = TRUE, check.names = FALSE)

str(df)
'data.frame':   4 obs. of  4 variables:
 $ ZONE-1: int  58 60 NA 70
 $ Zo-A  : Factor w/ 2 levels "Off","On": 2 NA 1 2
 $ Zone-3: int  75 NA 68 NA
 $ Zo-B  : Factor w/ 2 levels "High","Low": NA 1 2 NA

df
  ZONE-1 Zo-A Zone-3 Zo-B
1     58   On     75 <NA>
2     60 <NA>     NA High
3     NA  Off     68  Low
4     70   On     NA <NA>

Now, question (1), "first get the column names which consists of String values". All column names consist of string values so this can be done either with names or with colnames.

names(df)
[1] "ZONE-1" "Zo-A"   "Zone-3" "Zo-B" 

colnames(df)
[1] "ZONE-1" "Zo-A"   "Zone-3" "Zo-B" 

Now question (2), to get the column index of "the same". (I assume it's of column Zo-A you are asking for.)

a <- which(colnames(df) == "Zo-A")
a
[1] 2

a2 <- grep("Zo-A", colnames(df))
a2
[1] 2

Data in dput format.

df <-
structure(list(`ZONE-1` = c(58L, 60L, NA, 70L), `Zo-A` = structure(c(2L, 
NA, 1L, 2L), .Label = c("Off", "On"), class = "factor"), `Zone-3` = c(75L, 
NA, 68L, NA), `Zo-B` = structure(c(NA, 1L, 2L, NA), .Label = c("High", 
"Low"), class = "factor")), .Names = c("ZONE-1", "Zo-A", "Zone-3", 
"Zo-B"), class = "data.frame", row.names = c(NA, -4L))

Edit
If you need to get only the column names composed of alphabetic characters and punctuation marks, you can use the following regular expression.

a3 <- grep("^[[:alpha:]|[:punct:]]*$", colnames(df))
a3
[1] 2 4
Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
  • Thanks. But, this approach doesn't fit my scenario. **1.** I'm reading the file using read.csv **2.** **I need only those column names, which consists of values like "on", "off", etc. Not the columns which consists of numerical values.** **3.** I have a total of 100 columns, out of which around 50 columns consists of values like "On", "off" etc. Hence would be difficult to follow the above approach to find the column names – Anagha Sep 06 '17 at 08:17
  • @Anagha If you need only the column names, maybe you should try reading them in with `readLines`. Set argument `n = 1` and it will read at most one line, the first where the column headers are supposed to be. If this works, you will get a character vector of those names, no need for extra work. Then use `which` or `grep` to get the indices of the columns you want. – Rui Barradas Sep 06 '17 at 08:28
  • No, readLines is not giving the result i need. And would prefer, read.csv – Anagha Sep 06 '17 at 08:44
  • @Anagha I've just edited my answer, see if the issue can be solved with that final `grep`. – Rui Barradas Sep 06 '17 at 09:09
  • Sorry, it did not. – Anagha Sep 06 '17 at 09:13
  • @Anagha What went wrong? Can you make your question more clear then? – Rui Barradas Sep 06 '17 at 09:28
1

To obtain this we can use the code below:

K=sapply(df,function(x)any(grepl("\\D+",x)))
 names (df)[K]
    Zo.A Zo.B 

 Which (k)
   Zo.A Zo.B 
     2    4 
Onyambu
  • 67,392
  • 3
  • 24
  • 53
  • I need to first figure out the column names, then the index. As I have mentioned earlier, my actual dataset consists around 100 columns, out of which few columns have the alpha values. Hence I need to get that first – Anagha Sep 06 '17 at 12:19
1

While reading the data.frame you can specify 'stringsAsFactors=FALSE' and if your data itself contains NA as a string "NA" then you can specify that in the read.csv setting this parameter na.strings = c("NA")

df = read.csv('file.csv',header=T,stringsAsFactors=FALSE,na.strings=c("NA"))

Then try:

type = sapply(df,class) 
indexes = which(type=='character')
nameofindexes = names(indexes)
tushaR
  • 3,083
  • 1
  • 20
  • 33