0

I have a file that is in txt format and delimited by tabs, here is an extract:

id      1     2     4     15      18       20
1_at   100   200   89    189     299      788
2_at     8    78   33     89      90       99
3_xt   300    45   53    234      89       34
4_dx    49    34   88      8       9       15

Now I have a file, also in txt format, and separated with commas with the following data:

18,1,4,20

So based on that file, I would like to read it and extract only the columns from the first tabulated data so I can store in another file like this:

(important: I need to keep the order of the data to be stored according to the csv file)

id      18       1     4      20
1_at   299     100    89     788
2_at    90       8    33      99
3_xt    89     300    53      34
4_dx     9      49    88      15

If the data that I want to extract would be the rows it would be easier because I could read it row by row and comparing to my txt file (I have already done that), but I am stuck with this column stuff.

I wonder if there is any way to extract directly the columns using some subindex function?

Any help will be greatly appreciated.

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Layla
  • 5,234
  • 15
  • 51
  • 66
  • [What have you tried?](http://whathaveyoutried.com) –  Feb 08 '13 at 15:50
  • 1
    Column names cannot start with digits. – CHP Feb 08 '13 at 15:53
  • @Jack Maney I have got a similar issue what the thing that I wanted to extract was from the rows, so I used two loops one that reads the txt file and the other that reads the tabulated format. The comparison is made sequentially and when the data matches I stored in another file. With this thing of the columns I dont know how I can do that. Only maybe to transpose the data first and treat it like if its in rows and then transpose that again, but the file is huge so I dont consider that a good solution – Layla Feb 08 '13 at 15:55
  • @geektrader that is the format of the file that I need to extract – Layla Feb 08 '13 at 15:56
  • 1
    @Manuel - The file format is irrelevant. R will not abide by column names that start with digits. –  Feb 08 '13 at 15:58
  • @JackManey not always true and not entirely false either. http://stackoverflow.com/questions/3423085/r-numbers-as-column-names-of-data-frames see reply by Marek. A bit hackish though. – RJ- Feb 08 '13 at 16:29
  • @JackManey That's not true. It is recommended to set `check.names = TRUE` in `read.table` or `data.frame` but not compulsory. – Roland Feb 08 '13 at 16:30

1 Answers1

2

This one way of doing what you want to do if you don't want to read whole file and then filter. This way will also work if you don't know classes of each column before hand. This example assumes that input file has column names not starting with digits as it is not allowed in R. It will work with digits as column name as well, but be cautious as it may fail in other operations.

> txt = 'id      1     2     4     15      18       20
+ 1_at   100   200   89    189     299      788
+ 2_at     8    78   33     89      90       99
+ 3_xt   300    45   53    234      89       34
+ 4_dx    49    34   88      8       9       15'
> 
> df <- read.table(textConnection(txt), header=T, nrows=1, check.names=F)
> 
> df
    id   1   2  4  15  18  20
1 1_at 100 200 89 189 299 788
> 
> #Lets say f is the column filter you read from other file
> f <- c("id", "18", "1", "4", "20")
> 
> f
[1] "id" "18" "1"  "4"  "20"
> 
> #Get Column Classes
> CC <- sapply(df, class)
> CC
       id         1         2         4        15        18        20 
 "factor" "integer" "integer" "integer" "integer" "integer" "integer" 
> #Specify columns that you don't want to read as "NULL" 
> CC[!names(CC) %in% f] <- "NULL"
> CC
       id         1         2         4        15        18        20 
 "factor" "integer"    "NULL" "integer"    "NULL" "integer" "integer" 
> 
> #Read whole data frame again
> df <- read.table(textConnection(txt), header=T, colClasses=CC, check.names=F)
> 
> #get columns in desired order
> df[,f]
    id  18   1  4  20
1 1_at 299 100 89 788
2 2_at  90   8 33  99
3 3_xt  89 300 53  34
4 4_dx   9  49 88  15
CHP
  • 16,981
  • 4
  • 38
  • 57
  • thanks, but if the data is inside a file I cannot use read.table, how I can handle that? – Layla Feb 08 '13 at 16:59
  • yes @geektrader, I have tried the solution, but I need to keep the order of the data. If you see in my example I need that the data would be stored according to the labels like id 18 1 4 20, and not in an ordered fashion – Layla Feb 08 '13 at 17:33
  • @Manuel you can use `df[,f]` – CHP Feb 08 '13 at 17:44