4

I work with data where most header name that are very long strings. These are cryptic but contain important details that cannot be forgotten. Long column names are difficult to work with for various display reasons as well as programmatic ones. To help with this, I typically retain the original column names as Hmisc labels & rename the columns with uninformative names like V1, V2, V3... etc or with some truncated (but still long & often not unique) version of the long name.

library(Hmisc)
myDF <- read.csv("someFile.csv")
myLabels <- colnames(myDF)
label(myDF, self=FALSE) <- myLabels
colnames(myDF) <- paste0("V", 1:ncol(myDF))

I can now work with the short names V & still look up the labels to get the original names. However, this is still less than satisfactory... myDF is now composed of class "labelled" and contains character vectors although my data is numeric in nature. Converting to numeric or even subsetting myDF will cause the labels to be dropped. Does anyone have some better suggestions? In particular I need to subset data, & I also find indexing by number to be clumsy & error prone.

Due to large data relative to RAM, I cannot keep copies of both numeric & "labelled" data.frames. I have also tried creating hash objects using the hash package:

library(hash)
myHash <- hash(colnames(myDF), label(myDF))

Or via lists:

nameList <- list()
for(name in colnames(myDF)) {
  nameList[[name]] <- label(myDF)[name]
}

But... I also find these unsatisfactory mostly because they can fall out of synch with myDF after various manipulations & they are not accessible from the same object. Perhaps I just need to be more diligent.

Lastly, I thought that perhaps a solution would be a custom class that contains a data.frame & some other data structures to know the very meaningless terse name, the verbose & non-unique nickname, & the true variable name. But this would require overloading all the indexing operators & is likely way over my head skill wise.

So any other purposed solutions? Any help appreciated.

raco
  • 67
  • 5

1 Answers1

1

You could take a relational database kind of approach here. Create a separate data.frame that expresses the associations between the abbreviated and long names.

library(Hmisc)
myDF <- read.csv("someFile.csv")
LongNames <- colnames(myDF)
colnames(myDF) <- paste0("V", 1:ncol(myDF))
ShortNames <- colnames(myDF)
NameTable <- cbind(LongNames, ShortNames)

Even if your data is later manipulated, the associations between short and long names for variables should remain unchanged. Of course, each time you create a new variable that requires a long name, you'll need to add a new row to the NameTable, but you'd need to put that long name somewhere anyway.

To retrieve the long name easily using the short name, you could define a function for that purpose.

L = function(x){NameTable[which(ShortNames == x),1]}
L(V3) #gives long name of V3
Paul de Barros
  • 1,170
  • 8
  • 22
  • Could you elaborate on why this might be more user friendly or useful than the hash? – raco Mar 04 '15 at 21:46
  • I am not familiar with `hash`, but you mentioned that it falls out of sync after various manipulations. So far as I know, no data manipulation should change the _name_ of a variable in a data.frame, so that unchanging element would continue to be associated with the long name stored in `NameTable` in my example. If you dropped or reordered the columns, the each variable name would continue to be associated with the appropriate long name. Or am I not understanding what you mean by "fall out of sync"? – Paul de Barros Mar 04 '15 at 23:45
  • In the answer, I added a function for quickly retrieving the long name. – Paul de Barros Mar 05 '15 at 00:36
  • 1
    @raco I agree this answer sounds just like the hash solution, but I also agree with Paul that it's not at all clear what you mean by "fall out of sync". What are you doing that disrupts a 1-1 column name to label match? An example would be very helpful. – Gregor Thomas Mar 05 '15 at 01:38