I work with data in which most column headers are very long strings. They are cryptic but contain important details that must not be lost. Long column names are difficult to work with, both for display and programmatically. To help with this, I typically retain the original column names as Hmisc labels & rename the columns with uninformative names like V1, V2, V3, etc., or with some truncated (but still long & often non-unique) version of the long name.
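library(Hmisc)
myDF <- read.csv("someFile.csv")
# Keep the original long column names as per-column Hmisc labels
myLabels <- colnames(myDF)
label(myDF, self=FALSE) <- myLabels
# Replace the long names with short generic ones: V1, V2, ...
colnames(myDF) <- paste0("V", seq_len(ncol(myDF)))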
I can now work with the short names V1, V2, ... & still look up the labels to get the original names. However, this is still less than satisfactory: the columns of myDF are now of class "labelled" and contain character vectors, although my data is numeric in nature. Converting to numeric, or even subsetting myDF, causes the labels to be dropped. Does anyone have better suggestions? In particular, I need to subset the data, & I also find indexing by number clumsy & error-prone.
Because the data is large relative to my RAM, I cannot keep copies of both the numeric & the "labelled" data.frames. I have also tried creating hash objects using the hash package:
library(hash)
myHash <- hash(colnames(myDF), label(myDF))
Or via lists:
nameList <- list()
for(name in colnames(myDF)) {
nameList[[name]] <- label(myDF)[name]
}
But I also find these unsatisfactory, mostly because they can fall out of sync with myDF after various manipulations & because they are not accessible from the same object. Perhaps I just need to be more diligent.
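To make the sync problem concrete, here is a minimal base-R sketch. The data frame toyDF, the long names, and the variable names nameMap, subDF and subMap are all invented for illustration and stand in for my real data; Hmisc and hash are not needed for this part.

```r
# Toy stand-in for the real data; these long names are made up for illustration.
longNames <- c("Sample 01 / intensity / channel A (normalised)",
               "Sample 01 / intensity / channel B (normalised)",
               "Sample 02 / intensity / channel A (normalised)")
toyDF <- data.frame(c(1.5, 2.5), c(3, 4), c(5, 6))
colnames(toyDF) <- paste0("V", seq_len(ncol(toyDF)))

# Lookup kept outside the data frame: short name -> original long name.
nameMap <- setNames(longNames, colnames(toyDF))

# Subsetting the data frame does not touch nameMap,
# so the lookup now describes a column (V2) that subDF no longer has.
subDF <- toyDF[, c("V1", "V3")]

# The lookup has to be re-indexed by hand after every such manipulation.
subMap <- nameMap[colnames(subDF)]
```

The columns stay numeric this way, but nothing forces me to remember the last step, which is exactly the diligence problem described above.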
Lastly, I thought that perhaps a solution would be a custom class that contains a data.frame plus some other data structures recording the meaningless terse name, the verbose & non-unique nickname, & the true variable name. But this would require overloading all the indexing operators & is likely way over my head skill-wise.
So, are there any other proposed solutions? Any help is appreciated.