I have a variable with 75 levels that I would like to label. However, I find it difficult to do so without assigning a wrong label to one of the levels.

As you know, creating a factor with its levels and labels is done like this:

df$f <- factor(df$f, levels=c('a','b','c'),
  labels=c('Treatment A','Treatment B','Treatment C'))

Is there a way to code this differently, so that each label is written next to its level? I'm looking for code with this structure:

'a' = 'Treatment A'
'b' = 'Treatment B'
'c' = 'Treatment C'

Thanks in advance

ebay
  • If it's "only" about adding 'Treatment' to your levels (perhaps with some formatting), consider `paste` to save some typing (and potential errors...). – Henrik Jan 18 '22 at 21:35
  • can you say more about where the names and levels are coming from? Are they stored in a file, do they have a regular structure as shown here (e.g. `labels <- paste("Treatment", toupper(levels))`), ... ? – Ben Bolker Jan 18 '22 at 21:42
  • @BenBolker I have a dataset with codes for the specific location of diagnosed cancers. The data mostly uses ICD-10 codes, but not exclusively. For the study I am working on, there are 75 subsites. In general, I can prepare a file with a regular structure such as: 'C00.0' = 'Upper lip cancer' 'C00.1' = 'Lower lip cancer' – ebay Jan 18 '22 at 22:20
  • @BenBolker Thanks for the great tip. I actually have several lists of values for variables. I am currently working with a data frame with over 300 variables, and cancer site is only one of them. There are many variables that I need to format. Previously, I worked with SAS and had labels assigned for those variables. I will probably create an Excel file with several sheets for all those levels and labels. Do you have any other useful tips regarding formatting? – ebay Jan 20 '22 at 19:20
  • @BenBolker For example, is there a way to automate the process, so that R searches the Excel file for variable names with the suffixes "_label" and "_level" and automatically applies those to the corresponding variable? – ebay Jan 20 '22 at 19:42
  • Yes. Maybe post that as a new question? – Ben Bolker Jan 20 '22 at 19:44
  • @BenBolker I have posted it here. Thanks https://stackoverflow.com/questions/70793773/how-to-automate-adding-factors-to-variables-in-large-data-frame-in-r – ebay Jan 20 '22 at 21:42

2 Answers

You could use a named vector for your level-label pairs and convert to a factor like so:

foo <- c("a", "c", "b")

rec <- c(
  "a" = "Treatment A",
  "b" = "Treatment B",
  "c" = "Treatment C"
)

factor(foo, levels = names(rec), labels = rec)
#> [1] Treatment A Treatment C Treatment B
#> Levels: Treatment A Treatment B Treatment C
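
One thing to watch with this approach: any value that is not among `names(rec)` silently becomes `NA`. A minimal sketch of a check to run before the conversion, reusing `rec` from above; the vector `foo2` and its stray value "d" are made up for illustration:

```r
rec <- c(
  "a" = "Treatment A",
  "b" = "Treatment B",
  "c" = "Treatment C"
)

# Hypothetical input containing a code ("d") that has no label in rec
foo2 <- c("a", "c", "b", "d")

# Codes present in the data but missing from the lookup vector
unmapped <- setdiff(unique(foo2), names(rec))
unmapped

# Without the check, the unmapped value is turned into NA
f2 <- factor(foo2, levels = names(rec), labels = rec)
sum(is.na(f2))
```

With 75 levels, an empty `unmapped` is a quick confirmation that every code in the data has a label.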
stefan

If you have a long list of equivalences, it's generally a good workflow to keep it in a separate file, e.g. `icdcodes.csv`, containing

code,descr
C00.0,Upper lip cancer
C00.1,Lower lip cancer
...

Then you could do:

codeinfo <- read.csv("icdcodes.csv")
factor(foo, levels = codeinfo$code, labels = codeinfo$descr)
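
For a self-contained sketch of the same idea, the lookup table can be read from inline text instead of a file. The two rows come from the example above; the vector `foo` of codes is made up, and `trimws()` is included only as a guard against stray spaces in hand-typed codes:

```r
# Inline stand-in for icdcodes.csv (same two rows as above)
codeinfo <- read.csv(
  text = "code,descr
C00.0,Upper lip cancer
C00.1,Lower lip cancer",
  stringsAsFactors = FALSE
)

# Guard against stray whitespace around hand-typed codes
codeinfo$code <- trimws(codeinfo$code)

# Hypothetical data column of codes
foo <- c("C00.1", "C00.0", "C00.1")

site <- factor(foo, levels = codeinfo$code, labels = codeinfo$descr)
table(site)
```

Each code and its description sit on the same row of the file, which is the side-by-side layout asked for in the question.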

Ideally, you could even get the ICD-10 descriptions straight from the CDC, although in practice this probably won't work for you because the descriptions are longer than yours (e.g. C000 is "Malignant neoplasm of external upper lip", not "Upper lip cancer"). Also note that the CDC file doesn't have a dot separator in the codes: C000 rather than C00.0.

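icd_url <- "https://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2022/icd10cm_codes_2022.txt"
# Read as fixed-width text: first 8 characters are the code, next 100 the description
codeinfo <- read.fwf(icd_url, widths = c(8,100))
names(codeinfo) <- c("code", "descr")
# Remove the padding around the codes
codeinfo$code <- trimws(codeinfo$code)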
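
If you go this route and your data use dotted codes, the two sets have to be put in the same format before matching. A minimal sketch, assuming the dot is the only difference; the vector `my_codes` is hypothetical:

```r
# Hypothetical dotted codes as they might appear in the data
my_codes <- c("C00.0", "C00.1")

# Remove the literal dot so the codes match the undotted format
undotted <- sub(".", "", my_codes, fixed = TRUE)
undotted
```

`fixed = TRUE` matters here: without it, `"."` is a regular expression that matches any character, so the first letter would be removed instead of the dot.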
Ben Bolker