1

I have the following df where two columns ara labeled with the same name:

dput(df_test)
structure(list(X = c("Gen", "ABCB1", "ABCG2", "CES1"), X.1 = c("Prioridad del gen", 
"Candidato", "Candidato", "Candidato"), X.2 = c("Región codificante", 
"2110", "1526", "3533"), X.3 = c("Categoría Reg. Codif.", "intron", 
"intron", "intron"), X.4 = c("Alineamiento múltiple", "No", "No", 
"No"), X.5 = c("Cromosoma", "7", "4", "16"), X.6 = c("Posición inicial", 
"87153584", "89096060", "55855151"), X.7 = c("Posición final", 
"87153585", "89096061", "55855151"), X.8 = c("Tamaño (pb)", "2", 
"2", "1"), X.9 = c("Nº pb cob. ? 15X", "0", "1", "0"), X.10 = c("Nº pb cob. ? 15X", 
"2", "1", "1"), X.11 = c("% pb cob. ? 15X", "0%", "50%", "0%"
), X.12 = c("Cobertura media", "3", "14,50", "0"), X.13 = c("Nº pb sin cubrir", 
"0", "0", "1"), X.14 = c("Nº pb cob. [1-5]", "2", "0", "0"), 
    X.15 = c("Nº pb cob. [6-14]", "0", "1", "0"), X.16 = c("Nº pb cob. [15-29]", 
    "0", "1", "0"), X.17 = c("Nº pb cob. ? 30X", "0", "0", "0"
    )), class = "data.frame", row.names = c(NA, -4L))

Because the first raw is empty in the original file, the real header becomes part of the df instead of being used as the header. Hence, I use row_to_names to move up the raw containing the names:

df1 <- read.delim("file", header = T) %>% row_to_names(row_number = 1)

Now I need to rename the columns "Nº pb cob. ? 15X" as "Nº pb cob. ≥ 15X" and " Nº pb cob. ≤ 15X", respectively. I've tried with:

  • clean_ rename_at clean_names() after row_to_names() and didn't change anything.

  • rename_vars and rename_at didn't work out either.

    df1 <- rename_at(df1, 10, ~"Num pb cob. ≥ 15X") Error in combine_names(): ! Can't rename duplicate variables to {name}. Run rlang::last_error() to see where the error occurred.

Could any one give me some advice¿?

Thanks!!

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • 2
    I recommend fixing the initial file loading rather than fiddling with the headers afterwards. Can’t you read the table with e.g. `readr::read_tsv`? – Konrad Rudolph Feb 22 '22 at 13:05
  • Please also add the first line of the csv file to teh question and not just `dput(df_test)` so we can replicate the parsing and modification of column names – danlooo Feb 22 '22 at 13:13
  • Thanks Konrad, but I did not consider this option because I have 50 files with the same format, therefore, it would take me a lot of time. – Mireia Boluda Feb 22 '22 at 13:15
  • 1
    With 50 files to fix, invest a little time in automating the fix. I believe @KonradRudolph's advice is very sound. It shouldn't be too difficult. We really need to see a (few) sample(s) of your raw files rather than the df you have after import. – Limey Feb 22 '22 at 13:43
  • 1
    @MireiaBoluda To clarify, my suggestion wasn’t to fix the files but rather to fix the way you’re reading them. As mentioned by danlooo, it would be helpful to see the first few lines of your files to understand why the reading is going so wrong. It’s entirely possible that my suggestion isn’t practical but we’d need to see the files to judge that. – Konrad Rudolph Feb 22 '22 at 13:47

2 Answers2

1

I did not find a way to place those especial characters in a data frame variable name, so I used a minor variation.

The idea was to create a function that cleans your data, that way you can apply this function to all your files.

library(stringr)
library(purrr)

test <- structure(
  list(
    X = c("Gen", "ABCB1", "ABCG2", "CES1"), 
    X.1 = c("Prioridad del gen","Candidato", "Candidato", "Candidato"),
    X.2 = c("Región codificante","2110", "1526", "3533"), 
    X.3 = c("Categoría Reg. Codif.", "intron", "intron", "intron"),
    X.4 = c("Alineamiento múltiple", "No", "No", "No"),
    X.5 = c("Cromosoma", "7", "4", "16"),
    X.6 = c("Posición inicial", "87153584", "89096060", "55855151"),
    X.7 = c("Posición final", "87153585", "89096061", "55855151"), 
    X.8 = c("Tamaño (pb)", "2", "2", "1"), 
    X.9 = c("Nº pb cob. ? 15X", "0", "1", "0"), 
    X.10 = c("Nº pb cob. ? 15X", "2", "1", "1"), 
    X.11 = c("% pb cob. ? 15X", "0%", "50%", "0%"),
    X.12 = c("Cobertura media", "3", "14,50", "0"), 
    X.13 = c("Nº pb sin cubrir", "0", "0", "1"), 
    X.14 = c("Nº pb cob. [1-5]", "2", "0", "0"), 
    X.15 = c("Nº pb cob. [6-14]", "0", "1", "0"), 
    X.16 = c("Nº pb cob. [15-29]", "0", "1", "0"),
    X.17 = c("Nº pb cob. ? 30X", "0", "0", "0")), 
  class = "data.frame", row.names = c(NA, -4L))

# Function to clean the names as you need
clean_df_names <- function(df) {
  df_names <- df[1, ] %>%
    unlist(use.names = FALSE)
  
  repeated_names <- which(df_names == 'Nº pb cob. ? 15X')
  
  
  #name_symbols <- c('\u2265', '\u2264') # these are the unicode symbols, but can not be used in df names
  name_symbols <- c('>=', '<=')
  
  new_names <- purrr::map2_chr(
    df_names[repeated_names], name_symbols,
    ~stringr::str_replace(.x, '\\?', .y)
  )
  
  df_names[repeated_names] <- new_names
  
  new_df <- df[-1, ]
  
  setNames(new_df, df_names)
}

test <- clean_df_names(test)

str(test)
#> 'data.frame':    3 obs. of  18 variables:
#>  $ Gen                  : chr  "ABCB1" "ABCG2" "CES1"
#>  $ Prioridad del gen    : chr  "Candidato" "Candidato" "Candidato"
#>  $ Región codificante   : chr  "2110" "1526" "3533"
#>  $ Categoría Reg. Codif.: chr  "intron" "intron" "intron"
#>  $ Alineamiento múltiple: chr  "No" "No" "No"
#>  $ Cromosoma            : chr  "7" "4" "16"
#>  $ Posición inicial     : chr  "87153584" "89096060" "55855151"
#>  $ Posición final       : chr  "87153585" "89096061" "55855151"
#>  $ Tamaño (pb)          : chr  "2" "2" "1"
#>  $ Nº pb cob. >= 15X    : chr  "0" "1" "0"
#>  $ Nº pb cob. <= 15X    : chr  "2" "1" "1"
#>  $ % pb cob. ? 15X      : chr  "0%" "50%" "0%"
#>  $ Cobertura media      : chr  "3" "14,50" "0"
#>  $ Nº pb sin cubrir     : chr  "0" "0" "1"
#>  $ Nº pb cob. [1-5]     : chr  "2" "0" "0"
#>  $ Nº pb cob. [6-14]    : chr  "0" "1" "0"
#>  $ Nº pb cob. [15-29]   : chr  "0" "1" "0"
#>  $ Nº pb cob. ? 30X     : chr  "0" "0" "0"

Created on 2022-02-22 by the reprex package (v2.0.1)

Johan Rosa
  • 2,797
  • 10
  • 18
0

You can do it manually using backticks:

library(tidyverse)
df <-tibble(`Nº pb cob. ? 15X` = seq(2))
df
#> # A tibble: 2 x 1
#>   `Nº pb cob. ? 15X`
#>                <int>
#> 1                  1
#> 2                  2
rename(df, `Nº pb cob. ≤ 15X` = `Nº pb cob. ? 15X`)
#> # A tibble: 2 x 1
#>   `Nº pb cob. ≤ 15X`
#>                <int>
#> 1                  1
#> 2                  2

Created on 2022-02-22 by the reprex package (v2.0.0)

danlooo
  • 10,067
  • 2
  • 8
  • 22