2

Suppose a data:

  df1 <- tibble::tribble(~"M1", ~"M2",  ~"Beer, pints", ~"Coffee, oz",  ~"Gasoline, galons",    ~"Milk, galons",    ~"Warehouse, square feet",  ~"Nearest place, miles",    
    "NY",   "22",   "10",   "12",   "15",   "100",  "100",  "20",
    "NY",   "20",   "9",    "10",   "12",   "100",  "100",  "20",
    "NY",   "18",   "8",    "9",    "11",   "100",  "100",  "20",
    "M1",   "M2",   "Beer, liters", "Coffee, cups (120 ml)",    "Gasoline, liters", "Milk, liters", "Warehouse, square meters", "Nearest place, kilometers",
    "PR",   "22",   "7",    "8",    "9",    "70",   "67",   "7",
    "PR",   "20",   "6",    "7",    "8",    "80",   "75",   "7",
    "M1",   "M2",   "Beer, pints",  "Coffee, oz",   "Gasoline, liters", "Milk, liters", "Warehouse, square feet",   "Nearest place, miles",
    "KR",   "22",   "6",    "6",    "7",    "60",   "50",   "9",
    "KR",   "20",   "5",    "6",    "8",    "55",   "65",   "9",
    "KR",   "18",   "5",    "6",    "8",    "50",   "55",   "9")

For visual representation: enter image description here

Is there a nice method to recalculate all columns in the same metrics (like if it is liters, then the entrire column should be liters; if miles (not kilometers), then the entire column to be miles [based on condition in the subheadings inside]? It could be great to think on the nicest methods to solve it.

PS: for information:

1 gallon = 3.78541 liters

1 pint = 0.473176 liters

1 oz = 0.0295735 liters

11 square feet = 1.02193 square meters

1 mile = 1.60934 kilometers

I am just wondering and just started to consider for solution. I am interested to look for possible nice solutions. In addition, it will be interesting for the entire R community to think on the best methods to edit the data by condition.

Sotos
  • 51,121
  • 6
  • 32
  • 66

1 Answers1

1

When the data is sloppy, we must also get our hands dirty.I thought of way, with many steps.

Data

df1 <-
structure(list(m1 = c("M1", "NY", "NY", "NY", "M1", "PR", "PR", 
"M1", "KR", "KR", "KR"), m2 = c("M2", "22", "20", "18", "M2", 
"22", "20", "M2", "22", "20", "18"), beer = c("Beer, pints", 
"10", "9", "8", "Beer, liters", "7", "6", "Beer, pints", "6", 
"5", "5"), coffee = c("Coffee, oz", "12", "10", "9", "Coffee, cups (120 ml)", 
"8", "7", "Coffee, oz", "6", "6", "6"), gasoline = c("Gasoline, galons", 
"15", "12", "11", "Gasoline, liters", "9", "8", "Gasoline, liters", 
"7", "8", "8"), milk = c("Milk, galons", "100", "100", "100", 
"Milk, liters", "70", "80", "Milk, liters", "60", "55", "50"), 
    warehouse = c("Warehouse, square feet", "100", "100", "100", 
    "Warehouse, square meters", "67", "75", "Warehouse, square feet", 
    "50", "65", "55"), nearest_place = c("Nearest_place, miles", 
    "20", "20", "20", "Nearest place, kilometers", "7", "7", 
    "Nearest place, miles", "9", "9", "9")), row.names = c(NA, 
-11L), class = c("tbl_df", "tbl", "data.frame"))

Convert function

convert_unit <- function(value,unit){
  
  m <- 
  case_when(
    unit == "galons" ~ 3.78541,
    unit == "pints" ~ 0.473176,
    unit == "oz" ~ 0.0295735,
    unit == "squarefeet" ~ 1.02193/11,
    unit == "miles" ~ 1.02193/11,
    TRUE ~ 1
    )

  output <- m*as.numeric(value)
  
  return(output)
}

Data preparation

First, I would add the header as the first row and also create better names.

library(dplyr)
library(stringr)
library(tidyr)
#remotes::install_github("vbfelix/relper")
library(relper)


or_names <- names(df1)
new_names <- str_to_lower(str_select(or_names,before = ","))
n_row <- nrow(df1)

df1[2:(n_row+1),] <- df1
df1[1,] <- as.list(or_names)
names(df1) <- new_names

Data manipulation

Then, I would create new columns with the units, and the apply the function to each one.

df1 %>% 
  mutate(
    across(.cols = -c(m1:m2),.fns = ~str_keep(str_select(.,after = ",")),.names = "{.col}_unit"),
    aux = beer_unit == "",
    across(.cols = ends_with("_unit"),~if_else(. == "",NA_character_,.))) %>% 
  fill(ends_with("_unit"),.direction = "down") %>%
  filter(aux) %>% 
  mutate(
    across(
      .cols = beer:nearest_place,
      .fns = ~convert_unit(value = .,unit = get(str_c(cur_column(),"_unit")))
      )
  ) %>% 
  select(-aux,-ends_with("_unit"))

Output

# A tibble: 8 x 8
  m1    m2     beer coffee gasoline  milk warehouse nearest_place
  <chr> <chr> <dbl>  <dbl>    <dbl> <dbl>     <dbl>         <dbl>
1 NY    22     4.73  0.355     56.8  379.      9.29         1.86 
2 NY    20     4.26  0.296     45.4  379.      9.29         1.86 
3 NY    18     3.79  0.266     41.6  379.      9.29         1.86 
4 PR    22     7     8          9     70      67            7    
5 PR    20     6     7          8     80      75            7    
6 KR    22     2.84  0.177      7     60       4.65         0.836
7 KR    20     2.37  0.177      8     55       6.04         0.836
8 KR    18     2.37  0.177      8     50       5.11         0.836
Vinícius Félix
  • 8,448
  • 6
  • 16
  • 32
  • Dear Vinicius! Great idea. I have never heard about the "relper" package (it is outside the standard CRAN). I am not sure if this solution is replicable: [...the previous solution code...]+ select(-aux,-ends_with("_unit")) Error in select(., -aux, -ends_with("_unit")) : unused arguments (-aux, -ends_with("_unit")) – Alexander Shemetev Jan 13 '23 at 15:37
  • relper is my personal package, I put my own functions there to help me with data analysis. I rerun the solution, I updated the answer with the data, because I changed the name of the last column. – Vinícius Félix Jan 13 '23 at 15:52
  • Dear Vinicius! Thanks for the update. Wow, it looks fantastic (I mean the package). Still I have replicability issue: + select(-aux,-ends_with("_unit")) Error in select(., -aux, -ends_with("_unit")) : unused arguments (-aux, -ends_with("_unit")) Maybe you have them in your system (aux, ends_with("_unit") from your analysis, but they are not in the code. Could you, please, share your full solution? I will check your package later, it sounds good :) – Alexander Shemetev Jan 13 '23 at 16:07
  • I run in a clean environment, this variables are create in the first mutate. Are you running both codes? – Vinícius Félix Jan 13 '23 at 16:09
  • Or, maybe, some functions in your package are integrated with other packages and have duplicate-named-functions in some other packages I use in my system (like "select" function can be of dplyr, and can be of three different packages). – Alexander Shemetev Jan 13 '23 at 16:10
  • Yes, I run both codes in a clean environment (with just many libraries run). – Alexander Shemetev Jan 13 '23 at 16:10
  • That is strange, because the only function of my package are str_keep and str_select, let me try again – Vinícius Félix Jan 13 '23 at 16:11
  • Dear Vinicius! I made all the tests, - it is the package integration issue (like what I wrote about the "select" function example (it can be dplyr, and at least three other packages). I switch off all the packages that could cause misinterpreting, and it finally worked :) I hope this info will help you in developing of your own package :) – Alexander Shemetev Jan 13 '23 at 16:14
  • Could you run `?select`. All packages with a select function will be shown – Vinícius Félix Jan 13 '23 at 16:15
  • Packages used:tidyverse GGally plotly skimr car stargazer texreg lmtest sandwich boot lme4 plm margins data.table,tidyverse,magrittr, tidyr, reshape2, expss, dplyr, ggplot2, purrr, GGally, cluster, readxl, writexl, psych, knitr, ExPanDaR, kableExtra, plm, sampleSelection, nnet, ggmap, scales, RPostgreSQL,readr,lubridate,seasonal,stargazer,merTools,RColorBrewer,colorRamps, nycflights13, scales, zoo, stringr, maps, mapdata, gtrendsR, cdlTools, usmap, rnaturalearth, WDI, tigris, ggrepel, rworldmap, gapminder – Alexander Shemetev Jan 13 '23 at 16:20
  • 'select' was found in the following packages (in not the full set of packages I usually use): Geometric subsetting (in package raster in library E:/Program Files/R-4.2.2/library) Ridge Regression (in package MASS in library E:/Program Files/R-4.2.2/library) Subset columns using their names and types (in package dplyr in library E:/Program Files/R-4.2.2/library) – Alexander Shemetev Jan 13 '23 at 16:23
  • Two tips, you can write `dplyr::select` and also there is the package `conflicted` that help in this cases – Vinícius Félix Jan 13 '23 at 16:25
  • Dear Vinicius! Yes, thanks, I know. I use it like this dplyr::select (if it is select from the dplyr package). When you create a package, sometimes package conflict (that you use a supportive function) can make your entire function fail to perform operations (I think it was this case). Loading additional packages makes your solution stop working, which is sad because your solution is good :) I added my packages I use to help you to develop your own package and make it fully robust. – Alexander Shemetev Jan 13 '23 at 16:29