0

I have 2 data frames (drug and class) which I need to join by last level of ATC classification code and also add 4 additional columns with corresponding parent levels.

I came up with 2 solutions, but first one is quite verbose and second one is using MS Access (which I want to avoid). Moreover in case I have more levels, the code would by muuuch verbose then this one. Is there any more elegant solution to this problem? How can I perform this kind of self join in R as I did in Access? I am quite begginer in R and SQL so little explanation will be appreciated :)

Sample Data info

  • drug

    • cols: ID, ProductName, level5
    • each row is one product (drug) with unique ID and level5 code of ATC classification (see Wiki-ATC)
  • class

    • cols: classCode, className
    • classCode contains all levels of ATC classification in the same column, level1-level5 of ATC
    • note: this table I can only read.

Short clarification about those classifications and levels

In drug$level5 we have Level5 classCodes:
Level5 - A10BA02 (metformin). It is a member of level 4 - A10BA (biguanides), level 3 - A10B (antidiabetics, ex.insulins), level 2 - A10 (antidiabetics), level 1 - A (Alimentary tract and metabolism)
Each level is strictly defined by its length (L1 = 1char., L2 = 3chars., L3 = 4chars., L4 = 5chars., L5 = 7chars.)

| Level   | Code    | Name                            |
|---------|---------|---------------------------------|
| Level5* | A10BA02 | metformin                       |
| Level4  | A10BA   | biguanides                      |
| Level3  | A10B    | antidiabetics, ex. insulins     |
| level2  | A10     | antidiabetics                   |
| Level1  | A       | alimentary tract and metabolism |

Sample data

drug <- data.frame(ID = 1:5,
                   ProductName = c('ABC', 'CDE', 'FGH', 'IJK', 'LMN'),
                   level5 = c('A10BA02', 'C01BA02', 'C03CA01', 'C03CA03', 'C01BA02'), 
                   stringsAsFactors = F)

class <- data.frame(code = c('A', 'A10', 'A10B', 'A10BA', 'A10BA02', 'C', 'C01', 'C01B', 'C01BA',
                            'C01BA02', 'C03', 'C03C', 'C03CA', 'C03CA01', 'C03CA03', 'C07', 'C07A',
                            'C07AA', 'C07AA03'),
                    className = c('Alimentary tract and metabolism',
                                  'Antidiabetics', 'Antidiabetics, except insulins',
                                  'Biguanides', 'Metformin', 'Cardiovascular system',
                                  'Cardiacs', 'Antiarythmics, grp I and III',
                                  'Antiarythmics, grp IA', 'Procainamide', 'Diuretics',
                                  'Diuretics strong', 'Sulfonamides', 'Furosemide',
                                  'Piretanide', 'Betablockers', 'Betablockers',
                                  'Non-selective betablockers', 'Pindolol'), 
                    stringsAsFactors = F)
# print
drug
head(class, 8)

Goal

I want to left join class on drug data frame with resulting df as follows: Resulting table should have additional columns, each column for each Level from 1 to 5. The goal is to create a filtering hierarchy where you first filter products by Level1, then Level2, and so on…

+----+-------------+-------------------------------------+---------------------+---------------------------------------+-------------------------------+------------------------+
| ID | ProductName | L1                                  | L2                  | L3                                    | L4                            | L5                     |
+----+-------------+-------------------------------------+---------------------+---------------------------------------+-------------------------------+------------------------+
| 1  | ABC         | A - Alimentary tract and metabolism | A10 - Antidiabetics | A10B - Antidiabetics, except insulins | A10BA - Biguanides            | A10BA02 - Metformin    |
+----+-------------+-------------------------------------+---------------------+---------------------------------------+-------------------------------+------------------------+
| 2  | CDE         | C - Cardiovascular system           | C01 - Cardiacs      | C01B - Antiarythmics, grp I and III   | C01BA - Antiarythmics, grp IA | C01BA02 - Procainamide |
+----+-------------+-------------------------------------+---------------------+---------------------------------------+-------------------------------+------------------------+
...

My dirty solution N.1 using R only

I came up with not pretty and quite verbose solution where I mutate drug$level5 with substr() for each level. Then perform left_join() and after unite() columns.

library(tidyr)
library(dplyr)

sol1 <- drug %>%
  mutate(level1 = substr(level5, 1, 1),
         level2 = substr(level5, 1, 3),
         level3 = substr(level5, 1, 4),
         level4 = substr(level5, 1, 5)) %>%
  left_join(class, by = c('level1' = 'code')) %>%
  left_join(class, by = c('level2' = 'code')) %>%
  left_join(class, by = c('level3' = 'code')) %>%
  left_join(class, by = c('level4' = 'code')) %>%
  left_join(class, by = c('level5' = 'code')) %>%
  select(ID:level4, 
         level1name = className.x,
         level2name = className.y,
         level3name = className.x.x,
         level4name = className.y.y,
         level5name = className
         ) %>%
  unite(L1, level1, level1name, sep = ' - ') %>%
  unite(L2, level2, level2name, sep = ' - ') %>%
  unite(L3, level3, level3name, sep = ' - ') %>%
  unite(L4, level4, level4name, sep = ' - ') %>%
  unite(L5, level5, level5name, sep = ' - ') 

My solution N.2 using Access self join

Another solution was to reshape class table in MS Access with self join a create additional columns for each level and then simply left join this table on drug df in R.

--- sqlReshapedTable
SELECT A.code AS L5,
       A.className AS className,

       L1.code + ' ' + L1.Name AS L1,
       L2.code + ' ' + L2.Name AS L2,
       L3.code + ' ' + L3.Name AS L3,
       L4.code + ' ' + L4.Name AS L4
FROM 
(((class AS A
INNER JOIN class AS L1 ON L1.code = LEFT(A.code, 1))
INNER JOIN class AS L2 ON L2.code = LEFT(A.code, 3))
INNER JOIN class AS L3 ON L3.code = LEFT(A.code, 4))
INNER JOIN class AS L4 ON L4.code = LEFT(A.code, 5);
sol2 <- drug %>% 
  left_join(sqlReshapedTable, by = c('level5' = 'Code'))

Thanks a lot for any help !

Jakub Cee
  • 3
  • 3
  • If you're working with data frames and want to use SQL, I recommend the `sqldf` package, which allows you to execute arbitrary SQL statements against data frames using SQLite from R. The Access solution seems fine, and you can port this to sqldf rather easily, you only need to replace `LEFT` with `SUBSTR`, the rest should just work. – Erik A May 24 '19 at 12:49

1 Answers1

0

Maybe not the best possible solution, but seems to work in your case (I call your data frame class by dclass):

library(tidyverse)

drug %>%
  group_by(ID, ProductName) %>%
  summarise(
    code = list(map_chr(c(1, 3:5, 7), ~ gsub(sprintf('(^.{%s}).*', .x), '\\1', level5)))
    ) %>%
  unnest %>%
  left_join(dclass, by = 'code') %>% 
  rename_all(tolower) %>%
  mutate(
    key       = paste('L', str_count(code, '\\D|\\d+'), sep = ''),
    val       = paste(code, classname, sep = ' - '),
    classname = NULL,
    code      = NULL
    ) %>%
  spread(key, val) %>%
  ungroup() %>%
  arrange(L5) %>%
  rename('ID' = id, 'Product Name' = productname) 

Which outputs:

# A tibble: 5 x 7
#     ID `Product Name` L1                   L2           L3                    L4                L5           
#  <int> <chr>          <chr>                <chr>        <chr>                 <chr>             <chr>        
#1     1 ABC            A - Alimentary trac… A10 - Antid… A10B - Antidiabetics… A10BA - Biguanid… A10BA02 - Me…
#2     2 CDE            C - Cardiovascular … C01 - Cardi… C01B - Antiarythmics… C01BA - Antiaryt… C01BA02 - Pr…
#3     5 LMN            C - Cardiovascular … C01 - Cardi… C01B - Antiarythmics… C01BA - Antiaryt… C01BA02 - Pr…
#4     3 FGH            C - Cardiovascular … C03 - Diure… C03C - Diuretics str… C03CA - Sulfonam… C03CA01 - Fu…
#5     4 IJK            C - Cardiovascular … C03 - Diure… C03C - Diuretics str… C03CA - Sulfonam… C03CA03 - Pi…
utubun
  • 4,400
  • 1
  • 14
  • 17