0

Below is the first dataframe where I want to remove the first 3 rows:

book1 <- structure(list(Instructions..xyz = c("Note: abc", "", "Set1", 
                                              "id", "632592651", "633322173", "634703802", "634927873", "635812953", 
                                              "636004739", "636101211", "636157799", "636263106", "636752420"
), X = c("", "", "", "title", "asdf", "cat", "dog", "mouse", 
         "elephant", "goose", "rat", "mice", "kitty", "kitten"), X.1 = c("", 
                                                                         "", "", "hazard", "y", "y", "y", "n", "n", "y", "y", "n", "n", 
                                                                         "y"), X.2 = c("", "", "Set2", "id", "632592651", "633322173", 
                                                                                       "634703802", "634927873", "635812953", "636004739", "636101211", 
                                                                                       "636157799", "636263106", "636752420"), X.3 = c("", "", "", "title", 
                                                                                                                                       "asdf2", "cat2", "dog2", "mouse2", "elephant2", "goose2", "rat2", 
                                                                                                                                       "mice2", "kitty2", "kitten2"), X.4 = c("", "", "", "index", "0.664883807", 
                                                                                                                                                                              "0.20089779", "0.752228086", "0.124729276", "0.626285086", "0.134537909", 
                                                                                                                                                                              "0.612526768", "0.769622463", "0.682532524", "0.819015658")), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                                                -14L))

I did book1 <- book1[-c(1:3),] but I'm not sure how to make id, title, hazard, id, title, index as the column name instead of Instructions..xyz, etc. See image below for desired output

enter image description here

Then for the second dataframe,

book2 <- structure(list(identity = c(632592651L, 633322173L, 634703802L, 
                                     634927873L, 635812953L, 636004739L, 636101211L, 636157799L, 636263106L, 
                                     636752420L, 636809222L, 2004722036L, 2004894388L, 2005045755L, 
                                     2005535472L, 2005630542L, 2005788781L, 2005809679L, 2005838317L, 
                                     2005866692L), text = c("asdf_xyz", "cat", "dog", "mouse", "elephant", 
                                                            "goose", "rat", "mice", "kitty", "kitten", "tiger_xyz", "lion", 
                                                            "leopard", "ostrich", "kangaroo", "platypus", "fish", "reptile", 
                                                            "mammals", "amphibians_xyz"), volume = c(1234L, 432L, 324L, 333L, 
                                                                                                     2223L, 412346L, 7456L, 3456L, 2345L, 2345L, 6L, 345L, 23L, 2L, 
                                                                                                     4778L, 234L, 8675L, 3459L, 8L, 9L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                              -20L))

I then rename column 1 and 2 in book2 so that it matches that of book1 by names(book2)[1:2] <- c('id','title') where I can later do inner_join. The desired output is shown in the image below by

library(dplyr)
book1 %>%
  inner_join(book2, by = c("id", "title")) 

enter image description here

This is taking quite a few steps and wondering if there's a simplified version to this?

Catalyst
  • 426
  • 3
  • 12
  • 1
    It looks like you're reading in a CSV or other text file. I think it would be better to get the column names you want during the import rather than trying to wrangle them once you've imported the (undesirable) current format. if I'm correct, please provide a sample input file and the code you use to read it. – Limey Jun 25 '22 at 11:03
  • For research purpose I think it's better to retain the original csv file which anyone can download from the primary source and then I provide R script on how I do the data wrangling for further analysis. It's more transparent this way. – Catalyst Jun 25 '22 at 11:31

2 Answers2

0

Something like this?

# split the data by columns
book2a <- book1[-(1:4), 1:3]
book2b <- book1[-(1:4), 4:6]

# take care of names
names(book2a) <- book1[4, 1:3, drop = TRUE]
names(book2b) <- book1[4, 4:6, drop = TRUE]

# book2b needs processing
book2b$title <- sub("2", "", book2b$title)
book2b$index <- as.numeric(book2b$index)

# join both data sets and clean-up
book2 <- merge(book2a, book2b, all = TRUE)
rm(book2a, book2b)

book2
#>           id    title hazard     index
#> 1  632592651     asdf      y 0.6648838
#> 2  633322173      cat      y 0.2008978
#> 3  634703802      dog      y 0.7522281
#> 4  634927873    mouse      n 0.1247293
#> 5  635812953 elephant      n 0.6262851
#> 6  636004739    goose      y 0.1345379
#> 7  636101211      rat      y 0.6125268
#> 8  636157799     mice      n 0.7696225
#> 9  636263106    kitty      n 0.6825325
#> 10 636752420   kitten      y 0.8190157

Created on 2022-06-25 by the reprex package (v2.0.1)

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
0

Found the solution to the first question

library(janitor)
book1 <- row_to_names(dat=book1, row_number=4, remove_row = TRUE, remove_rows_above = TRUE)

I applied

names(book1)[4:5] <- c('id1','title1')

to obtain unique column name, then tried inner_join as proposed earlier but with error and found that book1$id is character where book2$id is int and so I did

book1$id <- as.integer(book1$id)

and finally it works with

library(tidyverse)
Yeah <- book1 %>%
  inner_join(book2, by = c("id", "title")) 

Output below:

 id    title hazard       id1    title1       index volume
1 633322173      cat      y 633322173      cat2  0.20089779    432
2 634703802      dog      y 634703802      dog2 0.752228086    324
3 634927873    mouse      n 634927873    mouse2 0.124729276    333
4 635812953 elephant      n 635812953 elephant2 0.626285086   2223
5 636004739    goose      y 636004739    goose2 0.134537909 412346
6 636101211      rat      y 636101211      rat2 0.612526768   7456
7 636157799     mice      n 636157799     mice2 0.769622463   3456
8 636263106    kitty      n 636263106    kitty2 0.682532524   2345
9 636752420   kitten      y 636752420   kitten2 0.819015658   2345

Still wondering if there's a quicker way?

Catalyst
  • 426
  • 3
  • 12