1

My data.frame looks like this:

name country1 country1 country2 country2 country3
code1 code1 code2 code2 code3
sector1 sector2 sector1 sector2 sector1
country1 ### ### ### ### ###
country2 ### ### ### ### ###
country3 ### ### ### ### ###
country4 ### ### ### ### ###

where ## are numbers.

I want to reshape it like this:

name country code sector number
country1 country2 code 2 sector 1 ###
country1 country2 code 2 sector 2 ###
country1 country3 code 3 sector 1 ###
country1 country3 code 3 sector 2 ###
country2 country2 code 2 sector 2 ###
country3 country1 code 1 sector 1 ###
country4 country3 code 3 sector 1 ###

How can I tidy up my data set? My problem is that the raw data come with this weird format in which there are multiple headers, so that I cannot simply go with pivot_longer.

Many thanks

Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
nflore
  • 196
  • 1
  • 10
  • Can you provide your data in a reproducible format using `dput` ? Read about [how to give a reproducible example](http://stackoverflow.com/questions/5963269) – Ronak Shah Jul 26 '21 at 10:37
  • You can download the data from : https://worldmrio.com/unctadgvc/ . It is the third dataset available. I opened the .txt file with `read.delim` and `sep = "/t"` – nflore Jul 26 '21 at 10:45
  • I made a final attempt to solve your problem. Take a look at the edit at the bottom of my answer. – Martin Gal Jul 26 '21 at 13:16

1 Answers1

0

I believe there are better ways but you can use dplyr:

data %>% 
  slice(1:2) %>% 
  mutate(across(starts_with("country"), ~ paste0(.x, collapse = "_")),
         name = "name") %>% 
  slice(1) %>% 
  mutate(across(starts_with("country"), ~ paste0(cur_column(), "_", .x))) %>% 
  {`colnames<-`(data, unlist(.))} %>% 
  slice(3:n()) %>% 
  pivot_longer(starts_with("country"), 
               names_to = c("country", "code", "sector"),
               names_pattern="(country\\d+).+(code\\d+).+(sector\\d+)") %>%
  filter(name != country)

returns

# A tibble: 15 x 5
   name     country  code  sector  value
   <chr>    <chr>    <chr> <chr>   <chr>
 1 country1 country2 code2 sector1 ###  
 2 country1 country2 code2 sector2 ###  
 3 country1 country3 code3 sector1 ###  
 4 country2 country1 code1 sector1 ###  
 5 country2 country1 code1 sector2 ###  
 6 country2 country3 code3 sector1 ###  
 7 country3 country1 code1 sector1 ###  
 8 country3 country1 code1 sector2 ###  
 9 country3 country2 code2 sector1 ###  
10 country3 country2 code2 sector2 ###  
11 country4 country1 code1 sector1 ###  
12 country4 country1 code1 sector2 ###  
13 country4 country2 code2 sector1 ###  
14 country4 country2 code2 sector2 ###  
15 country4 country3 code3 sector1 ###  

Note: I filtered for name != country since in your expected output those rows with matching aren't present.

Edit

One last try:

data %>% 
  slice(1:2) %>% 
  rename(name = X) %>% 
  mutate(across(c(-name), ~ paste0(.x, collapse = "_")),
         name = "name") %>% 
  slice(1) %>% 
  mutate(across(c(-name), ~ paste0(cur_column(), "_", .x))) %>% 
  {`colnames<-`(data, unlist(.))} %>% 
  slice(3:n()) %>% 
  pivot_longer(c(-name), 
               names_to = c("country", "code", "sector"),
               names_pattern="(country\\d+).+(code\\d+).+(sector\\d+)") %>%
  filter(name != country)
Martin Gal
  • 16,640
  • 5
  • 21
  • 39
  • Thanks. Unfortunately the dataset I posted is just a stylized one. The original one has actual country names as labels ("Afghanistan, Albania..), so I can't use `starts_with` to group labels. I posted the link to the original dataset in the comments to my question. – nflore Jul 26 '21 at 11:34
  • Replace `starts_with("country")` by `c(-name)`. – Martin Gal Jul 26 '21 at 11:45
  • In this case an error occurs: `Can't subset columns that don't exist` – nflore Jul 26 '21 at 12:13
  • Your first column is named "name" like in your example? – Martin Gal Jul 26 '21 at 12:19
  • It was named "X" and I tried with `c(-data$X)`. Now it is named "name" and I tried with`c(-data$name)`. Same result – nflore Jul 26 '21 at 12:23
  • You don't need `data$`, since this refers to the original data.frame without any transformations. Just use `X` and `name` – Martin Gal Jul 26 '21 at 12:27
  • Just a hint: Your sample data should represent your original data. You can show your data by using `dput(head(data))`. Usally people (including me) on SO don't want to click on external links. – Martin Gal Jul 26 '21 at 12:29
  • Again an error: `Error in names(x) <- value : 'names' attribute [14841] must be the same length as the vector [14840]` – nflore Jul 26 '21 at 12:37
  • I know, I am sorry. I tried with `dput` but the dataset has 193 observations of 14840 variables, so the output of that function was unreadable. – nflore Jul 26 '21 at 12:39