Data wrangling: How to merge two wide format datasets into one

Question

I have two wide format datasets. Both share a common index column and I would like to combine both datasets into a single wide format dataset based on this common column. A sample of the datasets is provided below.

Let dataset A be:

Wherein column 1 is a list of documents and the wide columns are themes found in those documents (1 if the document mentions the theme, 0 if otherwise)

And let dataset B be:

Wherein column 1 is the same list as in dataset A, and the other columns are countries. Values are a custom code for how "important" that document was for that particular country (e.g. 5 is very important, 1 is unimportant, 0 is no participation in the document).

I would like to combine both into a single wide dataset, wherein rows are countries and columns are the themes. The values inside the cells would equal the sum of occasions when a country engaged with a theme through a document, weighted by the "importance" coding on datset B.

A finished dataset would look like this:

For instance, AFG only participated in document A/C.3/64/L.6, with importance 5, and as this document only mentioned "Access to information", AFG's engagement with the topic is 5. AND, in turn, participated in all documents and so gained an engagement with topics "Abduction" (1*5=5), "Abortion" (1*1=1), "Academic degree" (1*1 + 1*2=3) and "Access to information" (1*4=4).

The problem is that the full datasets A and B have over 1k themes and 190 countries, respectively. So I need to find an automated way to do this merger. I would be grateful for advice on how to do this either on Excel or on R.

Many thanks

Please add data using `dput` and not as images. Show the expected output for the same. Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). — Ronak Shah, Apr 05 '20 at 02:50
Not clear how the data in dataset A can be attributed to countries. Hence I failed to tie back your result formula for AND, "Academic degree" (1*1 + 1*2=3), to the data. — Variatus, Apr 05 '20 at 03:58

score 1 · Answer 1 · answered Apr 05 '20 at 02:48

1

First I would suggest you look into what makes Tidy data, you should have a column named "country", instead of the rows being countries.

try using the "dplyr" package in the tidyverse...

inner_join('Dataset A','Dataset B", by = "column_name")

answered Apr 05 '20 at 02:48

Daniel_j_iii

3,041
2
11
27

thanks Daniel. I do not intend, however, to have the data in tidy format since I intend to use it for network analysis and correspondence analysis, and the tools for those types of analysis don't ingest tidy but wide. – Rafael Apr 05 '20 at 18:02
Good to know. Thank you for sharing – Daniel_j_iii Apr 05 '20 at 23:58

score 1 · Answer 2 · answered Apr 05 '20 at 04:47

Very simple to do in Excel, assuming you have a version with Power Query (basically any since 2010).

Assuming you have the datasets in the current file, formatted as Excel tables, click on Dataset one, go to Data => "Get & Transform Data" and choose "From Table/Range"

Once PQ loads up, select column 1 and got to Transform = > Unpivot Columns => Other Columns

Which will result in an unpivoted dataset1

Repeat the above 3 steps, starting with the dataset2 table selected, which will result in an unpivoted dataset2

Then select Dataset1 in PQ, go to Home => Merge Queries => Merge Queries as New. In the Merge Window, join the 2 datasets on Column1

Next step is to expand the DataSet2 column, unselect Column1 as it's not needed.This will give you an expanded table.

Select the Value and Dataset2.Value columns and go Add Column => Standard => Multiplication

Select the Attribute, Dataset2.Attribute and Multiplication columns and go to Home => Remove Columns => Remove Other Columns

Select the Attribute columns, then go to Home => Pivot Column. In the Pivot Columns Option window select Multiplication as the Values column and Sum as the Aggregate Value Function.

Which will give you the required output

Then lastly, go to the top left button in the PQ window and choose Close & Load => Close and Load To and choose to load as a table on a new sheet, which will give you your final output

The first part will vary a bit depending on 'where' the data is - ie in .csv files, separate Excel workbooks, in a databse etc, but that can be adjusted to suit. Also, you can change column names etc.

score 1 · Accepted Answer · answered Apr 05 '20 at 05:34

For your example data, which assumes that A and B have the same rows and in the same order, the following code gives you your desired output.

t(t(as.matrix(A[,-1])) %*% as.matrix(B[,-1]))

    abduction abortion academic access
AFG         0        0        0      5
AGO         0        4        4      0
ALB         0        3        3      1
AND         5        1        3      4

This uses matrix multiplication.

For your real data sets (1K themes and 190 countries), you will have to check that both contain the same number of rows and that they are in the same order based on the common index column. If not, then you would need to keep only those records that have the same and then sort them by that common index column. That is trivial.

Data

> A
  id abduction abortion academic access
1  A         1        0        0      0
2  B         0        1        1      0
3  C         0        0        1      0
4  D         0        0        0      1

> B
  id AFG AGO ALB AND
1  A   0   0   0   5
2  B   0   4   3   1
3  C   0   0   0   2
4  D   5   0   1   4

A <- structure(list(id = structure(1:4, .Label = c("A", "B", "C", 
"D"), class = "factor"), abduction = c(1, 0, 0, 0), abortion = c(0, 
1, 0, 0), academic = c(0, 1, 1, 0), access = c(0, 0, 0, 1)), row.names = c(NA, 
-4L), class = "data.frame")

B <- structure(list(id = structure(1:4, .Label = c("A", "B", "C", 
"D"), class = "factor"), AFG = c(0, 0, 0, 5), AGO = c(0, 4, 0, 
0), ALB = c(0, 3, 0, 1), AND = c(5, 1, 2, 4)), row.names = c(NA, 
-4L), class = "data.frame")

thanks Edward. Indeed matrix multiplication seems to be the most all-around solution. It would work in Excel and R. — Rafael, Apr 05 '20 at 18:09

Data wrangling: How to merge two wide format datasets into one

3 Answers3