Sorting a collaborative database

Question

I'm working through a database with several repeated IDs entries and unique information per row as such

ID	country
A1	AT
A1	BE
A2	CZ
A3	US
A3	UK
A4	NZ

The same ID and different country implies there was collaboration between two different entities in this project.

How can I treat the dataset to tabulate collaborations? I would like to obtain

   AT BE CZ NZ US UK
AT    1
BE 1
CZ       
NZ             1  1
US           1    1  
UK           1 1


Preferably I could have the option to remove duplicates to get

   AT BE CZ NZ US UK
AT    1
BE 
CZ       
NZ             1  1
US           
UK

and somehow get a variable to produce 'summarize'.

Thank you in advance (I can't get access to Stata Forum due to a firewall)

I tried creating variables that identify multi-country IDs and another that reads the country when it is the second or third linked to the same ID

ID	country	var 2
A1	AT	1
A1	BE	2
A2	CZ
A3	US	1
A3	UK	2
A4	NZ	3

There are discrepancies between your example data in the table and in the example in the code snippet. But I still think I understand what you mean. My answer is based on my understanding. — TheIceBear, Mar 23 '23 at 13:37

TheIceBear · Accepted Answer · 2023-03-24T13:44:11.950

Here is a solution:

clear
input str2(id country)
"A1" "AT"
"A1" "BE"
"A2" "CZ"
"A3" "US"
"A3" "UK"
"A4" "NZ"
"A4" "US"
end

* Get locals with all country abbreviation and the number of countries
levelsof country, local(countries)
local num_countries : list sizeof countries

* Collapse each project to one row where the last row has all abbreviations 
* of countries in that project. Keep only the last row for each project
sort id country
replace country = country[_n-1] + " " + country if id[_n-1] == id
by id : keep if _n == _N

* Get a local with all combination of collaborations
levelsof country, local(combos)

* Loop over each country
foreach country of local countries {
    
    * For each country get a list of collaborations that country is part of
    foreach combo of local combos {
        if strpos("`combo'","`country'") {
            local `country'_colab "``country'_colab' `combo'"
        }        
    }
    
    *Remove duplicates from list
    local `country'_colab : list uniq `country'_colab
    *Remove the country itself from that country's list
    local `country'_colab = subinstr("``country'_colab'","`country'","",.)
    *Remove all excessive spaces for no other reason than neatness
    local `country'_colab = trim(itrim("``country'_colab'"))
    
    *Display all collaborations
    di "Country `country' collaborats with countries: ``country'_colab'"
}

* Clear the original data and create a new data set with one obs per country
clear
set obs `num_countries'

* Create one row and row column for each country. 
* Set the default to 0 for no collaboration
gen country = ""
local row = 1
foreach country of local countries {
    gen `country' = 0
    replace country = "`country'" if _n == `row++'
}

* Go over the list of collaborations for each country and set value to 1
* where the row and column country collaborates
foreach country of local countries {
    foreach colab of local `country'_colab {
        replace `colab' = 1 if country == "`country'"
    }
}

I have not cleared duplicates as you suggested as a preference as I think you get an ambivalent data set as it is not clear for which country you should drop a row and for which you should drop a column.

Sorting a collaborative database

1 Answers1