Grouping observations with same ID

Question

I've taken a project in which the dataset separates the patient information and the disease observation, but the two observations have the same ID. Example:

ID: 12345 Patient age: 23, patient weight: 55, patient height: 180

ID: 12345 Disease progression: A, disease type: abc, Disease risk: 50

This goes on for every patient.

Now I want to run some statistics on this and for that I, as far as I know, have to consolidate this information in ONE observation, so we have patient characteristics and their disease characteristics in one observation. What would be the best way to do this? Bear in mind the two observations have mutually exclusive variables, so they can simply be "grouped" somehow.

* Example generated by -dataex-. To install: ssc install dataex
clear
input str32 record_id byte(treatment gender pcos aneurysmal_finding sz_anu_a_basilaris sz_anu_a_basilaris_2 sz_anu_a_basilaris_4)
"8a36ac06e58a541430cd8b31df3aeef2" . 1 0 2 . . .
"8a36ac06e58a541430cd8b31df3aeef2" . . . . . . .
"2afc1f12901992a1f973cde814615349" . 1 0 2 . . .
"2afc1f12901992a1f973cde814615349" . . . . . . .
"1e00442745c25082a64197b96065f755" . 1 0 2 . . .
"1e00442745c25082a64197b96065f755" . . . . . . .
"c90aef04e29f38fc3e21b919d5106ce8" . 1 0 2 . . .
"c90aef04e29f38fc3e21b919d5106ce8" . . . . . . .
"7cac71f3d31c7e9ec26e6a885ad554ab" . 2 0 2 . . .
"7cac71f3d31c7e9ec26e6a885ad554ab" . . . . . . .
"53c1f08aff25ace9afc46aca3263e7ca" . 1 0 2 . . .
"53c1f08aff25ace9afc46aca3263e7ca" . . . . . . .
"cdbf4328e0724f30950e437bc6bbe262" . 2 0 2 . . .
"cdbf4328e0724f30950e437bc6bbe262" . . . . . . .
"50d722dca92aee72c39c846066850a22" 1 2 0 2 . . .
"50d722dca92aee72c39c846066850a22" . . . . . . .
"ffe78f8927a81a5521f098aa077a755f" . 1 0 1 . . .
"ffe78f8927a81a5521f098aa077a755f" . . . . . . .
"aa2309be5c9b76012462fce3f43a8249" . 1 0 1 . . .
"aa2309be5c9b76012462fce3f43a8249" . . . . . . .
"4917b3d300e195b895e573474be6ccb6" . 1 0 2 . . .
"4917b3d300e195b895e573474be6ccb6" . . . . . . .
"b88557884343831060297ff4b67aeb36" . 1 . 2 . . .
"b88557884343831060297ff4b67aeb36" . . . . . . .
"ebe8ab86719aa71b68d7f0df3e451ce5" . . . 2 . . .
"8dd5267472002c796ce621984f9024ed" . . . . . . 3
"0b3e110c9765e14a5c41fadcc3cfc300" . . . . . . .
"8f58545ef8d37f290d26881743137a72" 1 2 0 2 . . .
"8f58545ef8d37f290d26881743137a72" . . . . . . .
"dcb6a27d1d4f5f1228860a76fa29e5ba" . 1 0 2 . . .
"dcb6a27d1d4f5f1228860a76fa29e5ba" . . . . . . .
"baedce78f2e736fe4d54dbdbe0460694" . 2 0 2 . . .
"baedce78f2e736fe4d54dbdbe0460694" . . . . . . .
"bb1db3b0eca9652cff3c76060b06d60b" 1 2 0 2 . . .
"bb1db3b0eca9652cff3c76060b06d60b" . . . . . . .
"6741bd218feba9de630dfe409a4e50ee" 1 2 0 2 . . .
"6741bd218feba9de630dfe409a4e50ee" . . . . . . .
"1e1425d670466e1a2c6c752d9227df17" . 2 0 2 . . .
"1e1425d670466e1a2c6c752d9227df17" . . . . . . .
"4c6672a06addc8e01842d2741be1857d" . 1 0 2 . . .
"4c6672a06addc8e01842d2741be1857d" . . . . . . .
"f1be80fbb7e4e1f5582780e25bfc8a2c" . 2 0 2 . . .
"f1be80fbb7e4e1f5582780e25bfc8a2c" . . . . . . .
"9991ec586e5f510e161fcad93fb1d79f" . 1 0 2 . . .
"9991ec586e5f510e161fcad93fb1d79f" . . . . . . .
"5c1eb56eccf9cf67ae6065f82b6eb6ce" . 1 0 2 . . .
"5c1eb56eccf9cf67ae6065f82b6eb6ce" . . . . . . .
"f9d10d2eb1951fa2ebc8b0509bb25593" . 1 0 2 . . .
"f9d10d2eb1951fa2ebc8b0509bb25593" . . . . . . .
"fbdf663512805caffe7a99d14fc9561f" . 2 0 2 . . .
"fbdf663512805caffe7a99d14fc9561f" . . . . . . .
"3b55aebe1b4b22e0c77168acc4b775dd" . 1 0 2 . . .
"3b55aebe1b4b22e0c77168acc4b775dd" . . . . . . .
"5f28194ddef4f9d057db2e4fcb7b5cf0" . 1 0 2 . . .
"5f28194ddef4f9d057db2e4fcb7b5cf0" . . . . . . .
"0b8d8253a8415275dbc2619e039985bb" . 1 0 2 5 . .
"0b8d8253a8415275dbc2619e039985bb" . . . . . . .
"4fb152c8524750b65b6717282cceb805" . 1 0 2 . . .
"4fb152c8524750b65b6717282cceb805" . . . . . . .
"ff5136e64c2110c355debca6acb74a13" . 1 0 2 . . .
"ff5136e64c2110c355debca6acb74a13" . . . . . . .
"29534fe6f18b75090b9d18f853ed7ec1" . 1 0 2 5 5 .
"29534fe6f18b75090b9d18f853ed7ec1" . . . . . . .
"8c334d2225db0661b25cf5f2c65fbcb9" . 1 0 2 . . .
"8c334d2225db0661b25cf5f2c65fbcb9" . . . . . . .
"68cf4b9f2db11cb9cf44fd0e03c53f16" . 2 . 2 . . .
"68cf4b9f2db11cb9cf44fd0e03c53f16" . . . . . . .
"6a44e65e7b1f33a3603acf2532bb40f9" . 1 0 2 . . .
"6a44e65e7b1f33a3603acf2532bb40f9" . . . . . . .
"2ed013748bf88df47c39d83bd48d8040" . 1 0 2 . . .
"2ed013748bf88df47c39d83bd48d8040" . . . . . . .
"c2f32f5b61b97d658f7b042b49b8da96" . 1 0 2 . . .
"c2f32f5b61b97d658f7b042b49b8da96" . . . . . . .
"58e1e0b5c29dee7d3739ec582d62b84c" . . . . . . .
"58e1e0b5c29dee7d3739ec582d62b84c" . . . . . . .
"8635b098d70b200fe8eef5dbf7c1c156" . 2 0 2 . . .
"8635b098d70b200fe8eef5dbf7c1c156" . . . . . . .
"266f1f1517fb50bafca92fff39c259d5" . 1 0 2 . . .
"266f1f1517fb50bafca92fff39c259d5" . . . . . . .
"d3df754a7322c02ed89f1208977a19ae" 1 2 0 2 5 . .
"d3df754a7322c02ed89f1208977a19ae" . . . . . . .
"46598c5d2da10731582d6342944e9337" . 1 0 2 . . .
"46598c5d2da10731582d6342944e9337" . . . . . . .
"8c2c5aa9b02eb1092b34cf38c2b1c83d" . 1 0 2 2 . .
"8c2c5aa9b02eb1092b34cf38c2b1c83d" . . . . . . .
"797c9cf7caf53f514f0154f34895fa80" . 1 0 2 . . .
"797c9cf7caf53f514f0154f34895fa80" . . . . . . .
"9b28a68095c520edcb56bee8aa5737b6" . 1 0 2 . . .
"9b28a68095c520edcb56bee8aa5737b6" . . . . . . .
"09e03748da35e9d799dc5d8ddf1909b5" . 1 0 2 . . .
"09e03748da35e9d799dc5d8ddf1909b5" . . . . . . .
"75d5574d8804d24932e3d0d9cbfa4b11" . 1 0 2 . . .
"75d5574d8804d24932e3d0d9cbfa4b11" . . . . . . .
"b5bda504efd4bd3b3be68513ccbf99ef" . 1 0 2 . . .
"b5bda504efd4bd3b3be68513ccbf99ef" . . . . . . .
"dc289c2a5a31355521dde31c4abd4c83" . 1 0 2 . . .
"dc289c2a5a31355521dde31c4abd4c83" . . . . . . .
"76ce83dbd64f05556e903deb54959d22" . 1 0 2 . . .
"76ce83dbd64f05556e903deb54959d22" . . . . . . .
"830ee6dd656938201f4a712607739768" . 1 0 2 . . .
end
label values treatment treatment_
label def treatment_ 1 "I blodfortyndende behandling", modify
label values gender gender_
label def gender_ 1 "Kvinde", modify
label def gender_ 2 "Mand", modify
label values pcos pcos_
label def pcos_ 0 "Nej", modify
label values aneurysmal_finding aneurysmal_finding_
label def aneurysmal_finding_ 1 "Screening", modify
label def aneurysmal_finding_ 2 "Tilfældig fund", modify
label values sz_anu_a_basilaris sz_anu_a_basilaris_
label def sz_anu_a_basilaris_ 2 "7-12 mm", modify
label def sz_anu_a_basilaris_ 5 "Uoplyst", modify
label values sz_anu_a_basilaris_2 sz_anu_a_basilaris_2_
label def sz_anu_a_basilaris_2_ 5 "Uoplyst", modify
label values sz_anu_a_basilaris_4 sz_anu_a_basilaris_4_
label def sz_anu_a_basilaris_4_ 3 "13-25 mm", modify

Essentially my problem could be solved using (assuming this syntax was correct):

gen obs1 = .
replace every variableValue in obs1 == variableValue in 12345

And then just iterating it for 1000s of observations..

Please give a data example using `dataex`. For explanation see the Stata tag wiki or https://www.statalist.org/forums/help#stata 12.2. — Nick Cox, Jul 21 '20 at 10:07
I tried using that, I haven't done so before, is this correct? See edit — user3078100, Jul 21 '20 at 10:14
Close enough. As the `help` for `dataex` explains the mark-up `[CODE]` and `[/CODE]` is for Statalist and needs to be ignored elsewhere, especially in Stata. Thanks. — Nick Cox, Jul 21 '20 at 10:16
Sometimes extra string variables should just be added to the `by()` option. but it depends.... — Nick Cox, Jul 21 '20 at 11:36

score 1 · Accepted Answer · answered Jul 21 '20 at 10:23

1

On the assumption that each identifier is represented by at most 2 observations, and that each other variable is numeric and represented by a non-missing value in at most 1 observation, you can just collapse. The default collapse by means should work fine.

ds record_id, not
collapse `r(varlist)' , by(record_id)

A more cautious approach would be to check the assumptions first!

answered Jul 21 '20 at 10:23

Nick Cox

35,529
6
31
47

Unfortunately, some identifiers are represented by several observations, varies between 1 and 5. The dataset was created such that the patient was grouped in 1 observation and then his or her aneurysms, depending on how many said patient had, was grouped into separate exclusive observations (depending on the nature of the aneurysm), however with the same ID luckily. Can you think of an approach that could still solve this problem? – user3078100 Jul 21 '20 at 10:39
Let me know and I'll edit out sensitive information and post an example of the dataset for you. – user3078100 Jul 21 '20 at 10:42
Your comment contradicts both your question and your data example, which both imply at most 2 observations per identifier. On the face of it, the `collapse` solution will still work fine so long as values in different observations don't contradict each other. – Nick Cox Jul 21 '20 at 11:03
Some variables are strings (as is the identifier) is there a way to bypass it or will I have to manually remove every string variable or replace it? – user3078100 Jul 21 '20 at 11:26
Useful extra help can't be given in comments. You need to make your data example more realistic. Alternatively, a large complicated dataset may require much more effort than can be expected from a thread on Stack Overflow, or indeed anywhere else. – Nick Cox Jul 21 '20 at 11:27
1

Thanks Nick, your solution does work. I just need to clean it up a bit more for it to fit my data perfectly. – user3078100 Jul 21 '20 at 11:31

score 0 · Answer 2 · answered Jul 21 '20 at 10:11

0

Assuming you have loaded both datasets. First dataset is called dataset1. Second dataset is called dataset2. And your ID attribute is called id

Then you can call merge them on id.

use dataset1, clear
merge id using dataset2

answered Jul 21 '20 at 10:11

tomanizer

851
6
16

There is only one dataset. I need to merge observations, not datasets. – user3078100 Jul 21 '20 at 10:15
1

It's a problem for `collapse` not `merge`. I will post an answer shortly. – Nick Cox Jul 21 '20 at 10:17

Grouping observations with same ID

2 Answers2