How do I tabulate a variable in Stata to show all values that are in my sample, even if they're not yet in the dataset?

Question

I am trying to tabulate a variable for a report at work to show the completion rate by municipality of a survey that is currently in the field. A dataset is downloaded each week showing all the responses so far, and I need to generate a table that shows frequencies for all the municipalities in our sample, even if they are zero. Using the missing option doesn't work because the dataset the report is made from doesn't have any way of telling stata what the overall sample of municipalities is.

Is there a way to get it to show zero for values that aren't in the dataset, based on a list of values in the overall sample? I'm competent in R so if someone has a suggestion for how to do this in R that would be good too.

So far this is my code:

convert string variable to numeric variable with value labels

    encode municipality_name, gen(municipality)

export survey completion by municipality to .doc file

    asdoc tab municipality, mis

in R you set the vector or column as a factor, and give it a defined levels — StupidWolf, Jan 22 '20 at 21:46
If it is present in the level, but not in the vector or column, tabling it returns 0; for example, x = factor(c("a","b"),levels=c("a","b","c")) ; table(x) — StupidWolf, Jan 22 '20 at 21:47
I would suggest using `contract` to obtain frequencies for each municipality, then merge the resulting dataset with another dataset that contains the complete list of all possible municipalities. Keep all values of `_merge`. Set municipalities that were only present in the `using` dataset to zero. — Bicep, Jan 22 '20 at 21:51
Cybernike, using contract and merge seems like it will work, but I need to uniquely identify observations. I got the following error, probably because my dataset with the complete list only has one variable. Error: "variable municipality_name does not uniquely identify observations in the using data" — Reed Merrill, Jan 23 '20 at 02:08
The using dataset with the list of municipalities should contain only one row per municipality name. Have you checked that this is the case? — Bicep, Jan 23 '20 at 02:47

score 3 · Accepted Answer · answered Jan 23 '20 at 09:21

The community-contributed command fre from SSC allows labelled values that do not occur in the dataset to be tabulated as having zero frequency. Here is an example:

. sysuse auto, clear
(1978 Automobile Data)

. fre foreign

foreign -- Car type
----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
-------------------+--------------------------------------------
Valid   0 Domestic |         52      70.27      70.27      70.27
        1 Foreign  |         22      29.73      29.73     100.00
        Total      |         74     100.00     100.00           
----------------------------------------------------------------

. label def origin 42 "Extraterrestrial", add

. fre foreign

foreign -- Car type
----------------------------------------------------------------
                   |      Freq.    Percent      Valid       Cum.
-------------------+--------------------------------------------
Valid   0 Domestic |         52      70.27      70.27      70.27
        1 Foreign  |         22      29.73      29.73     100.00
        Total      |         74     100.00     100.00           
----------------------------------------------------------------

. fre foreign, includelabeled

foreign -- Car type
-------------------------------------------------------------------------
                            |      Freq.    Percent      Valid       Cum.
----------------------------+--------------------------------------------
Valid   0  Domestic         |         52      70.27      70.27      70.27
        1  Foreign          |         22      29.73      29.73     100.00
        42 Extraterrestrial |          0       0.00       0.00     100.00
        Total               |         74     100.00     100.00           
-------------------------------------------------------------------------

. ssc desc fre

--------------------------------------------------------------------------------------------------------
package fre from http://fmwww.bc.edu/repec/bocode/f
--------------------------------------------------------------------------------------------------------

TITLE
      'FRE': module to display one-way frequency table

DESCRIPTION/AUTHOR(S)

        fre displays, for each specified variable, a univariate
      frequency table containing counts, percent, and cumulative
      percent. Variables may be string or numeric. Labels, in full
      length, and values are printed. By default, fre only tabulates
      the smallest and largest 10 values (along with all missing
      values), but this can be changed. Furthermore,  values with zero
      observed frequency may be included in the  tables. The default
      for fre is to display the frequency  tables in the results
      window. Alternatively, the tables may be written to a file on
      disk, either tab-delimited or  LaTeX-formatted.

      KW: data management
      KW: frequencies
      KW: frequency table
      KW: tabulation

      Requires: Stata version 9.2

      Distribution-Date: 20150603

      Author: Ben Jann, University of Bern
      Support: email jann@soz.unibe.ch


INSTALLATION FILES                                  (type net install fre)
      fre.ado
      fre.hlp

ANCILLARY FILES                                     (type net get fre)
      fre.zip
--------------------------------------------------------------------------------------------------------
(type ssc install fre to install)

Hey, thanks Nick Cox. This might be just what I was looking for. The list of value labels isn't super small (about 400) so if I enter them then is there a way for them to be stored between different iterations of the report I'm doing? — Reed Merrill, Jan 25 '20 at 18:52
I don’t understand the new question. Whatever value labels have been defined are part of the dataset. That’s how fre can access them. There are other ways to access them too, through Mata for example. — Nick Cox, Jan 25 '20 at 19:52
Oh, I see. Thanks! What is "origin" in your label def command? I couldn't find a variable called origin in the "auto" example dataset, but I don't see how else you could specify which variable to add labels to unless that command line had the variable name in it somewhere. I know I'm missing something, sorry about this. — Reed Merrill, Jan 26 '20 at 02:31
No; `origin` is the name of the value labels. In essence you must have defined value labels for your question to make sense. So, either you know what name was used for the value labels, or you look it up. `describe foreign` is one way: there are others. I needed a quick reproducible example in which there was a value label not matched by data. In your case, that state of affairs is what you have already. Clearly, you didn't use a reproducible data example, so I had to make one up. — Nick Cox, Jan 26 '20 at 08:36
See `help label` to learn that the label name does not have to be the same as the variable name. — Nick Cox, Jan 26 '20 at 09:18
Perhaps it is not obvious to you that examples starting with a `sysuse` or `webuse` are examples that you can run yourself in Stata. In general, once a command is mentioned, you are expected to be able and willing to access the `help` and/or the manual entry. — Nick Cox, Jan 26 '20 at 11:07
Right, thanks. That makes more sense now. And thanks for clarifying some Stack Overflow norms at the same time. — Reed Merrill, Jan 26 '20 at 21:00
@ReedMerrill Looks like Dr. Cox answered all your questions. If that is the case, please select his response as the answer using the check mark on the left. — dimitriy, Jan 27 '20 at 01:34

How do I tabulate a variable in Stata to show all values that are in my sample, even if they're not yet in the dataset?

1 Answers1