
For a project, I need to replicate some results that currently exist in Stata output files (.dta) and were computed from an older Stata script. The new version of the project needs to be written in Python.

The specific part I am having difficulty with is matching quantile breakpoint calculations based on the weighted version of Stata's xtile command. Note that ties between data points are not the issue: the weights I am using come from a continuous quantity, so ties are extremely unlikely (and there are none in my test data set). So miscategorization due to ties is not the cause.

I have read the Wikipedia article on weighted percentiles and also this Cross-Validated post describing an alternate algorithm that should replicate R's type-7 quantiles.
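For reference, the Wikipedia method assigns a percentile rank to the i-th sorted data point as

```latex
p_i = \frac{C_i - \tfrac{1}{2} w_i}{C_N}, \qquad C_i = \sum_{k=1}^{i} w_k,
```

and then linearly interpolates between adjacent (p_i, x_i) pairs to invert the relationship at the requested breakpoints; this is what the type-1 code below computes.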

I've implemented both of the weighted algorithms (code at the bottom), but I'm still not matching very well against the computed quantiles in the Stata output.

Does anyone know the specific algorithm used by the Stata routine? The docs do not describe it clearly: they say something about taking a mean at flat portions of the CDF in order to invert it, but this hardly pins down the actual algorithm, and it is ambiguous about whether any other interpolation is done.

Note that numpy.percentile and scipy.stats.mstats.mquantiles do not accept weights and cannot perform weighted quantiles, just regular equal-weighted ones. The crux of my issue lies in the need to use weights.
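One cheap cross-check I can still do with the unweighted routines (assuming integer weights): a weighted quantile should agree with the unweighted quantile of the data with each value repeated weight-many times. Note this only validates methods that, like numpy's default, interpolate linearly:

```python
import numpy as np

# Sanity check for weighted-quantile code: with integer weights, the
# weighted median should match the unweighted percentile of the data
# expanded by np.repeat.
a = np.array([1.0, 5.0, 9.0])
w = np.array([3, 1, 2])
expanded = np.repeat(a, w)      # [1, 1, 1, 5, 9, 9]
med = np.percentile(expanded, 50)  # linear interpolation: 3.0
```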

Note: I've debugged both methods below quite a lot, but feel free to point out a bug in a comment if you see one. I have tested both methods on smaller data sets, and the results are good and also match R's output in the cases where I can guarantee which method R is using. The code is not elegant yet, and too much is duplicated between the two types, but all that will be fixed once I believe the output is what I need.

The problem is that I don't know the method Stata xtile uses, and I want to reduce mismatches between the code below and Stata xtile when run on the same data set.
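While iterating, I quantify the disagreement between my labels and the labels stored in the .dta output with a trivial helper (name and interface are my own, nothing Stata-specific):

```python
import numpy as np

def mismatch_rate(py_labels, stata_labels):
    """Fraction of observations assigned to different xtiles by the
    two implementations (simple debugging helper)."""
    py_labels = np.asarray(py_labels)
    stata_labels = np.asarray(stata_labels)
    return np.mean(py_labels != stata_labels)
```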

Algorithms that I've tried:

import numpy as np

def mark_weighted_percentiles(a, labels, weights, type):
    # a is an input array of values.
    # weights is an input array of weights, so weights[i] goes with a[i].
    # labels are the names you want to give to the xtiles.
    # type selects the weighted algorithm:
    #     1 for the Wikipedia method, 2 for the stackexchange post.
    #
    # The code outputs an array the same shape as 'a', but with
    # labels[i] inserted into spot j if a[j] falls in x-tile i.
    # The number of xtiles requested is inferred from the length of 'labels'.

    # First type: "vanilla" weights from the Wikipedia article.
    if type == 1:

        # Sort the values and apply the same sort to the weights.
        N = len(a)
        sort_indx = np.argsort(a)
        tmp_a = a[sort_indx].copy()
        tmp_weights = weights[sort_indx].copy()

        # 'labels' stores the names of the x-tiles the user wants, and it
        # is assumed to be linearly spaced between 0 and 1, so 5 labels
        # implies quintiles, for example.
        num_categories = len(labels)
        breaks = np.linspace(0, 1, num_categories + 1)

        # Compute the percentile value at each explicit data point in a.
        cu_weights = np.cumsum(tmp_weights)
        p_vals = (1.0 / cu_weights[-1]) * (cu_weights - 0.5 * tmp_weights)

        # Set up the output array.
        ret = np.repeat(0, len(a))
        if len(a) < num_categories:
            return ret

        # List of values at the breakpoints.
        quantiles = []

        # Find the two indices that bracket each breakpoint percentile,
        # then interpolate between the two a-values at those indices,
        # using interpolation weights built from the cumulative weights.
        for brk in breaks:
            if brk <= p_vals[0]:
                i_low = 0
                i_high = 0
            elif brk >= p_vals[-1]:
                i_low = N - 1
                i_high = N - 1
            else:
                for ii in range(N - 1):
                    if (p_vals[ii] <= brk) and (brk < p_vals[ii + 1]):
                        i_low = ii
                        i_high = ii + 1
                        break

            if i_low == i_high:
                v = tmp_a[i_low]
            else:
                # Two distinct brackets: apply the interpolation formula
                # as per the Wikipedia article.
                v = tmp_a[i_low] + ((brk - p_vals[i_low]) / (p_vals[i_high] - p_vals[i_low])) * (tmp_a[i_high] - tmp_a[i_low])

            # Append the result.
            quantiles.append(v)

        # Now that the weighted breakpoints are set, just categorize
        # the elements of a with logical indexing.
        for i in range(len(quantiles) - 1):
            lower = quantiles[i]
            upper = quantiles[i + 1]
            ret[np.logical_and(a >= lower, a < upper)] = labels[i]

        # Make sure the extreme values are marked.
        ret[a <= quantiles[0]] = labels[0]
        ret[a >= quantiles[-1]] = labels[-1]

        return ret

    # Second type: the stats.stackexchange suggestion.
    elif type == 2:

        N = len(a)
        sort_indx = np.argsort(a)
        tmp_a = a[sort_indx].copy()
        tmp_weights = weights[sort_indx].copy()

        num_categories = len(labels)
        breaks = np.linspace(0, 1, num_categories + 1)

        cu_weights = np.cumsum(tmp_weights)

        # Formula from the stats.stackexchange.com post.
        s_vals = [0.0]
        for ii in range(1, N):
            s_vals.append(ii * tmp_weights[ii] + (N - 1) * cu_weights[ii - 1])
        s_vals = np.asarray(s_vals)

        # Normalized s_vals for comparing with the breakpoints.
        norm_s_vals = (1.0 / s_vals[-1]) * s_vals

        # Set up the output array.
        ret = np.repeat(0, N)
        if N < num_categories:
            return ret

        # List of values at the breakpoints.
        quantiles = []

        # Find the two indices that bracket each breakpoint percentile,
        # then interpolate between the two a-values at those indices,
        # using interpolation weights built from the s_vals.
        for brk in breaks:
            if brk <= norm_s_vals[0]:
                i_low = 0
                i_high = 0
            elif brk >= norm_s_vals[-1]:
                i_low = N - 1
                i_high = N - 1
            else:
                for ii in range(N - 1):
                    if (norm_s_vals[ii] <= brk) and (brk < norm_s_vals[ii + 1]):
                        i_low = ii
                        i_high = ii + 1
                        break

            if i_low == i_high:
                v = tmp_a[i_low]
            else:
                # Interpolate as in the type-1 method, but using the
                # s_vals instead.
                v = tmp_a[i_low] + (((brk * s_vals[-1]) - s_vals[i_low]) / (s_vals[i_high] - s_vals[i_low])) * (tmp_a[i_high] - tmp_a[i_low])

            quantiles.append(v)

        # Categorize the elements of a as before.
        for i in range(len(quantiles) - 1):
            lower = quantiles[i]
            upper = quantiles[i + 1]
            ret[np.logical_and(a >= lower, a < upper)] = labels[i]

        # Make sure the extreme values are marked.
        ret[a <= quantiles[0]] = labels[0]
        ret[a >= quantiles[-1]] = labels[-1]

        return ret
ely
  • Another sub-question that arises is: if you are making xtiles on a particular column of data, but other columns of the data have some values missing in Stata's working memory, will Stata ignore the rows that have missing data when forming the breakpoints, or not? I have small evidence to suggest not, but cannot find confirmation. – ely Jul 23 '12 at 13:51
  • Note also the thread at http://stackoverflow.com/questions/11347539/getting-scipy-quantiles-to-match-stata-xtile-command – Nick Cox Jul 09 '13 at 08:04

2 Answers


Here's a screenshot of the formulas from the Stata 12 manuals (StataCorp. 2011. Stata Statistical Software: Release 12. College Station, TX: StataCorp LP, pp. 501-502). If this does not help, you might ask this question on Statalist or contact Philip Ryan (the author of the original code) directly.

[Screenshot of the xtile formulas from the Stata 12 Reference Manual]

dimitriy
  • Given that these manuals are for-pay, I'm concerned that this is some sort of content violation. I can't find this manual for free anywhere, so it's a good bet that reproducing it here is illegal. – ely Jul 25 '12 at 19:11
  • If you want to summarize the algorithm mentioned in the first part (where it always takes the x-value serving as an upper bound on the quantile bracket, unless its cumulative weight is exactly equal to the breakpoint, in which case it does a simple average), then I'll accept that answer. You can just reproduce the math from the first part, and ignore the altdef part. But probably the image itself isn't a good idea. – ely Jul 25 '12 at 19:14
  • Also, FWIW, this default algorithm is a pretty terrible quantile algorithm. Always breaking by using the upper percentile bracket is really bad, so it's no wonder I was seeing discrepancies for every different R-style quantile method I tried. – ely Jul 25 '12 at 19:16
  • I have no opinion on the quantile algorithm. I am reasonably certain this constitutes fair use as I am not giving you the entire manuals. This is analogous to citing a passage in a book. In any case, many academic institutions in the US are posting manuals online that can be found with a simple Google search (such as stata manual xtile type:pdf). They must have better lawyers. For example, the version 11 manual that contains the formulas for xtile is available here: www.agecon.ksu.edu/support/Stata11Manual/d.pdf. – dimitriy Jul 25 '12 at 20:49
  • Yes, given the existence of other such manuals online at the sites of institutions like Kansas State, I agree that it's fair use. Thank you for the pointer. I had done quite a lot of searching for xtile documentation prior to posting and could find nothing like this in the hours I devoted to looking. – ely Jul 25 '12 at 22:19
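For readers without access to the manual: based on the summary in the comments above (and not verified against Stata's source), the default rule can be sketched as taking the upper bracketing value, averaging only when a cumulative weight hits the breakpoint exactly. Function name and interface here are hypothetical:

```python
import numpy as np

def stata_xtile_breakpoints(x, w, nq):
    """Sketch of Stata's default xtile breakpoint rule, as summarized
    in the comments above (a reconstruction, not Stata's own code)."""
    order = np.argsort(x)
    xs = np.asarray(x, dtype=float)[order]
    ws = np.asarray(w, dtype=float)[order]
    W = np.cumsum(ws)
    total = W[-1]
    qs = []
    for k in range(1, nq):                 # interior breakpoints p = k/nq
        target = total * k / nq
        i = np.searchsorted(W, target)     # first index with W[i] >= target
        if np.isclose(W[i], target) and i + 1 < len(xs):
            qs.append(0.5 * (xs[i] + xs[i + 1]))  # exact hit: average
        else:
            qs.append(xs[i])                      # otherwise upper value
    return np.asarray(qs)
```

With equal weights on 1, 2, 3, 4, this rule gives quartile breakpoints 1.5, 2.5, 3.5, matching the "mean at flat portions of the CDF" behavior described in the docs.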

Did you know that you can just read Stata's code?

. ssc install adoedit
. adoedit xtile
Keith
  • I don't have a functioning Stata license. I just have the .dta output files. It may be possible for me to get the licensed version working, but for the moment, assume I am sans Stata. I am using the scikits.statsmodels function `genfromdta()` to import saved Stata output, and I can view the script that generated this data, but I can't do anything with it outside of another language (in my case, Python). – ely Jul 24 '12 at 01:55
  • Also, while this may technically work, I would be shocked if a historically successful, for-profit software product like Stata did not have an adequate mathematical description of its methods. It's not very reasonable to expect people to pay for a Stata license just to know the specific math algorithm that Stata is using... how do I know if I want to pay for it!? – ely Jul 24 '12 at 02:00
  • Compare with Matlab where you can almost instantly find the precise math description of the built-in functions, even if you don't own a Matlab license. I'm not really pro Matlab, but surely Stata can't be so bad as to require `adoedit` just to know what algorithm it is using? – ely Jul 24 '12 at 02:01
  • Did you not look in the Stata manuals???? I assumed you had! There's almost always good documentation there. The PDF version of the manuals are now built into Stata. http://www.stata-press.com/manuals/ – Keith Jul 25 '12 at 13:10
  • I think you're not understanding. I don't have Stata, so I can't read the built-in PDFs. I am trying to use the Stata documentation online and it is not clear enough. I appreciate your input, and if I ever do have a Stata license in the future, I will try to use their provided code and PDFs for help. However, what should I do in my current situation where I do not have Stata? – ely Jul 25 '12 at 13:58