Is there a simple way to extract the first N words from a local macro that is comma-separated or comma-and-space-separated in Stata?

Question

Given a local macro that contains a string of levels which are separated by either comma (",") or comma and space (", ") or even only space (" "), is there a simple way to extract the first N levels (or words) of this local macro?

The string would look like "12, 123, 1321, 41", or "12,123,1321,41" or "12 123 1321 41".

Basically, I would be happy with a version of the macro function word # of string that works more or less like word 1/N of string. (See "Macro functions for parsing" on p. 12 of Macro definition and manipulation.)

For more context, I am working with the output of levelsof with its local() and sep() options, so I can choose whichever separator is easiest to work with. I want to pass the resulting levels as arguments to the inlist() function. The following usually works, but inlist() only takes up to 250 arguments. That is why I would like to extract chunks of 250 words from the results of levelsof.

sysuse auto, clear
levelsof mpg if trunk > 20, local(levels) sep(", ")
list if inlist(mpg, `levels')

"solution" so far

I have figured out a non-simple way of achieving that, but it does not look good, and I am wondering if there is a simple, built-in way of doing the same.

sysuse auto, clear

levelsof mpg if trunk > 20, local(levels) sep(", ")
scalar number_of_words = 3
forvalues i = 1 (1) `=number_of_words' {
        local word_i = `i'
        local this_level : word `word_i' of `levels'
        local list_of_levels = "`list_of_levels'`this_level'" 
        
        di as text "loop: `i'"
        di as text "this level: `this_level'"
        di as text "list of levels so far: `list_of_levels'"
    }

di "`list_of_levels'"

// trim trailing comma
local trimmed_list_of_levels = substr( "`list_of_levels'" , 1 , strlen( "`list_of_levels'" )-1) 

di "`trimmed_list_of_levels'"
list make mpg price trunk if inlist(mpg, `trimmed_list_of_levels')

output

. sysuse auto, clear
(1978 Automobile Data)

. 
. levelsof mpg if trunk > 20, local(levels) sep(", ")
12, 15, 17, 18

. scalar number_of_words = 3

. forvalues i = 1 (1) `=number_of_words' {
  2.         local word_i = `i'
  3.         local this_level : word `word_i' of `levels'
  4.         local list_of_levels = "`list_of_levels'`this_level'" 
  5.         
.         di as text "loop: `i'"
  6.         di as text "this level: `this_level'"
  7.         di as text "list of levels so far: `list_of_levels'"
  8.     }
loop: 1
this level: 12,
list of levels so far: 12,
loop: 2
this level: 15,
list of levels so far: 12,15,
loop: 3
this level: 17,
list of levels so far: 12,15,17,

. 
. di "`list_of_levels'"
12,15,17,

. 
. // trim trailing comma
. local trimmed_list_of_levels = substr( "`list_of_levels'" , 1 , strlen( "`list_of_levels'" )-1) 

. 
. di "`trimmed_list_of_levels'"
12,15,17

. list make mpg price trunk if inlist(mpg, `trimmed_list_of_levels')

     +------------------------------------------+
     | make                mpg    price   trunk |
     |------------------------------------------|
  2. | AMC Pacer            17    4,749      11 |
  5. | Buick Electra        15    7,827      20 |
 23. | Dodge St. Regis      17    6,342      21 |
 26. | Linc. Continental    12   11,497      22 |
 27. | Linc. Mark V         12   13,594      18 |
     |------------------------------------------|
 31. | Merc. Marquis        15    6,165      23 |
 53. | Audi 5000            17    9,690      15 |
 74. | Volvo 260            17   11,995      14 |
     +------------------------------------------+

edits relating to comments.

edit 01)

For example, the following does not work. It returns error 130, "expression too long".

clear 

set obs 1000
gen id = _n 
gen x1 = rnormal()

sum * 
levelsof id if x1>0, local(levels) sep(", ")
sum * if inlist(id, `levels')

example where this construction (levelsof + inlist) seems to be necessary

clear 

set obs 5000
gen id = round(_n/5)
gen x1 = rnormal()

sum * 
levelsof id if x1>2, local(levels) sep(", ")
sum * if x1>2 // if threshold is small enough, there will be too many values for inlist()
sum * if inlist(id, `levels')
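
The chunking idea described above could be sketched roughly as follows. This is only a sketch under stated assumptions: levelsof is run with its default space separator, the chunk size of 200 is an arbitrary value chosen to stay under the limit of inlist(), and the names chunk, args, k and in_levels are illustrative and not part of the original code.

```stata
clear
set seed 2021
set obs 5000
gen id = round(_n/5)
gen x1 = rnormal()

levelsof id if x1 > 1, local(levels)      // default separator: a space
local nlev : word count `levels'
local chunk 200                           // illustrative chunk size

gen byte in_levels = 0                    // flag: id is one of the levels
local args                                // arguments collected for the current chunk
local k 0                                 // number of levels in the current chunk

forvalues i = 1/`nlev' {
    // append ", level" so that args can be pasted directly after "id"
    local args "`args', `: word `i' of `levels''"
    local ++k
    // flush the chunk when it is full or when the last level is reached
    if `k' == `chunk' | `i' == `nlev' {
        quietly replace in_levels = 1 if inlist(id`args')
        local args
        local k 0
    }
}

sum * if in_levels
```

Each call to inlist() then receives at most 201 arguments (id plus 200 levels), and the flag accumulates the matches across chunks.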

You could use the `word` command in a loop, where `word` will return the nth word in a list (see `help word`). Can you elaborate on what your ultimate plan is for the `inlist`? There might be an easier way to do it. – JR96 Mar 24 '21 at 16:27
Thank you for the quick reply. See edit 1 in the question. If there are more than (around) 250 levels, the function `inlist` returns `error 130` (`expression too long`). When there are fewer than that, it works fine. The `word` function only retrieves one word from a string. I was wondering if there is a built-in for retrieving the first `n` word**s** that would avoid using the loop I wrote in the question. The loop does work, though, if there is no built-in or easier way. – Marcelo Avila Mar 24 '21 at 16:43
Quick answer: Push the local into Mata and use the result of `tokens()`. – Nick Cox Mar 24 '21 at 16:46
Nick's answer is best for your ask. From my point of view it's hard to see the ultimate goal that would necessitate this over `sum * if x1>0` in relation to your edit 1. – JR96 Mar 24 '21 at 16:51
True, I will add an example where it makes a difference to go this route. Imagine I have multiple observations of the same `id`, not necessarily with the same `x1`. If I want to extract all observations (earlier and later) of those `id`s that at least once had an `x1` bigger than a threshold, then I would need to, as far as I am aware, resort to this `levelsof` + `inlist` construction. – Marcelo Avila Mar 24 '21 at 17:09
Yup, I understand your distinction now. I will put some example code for how to vectorize this below. – JR96 Mar 24 '21 at 19:52
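
The Mata route suggested in the comments could look roughly like the sketch below. It assumes levelsof is run with its default space separator, so that `tokens()` splits on blanks, and that at least one level is returned. The macro names `n` and `first_n` and the Mata variables `words` and `k` are illustrative only.

```stata
sysuse auto, clear
levelsof mpg if trunk > 20, local(levels)   // default separator: a space
local n 3                                   // illustrative: number of words wanted

mata:
// split the macro contents into a string row vector, one element per word
words = tokens(st_local("levels"))
// never ask for more words than exist
k = min((strtoreal(st_local("n")), cols(words)))
// join the first k words with ", " and hand the result back to Stata
st_local("first_n", invtokens(words[|1 \ k|], ", "))
end

display "`first_n'"
list make mpg price trunk if inlist(mpg, `first_n')
```

The range subscript `words[|1 \ k|]` picks elements 1 through k, so replacing the lower bound would give a later chunk of the list.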

JR96 · Accepted Answer · 2021-03-24T20:02:22.257

Using your additional example as a basis, you could use egen's max() function to create a flag that equals 1 for every observation of an id that has at least one observation where x1 is above a certain threshold. For example:

clear 
set seed 2021
set obs 5000
gen id = round(_n/5)
gen x1 = rnormal()

sum * 
levelsof id if x1>2, local(levels) sep(", ")
sum * if x1>2 // if threshold is small enough, there will be too many values for inlist()
sum * if inlist(id, `levels')

//This will do the same thing
gen over_threshold = x1>2 
egen id_over_thresh = max(over_threshold), by(id)

sum * if id_over_thresh
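
One caveat: Stata treats numeric missing values as larger than any number, so x1>2 is true when x1 is missing. A minimal sketch that guards against this is shown below. The missing values are introduced artificially for illustration and are not part of the original example.

```stata
clear
set seed 2021
set obs 5000
gen id = round(_n/5)
gen x1 = rnormal()
replace x1 = . in 1/10      // hypothetical missing values, for illustration only

// x1 > 2 alone would also be true where x1 is missing, so exclude those cases
egen id_over_thresh = max(x1 > 2 & !missing(x1)), by(id)

sum * if id_over_thresh
```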

edited Mar 24 '21 at 20:02

answered Mar 24 '21 at 19:56

JR96


For future uses you could shorten it to `egen id_over_thresh = max(x1>2), by(id)` – JR96 Mar 24 '21 at 20:09
Very interesting idea! I will see if this applies to my actual use cases, which are a little more complex, with if conditions based on multiple variables (including numerical and categorical data). In any case, while this might even solve my problem, it does not address the question directly. – Marcelo Avila Mar 24 '21 at 22:58
Watch out that here `x1` will be >2 if it is missing. – Nick Cox Mar 24 '21 at 23:21
Yes good point Nick on the missings. I agree that this does not directly answer your question as it sits, the push to `mata` is the best shortcut for that I can think of. The answer above should generalize quite well regardless of the complexity of the conditionals or any number of identifying variables in the `by` group. – JR96 Mar 24 '21 at 23:25
Thanks JR96. It does work with my actual data and your solution is very flexible. Quite simple but elegant solution to my problem... – Marcelo Avila Mar 31 '21 at 17:02
Great @MarceloAvila, happy it worked out for you! – JR96 Mar 31 '21 at 17:59