In Google Analytics, I am able to get a list of all the terms users search for on the site. For a large site over the course of several weeks, this could be upwards of 10,000 terms. I want to create a report that categorizes the types of terms that users searched for, but categorizing 10,000 terms by hand is not feasible in a reasonable timeframe. So my instinct was to draw a sample and report on that sample.

I want to make sure I am using the right formula to generate a margin of error for the sample and that I am properly reporting it.

What I want to do is pull a random sample of the terms used, put those terms into a spreadsheet, and code each one by hand into a category (products, personnel, jobs). In the end, each category will account for some percentage of the sampled terms.

For a 95% confidence, I was going to use:

Margin of error = (1.96 * 0.5) / sqrt((population_total_count - 1) * sample_search_total_count / (population_total_count - sample_search_total_count))

population_total_count would be the total count of search in the population (the full list) and sample_search_total_count would be the number of searches in a random sample I pull.
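
For concreteness, here is a minimal Python sketch of the formula above, written in the equivalent form z * sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1)) with the worst-case p = 0.5. The function name `margin_of_error` and the example sizes (10,000 searches, a sample of 1,000) are illustrative assumptions, not figures from my data.

```python
import math


def margin_of_error(population_size, sample_size, z=1.96, p=0.5):
    """Margin of error for a sample proportion, with the finite
    population correction. p=0.5 is the worst case (widest margin)."""
    if population_size < 2:
        raise ValueError("population_size must be at least 2")
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    # Standard error of a proportion under simple random sampling.
    standard_error = math.sqrt(p * (1 - p) / sample_size)
    # Finite population correction: shrinks the error as the sample
    # approaches the size of the whole population.
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    moe = margin_of_error(population_size=10_000, sample_size=1_000)
    print(f"Margin of error: {moe:.1%}")
```

With these made-up sizes the margin comes out just under 3%. Sampling the whole population gives a margin of zero, which is a quick sanity check on the correction term.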

If 25% of my sample was "products", and I had a margin of error of 3%, I would report that as "We expect 25% of searches were for products, plus or minus 3 percentage points, at 95% confidence." I would use the same "plus or minus 3 percentage points at 95% confidence" for each of the other categories in the same sample.

Am I using the right formula and discussing this correctly? Am I correct in using the same +/- Margin of Error for each of the categories?

JAB

1 Answer

From the "1.96", I can tell you're relying on a normal approximation, which isn't necessary (and will be too crude an approximation for small datasets).

You should instead use one of the following three approaches:

  1. A Dirichlet-multinomial model, if the data can be modelled as being generated all from one similar process (i.e. you assume all users' search behaviour is similar), or you are happy to treat them as such.

  2. A mixture of Dirichlet distributions, if you know, or suspect, that there are two or several types of data (e.g. a group of children and a group of adults who are entering the search terms, and you don’t know who is whom).

  3. A confidence interval for multinomial proportions, if you are in a hurry and seek an off-the-shelf frequentist technique. An example tool is the MultinomCI function in R. See, for example, Confidence Intervals for Multinomial Proportions in DescTools.
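
As a rough illustration of options 1 and 3, here is a minimal Python sketch using only the standard library. It assumes a uniform Dirichlet prior (`prior=1.0`) for the Bayesian version and Goodman's (1965) simultaneous intervals for the frequentist version, which is not necessarily the method MultinomCI uses by default. The function names and the category counts are hypothetical.

```python
import math
import random
from statistics import NormalDist


def goodman_intervals(counts, alpha=0.05):
    """Goodman simultaneous confidence intervals for multinomial proportions."""
    total = sum(counts.values())
    k = len(counts)
    # Chi-square quantile (1 d.f.) at 1 - alpha/k, obtained as the square
    # of the normal quantile at 1 - alpha/(2k).
    chi2 = NormalDist().inv_cdf(1 - alpha / (2 * k)) ** 2
    intervals = {}
    for category, count in counts.items():
        spread = math.sqrt(chi2 * (chi2 + 4 * count * (total - count) / total))
        centre = chi2 + 2 * count
        denom = 2 * (total + chi2)
        intervals[category] = ((centre - spread) / denom, (centre + spread) / denom)
    return intervals


def dirichlet_credible_intervals(counts, prior=1.0, level=0.95, draws=5000, seed=0):
    """Equal-tailed credible intervals from a Dirichlet posterior.

    Assumes a symmetric Dirichlet(prior) prior, so the posterior is
    Dirichlet(prior + count) and can be sampled by normalising gamma draws.
    """
    rng = random.Random(seed)
    categories = list(counts)
    samples = {c: [] for c in categories}
    for _ in range(draws):
        gammas = [rng.gammavariate(prior + counts[c], 1.0) for c in categories]
        norm = sum(gammas)
        for c, g in zip(categories, gammas):
            samples[c].append(g / norm)
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    result = {}
    for c in categories:
        ordered = sorted(samples[c])
        result[c] = (ordered[lo_idx], ordered[hi_idx])
    return result


if __name__ == "__main__":
    # Hypothetical hand-coded sample of 1,000 search terms.
    coded = {"products": 250, "personnel": 150, "jobs": 100, "other": 500}
    for name, (low, high) in goodman_intervals(coded).items():
        print(f"Goodman   {name}: {low:.3f} to {high:.3f}")
    for name, (low, high) in dirichlet_credible_intervals(coded).items():
        print(f"Dirichlet {name}: {low:.3f} to {high:.3f}")
```

Either way, each category gets its own interval, and the widths differ with the observed proportion rather than being one shared plus-or-minus figure.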

Reference for the above three methods: The Datatrie Advisor. Good luck!

Mark Ebden