0

I was doing all this in Python, but I'm leaving the company and I want to leave behind a process that someone even less tech-y than I can manage.

I've written instructions on how to consolidate several hundred Excel worksheets using Power Query (Get & Transform Data) in Excel. I then have them 'close & load' into an Excel worksheet...the data consists of about 9000 rows with >20 columns. One column, called 'Name', = colleague name (there's about 20 unique names).

Now I need to get a random 10% of each colleague's rows. (In Python I used groupby & sample, here.) I can do this in the Excel worksheet by adding a 'helper row' called 'Rand No' that contains the formula =RAND(), and then using this array formula (Control+Shift+Enter): =IF([@[Rand No]]>=LARGE((--([@Name]=[Name])*[Rand No]),COUNTIF([Name],[@Name])/10),"Randomised",""). But I was just wondering how you could do this randomisation in PowerQuery, since we already had the PQ editor open when we consolidated all those sheets. I'm guessing it would be more complicated than just doing it with an ordinary Excel formula--but am I wrong?

P E
  • 165
  • 11
  • What have you tried? I'd think `Group by Name` => `add a column with random numbers to each subgroup` => `sort by that column` => `Select the first 10% of the rows` should do it. – Ron Rosenfeld Nov 09 '21 at 12:30

1 Answers1

0

Here is a slightly different approach.

  • I create a List of Random Numbers of the same count as the number of table rows
  • Add that List as a column to the original table
  • Then group by Name, sort each subgroup by the random column, and remove all but N% of the rows.

This will usually provide different output for each time the query is refreshed.

In my query, I generated the data, as a list of Names and List of Values, but you can easily read the data from any source.

M Code
edited per @horseyride suggestion

let

//read in the table.
// I created a table, but you can obtain it from any source
    Source = Table.FromColumns({
                List.Repeat({"A".."Z"},10), List.Numbers(0,260)},
                type table[Name=text,Value=Int64.Type]),

//Add buffered random number column to the table
    random = List.Buffer(List.Random(Table.RowCount(Source))),
    newTbl = Table.FromColumns(
            Table.ToColumns(Source) & {random},
            type table [Name=text, Value=Int64.Type, Random=number]
    ),

//Group by Name, then extract the top N by sorting and removing (1-N) rows
    N=0.1,

    grp = Table.Group(newTbl,"Name",{{
        "group", each Table.RemoveRows(Table.Sort(_,{"Random",Order.Descending}),0, 
           Number.RoundTowardZero(Table.RowCount(_)*(1-N)))
    }}),

//Expand the Value Column
//  You may need to expand more columns -- just don't check Name or Random
    #"Expanded group" = Table.ExpandTableColumn(grp, "group", {"Value"}, {"Value"})
in
    #"Expanded group"
Ron Rosenfeld
  • 53,870
  • 7
  • 28
  • 60
  • Does Table.RowCount(_)*(1-N) work for you when the number of rows is not an even multiple? (87 rows with n=.10, would be 8.7 rows to remove) Thats why I was attempting to use Number.IntegerDivide() on table rowcount – horseyride Nov 10 '21 at 21:00
  • @horseyride Thank you for pointing that out. I corrected the problem by rounding towards zero. – Ron Rosenfeld Nov 10 '21 at 21:22
  • Ha! all these years and I never noticed that function! – horseyride Nov 10 '21 at 21:24
  • This is a great suggestion. To have a list of values, sort it in a random way, and then pick the TopN values. One suggestion. you can actually sort a list randomly without adding a separate column to sort the list by. You can use the optional second parameter that takes a function. You would then use List.Sort( { Yourlist}, each Number.Random() ). You can find some more angles on this in this article: https://gorilla.bi/power-query/generate-random-numbers/ – Rick de Groot - gorilla.bi Feb 11 '23 at 13:40