I am starting a customer lifetime project at work and want to share how the data looks with the business, so that we can identify the important variables together. I plan to do this using the excellent rpivotTable package, launched in a shiny app, to see where there are basic differences between groups and select my features from there.
This would mean taking my customer base of 4 million customers and slicing and dicing it in a number of ways.

However, to comply with GDPR we need to ensure that no group with fewer than 7 customers in it is ever shown. Therefore I need some kind of background calculation that suppresses any group of fewer than 7 customers.

Thinking about this logically, the only way I can see it working is: the user makes a change to the pivot table, then presses some form of submit button so that the group sizes can be calculated, and then a filter is applied (hidden from the user so that it cannot be switched off).

I know I should provide code, but I do not know where to start here. Has anyone had similar issues and found a potential solution to all or part of the problem? Has anyone built a hidden filter into their rpivotTable? Has anyone been able to restrict their output to only show 90% of their data?

Thanks,
J

James Oliver

1 Answer

To be absolutely sure, you would need to load a data frame of the form "dim, dim, dim, count", where count is always at least 7. This is just a bit of preprocessing on your input data. Unfortunately, it means you will be restricted to a small number of coarse dimensions; otherwise you will end up filtering out everything.
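
As a rough illustration of that preprocessing, here is a minimal sketch in base R. The data, the column names (`region`, `segment`), and the helper `suppress_small_groups` are all hypothetical and only stand in for the real customer table; the threshold of 7 comes from the question.

```r
# Hypothetical customer-level data: one row per customer.
# Column names and values are made up for illustration.
customers <- data.frame(
  region  = c(rep("North", 10), rep("South", 8), rep("West", 3)),
  segment = c(rep("A", 10), rep("A", 8), rep("B", 3)),
  stringsAsFactors = FALSE
)

# Minimum group size that may be shown (taken from the question).
min_group_size <- 7

# Hypothetical helper: collapse to one row per combination of `dims`
# with a `count` column, then drop every cell smaller than `min_n`.
suppress_small_groups <- function(df, dims, min_n) {
  df$count <- 1L
  cells <- aggregate(df["count"], by = df[dims], FUN = sum)
  cells[cells$count >= min_n, , drop = FALSE]
}

# Only the aggregated, filtered cells would leave the server.
safe_cells <- suppress_small_groups(customers, c("region", "segment"), min_group_size)
print(safe_cells)
```

Because every remaining cell has at least 7 customers, any coarser grouping built from these cells is a sum of such cells and therefore also meets the threshold. If this table were then handed to rpivotTable, the pivot would presumably need to sum the `count` column (for example with `aggregatorName = "Integer Sum"` and `vals = "count"`) rather than count rows; treat that wiring as an assumption to verify against your own setup.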

nicolaskruchten
  • This is exactly what I want to avoid and why I chose shiny. Shiny means the data is stored on premise, versus using Power BI where it is in Microsoft's cloud. I want to have dynamic data exploration available to my data science team, and the business... – James Oliver Jan 17 '19 at 10:04
  • Aren't you one of the authors of rpivotTable? This suggests to me that what I want is not doable. Is there something I can do that might not be 100% sure? – James Oliver Jan 17 '19 at 10:06
  • I am the author of the JS library which powers rpivotTable, and I help out with the R bit also, yes. What you're trying to do is probably not easy if possible at all IMO unfortunately. Keep in mind that with rpivotTable, the raw data is leaving your server and getting sent to the browser, so you're likely to leak a LOT of information if you don't filter on the server/in the dataframe. – nicolaskruchten Jan 17 '19 at 14:16
  • What do you mean by leak information? Do you mean those that fall under the threshold of 7 cases or do you mean it in another way? I thought hosting it locally and then giving access via shiny would avoid data leakage – James Oliver Jan 20 '19 at 12:53
  • The way that rpivotTable works is that all records are sent from the server to the browser and aggregated there. So anyone who does "view source" or opens developer tools will have access to the full dataframe you pass to rpivotTable. If you're concerned about disclosure, you need to take this into account :) – nicolaskruchten Jan 21 '19 at 16:22