I loaded a small sample of around 10 rows and 10 columns.

library(tidyverse)

# A tibble: 7 x 9
  case_id             scenario  alert_number random_ref_id  amount code_type is_cred_debt source type 
  <chr>               <chr>            <dbl> <chr>           <dbl>     <dbl> <chr>        <chr>  <chr>
1 2020500ZSJU45679007 Anomalies      8796964 xxxxdg6yht78lj  2137.       100 D            xdd    CASH 
2 2020500ZSJU45679007 Anomalies      8796964 xxxxdg6yht78lj  2137.       100 D            xdd    CASH 
3 2020500ZSJU45679007 Anomalies      8796964 xxxxdg6yht78lj  2137.       100 D            xdd    CASH 
4 2020500ZSJU45679007 Anomalies      8796964 xxxxdg6yht78lj  2137.       100 D            xdd    CASH 
5 2020500ZSJU45679111 Patterns       8678867 xxxykhkh67hhg   6000        200 C            CFT    WIRE 
6 2020500ZSJU45679111 Patterns       8678867 xxxykhkh67hhg   7000        200 C            CFT    WIRE 
7 2020500ZSJU45679111 Patterns       8678867 xxxykhkh67hhg  24000        200 C            CFT    WIRE 
df <-
  as.data.frame(
    structure(
      list(
        case_id = c(
          "2020500ZSJU45679007",
          "2020500ZSJU45679007",
          "2020500ZSJU45679007",
          "2020500ZSJU45679007",
          "2020500ZSJU45679111",
          "2020500ZSJU45679111",
          "2020500ZSJU45679111"
        ),
        scenario = c(
          "Anomalies",
          "Anomalies",
          "Anomalies",
          "Anomalies",
          "Patterns",
          "Patterns",
          "Patterns"
        ),
        alert_number = c(8796964, 8796964, 8796964, 8796964, 8678867, 8678867, 8678867),
        random_ref_id = c(
          "xxxxdg6yht78lj",
          "xxxxdg6yht78lj",
          "xxxxdg6yht78lj",
          "xxxxdg6yht78lj",
          "xxxykhkh67hhg",
          "xxxykhkh67hhg",
          "xxxykhkh67hhg"
        ),
        amount = c(2136.76, 2136.76, 2136.76, 2136.76, 6000, 7000, 24000),
        code_type = c(100, 100, 100, 100, 200, 200, 200),
        is_cred_debt = c("D", "D", "D", "D", "C", "C", "C"),
        source = c("xdd", "xdd", "xdd", "xdd", "CFT", "CFT", "CFT"),
        type = c("CASH", "CASH", "CASH", "CASH", "WIRE", "WIRE", "WIRE")
      ),
      class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"),
      row.names = c(NA, -7L),
      spec = structure(list(cols = list(

I would like to know whether there are techniques that, starting from this 10-row sample, can simulate a bigger sample of, say, 100 entries, where each observation is randomly generated.

Considering that:

  1. case_id is a random string for each observation
  2. scenario can either be Anomalies or Patterns
  3. alert_number is a random string, the same for each case_id
  4. random_ref_id is a random string, the same for each case_id
  5. amount can be a varying number between 0 and 100000
  6. code_type can either be 100 or 200, the same for each case_id
  7. is_cred_debt can either be D or C, the same for each case_id
  8. source can either be xdd or CFT, the same for each case_id
  9. type can either be CASH or WIRE, the same for each case_id

While I know how to do the reverse (drawing a random sample of, say, 10 observations from an initial df of, say, 100), it's not clear to me how to generate a random simulation starting from this 10-observation sample.

Any hint would be much appreciated.

StupidWolf
chopin_is_the_best
  • I've edited your `dput` to be not quite so wide, but it doesn't seem to be complete - it ends with `cols = list(`. – Gregor Thomas Jan 07 '21 at 16:35
  • A couple of clarification questions: 1) is there only one unique `case_id` (according to your example data)? 2) would you like to sample such that the class proportions remain roughly the same for each variable? – latlio Jan 08 '21 at 11:25
  • @latlio 1) I added 2 different `case_id` values, so that I can get a bigger sample with different case_ids. 2) Yes, I would like the class proportions to stay similar across the variables! Thanks a lot for your help! – chopin_is_the_best Jan 08 '21 at 16:28
  • I'm not sure your edit for `case_id` appears. Also, each `ref_id` is distinct, which makes sense; however, your `case_id` is not distinct. For example, per your example, the same `case_id` has different values for `code_type`, `is_cred_debt`, etc. Did you mean "the same for each `ref_id`"? – latlio Jan 08 '21 at 18:49
  • @latlio there you go! – chopin_is_the_best Jan 08 '21 at 22:16

1 Answer

Simulating datasets is definitely not a trivial task, and there are probably many things you'll want to consider that I haven't thought of here.

However, here's some starter code that may give you some insights.

library(tidyverse)

set.seed(2021)

# convert every column except amount to factor so levels() can be used
tidy_df <- df %>%
  mutate(across(-amount, as.factor))

# sample n values of `var` with replacement, weighted by the original proportions
generate_sample <- function(df, var, n) {
  sample(levels(df[[var]]),
         size = n,
         replace = TRUE,
         prob = prop.table(table(df[[var]])))
}

# simulate case_id, scenario, and amount
sim_df <- tibble(
  case_id = generate_sample(tidy_df, "case_id", 20),
  scenario = generate_sample(tidy_df, "scenario", 20),
  amount = sample(0:100000, 20, replace = TRUE)
) %>%
  # the remaining variables are constant within case_id, so join them back;
  # distinct() keeps one row per case_id so the join does not duplicate rows
  left_join(
    tidy_df %>%
      select(-c(scenario, amount)) %>%
      distinct() %>%
      mutate(case_id = as.character(case_id)),
    by = "case_id"
  )
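If you want brand-new `case_id` values rather than resampling the ones already in the data, one option is to simulate a small table with one row per case and then repeat those rows. The following is only a minimal base R sketch of the constraints listed in the question. The helpers `random_string()` and `simulate_cases()` are hypothetical names, and the string lengths, the 7-digit range for `alert_number`, the uniform sampling, and treating `scenario` as constant within a case are all assumptions on my part.

```r
set.seed(2021)

# Hypothetical helper: n random alphanumeric strings of the given length
random_string <- function(n, length = 14) {
  chars <- c(letters, LETTERS, 0:9)
  vapply(
    seq_len(n),
    function(i) paste(sample(chars, length, replace = TRUE), collapse = ""),
    character(1)
  )
}

# Hypothetical helper: simulate n_rows observations spread over n_cases cases
simulate_cases <- function(n_rows = 100, n_cases = 25) {
  stopifnot(n_rows >= n_cases)

  # One row per case: every attribute that must be the same within a case_id
  # (assumption: all levels are sampled uniformly)
  cases <- data.frame(
    case_id       = random_string(n_cases, 19),
    scenario      = sample(c("Anomalies", "Patterns"), n_cases, replace = TRUE),
    alert_number  = sample(1000000:9999999, n_cases),
    random_ref_id = random_string(n_cases, 14),
    code_type     = sample(c(100, 200), n_cases, replace = TRUE),
    is_cred_debt  = sample(c("D", "C"), n_cases, replace = TRUE),
    source        = sample(c("xdd", "CFT"), n_cases, replace = TRUE),
    type          = sample(c("CASH", "WIRE"), n_cases, replace = TRUE),
    stringsAsFactors = FALSE
  )

  # Every case appears at least once; the remaining rows are assigned at random
  idx <- c(seq_len(n_cases),
           sample.int(n_cases, n_rows - n_cases, replace = TRUE))
  sim <- cases[sort(idx), ]

  # amount varies per observation, between 0 and 100000
  sim$amount <- round(runif(n_rows, min = 0, max = 100000), 2)
  rownames(sim) <- NULL
  sim
}

sim_df <- simulate_cases(n_rows = 100, n_cases = 25)
head(sim_df)
```

To keep the class proportions close to those in the original sample, you could pass a `prob` argument to each `sample()` call, for example the output of `prop.table(table(df$type))`, instead of sampling uniformly.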
latlio