
I'm pretty desperate with this so any help would be really appreciated.

I'm using R but have no problem using SAS as well if need be.

I have a data set with around 100k entries and 30 variables. I'd like to reduce it to 1k entries such that, for each of the 30 variables, the average in the reduced set is as close as possible to a given target value. The averages don't have to match exactly, and I realize an exact match is likely impossible.

2 Answers


Here's an example of one specific approach using fake data. If you have more specific requirements, those should be added to your question.

First I'll make some fake data with 30 columns. For example's sake, I'll make each column have random numbers with a mean and sd corresponding to the column. The first column will have an average and sd of 1, the 2nd an average and sd of 2, etc.

columns <- 30
samples <- 100000

library(tidyverse)
set.seed(42)

# Build in long format (one row per row/column pair), where column k is
# drawn from Normal(mean = k, sd = k), then spread back to wide format.
my_data <- tibble(row = rep(1:samples, times = columns),
                  col = rep(1:columns, each = samples),
                  value = rnorm(samples * columns, mean = col, sd = col)) %>%
  mutate(col = paste0("c", str_pad(col, 2, pad = "0"))) %>%
  spread(col, value)

Here I'll confirm that each column has the average and dispersion I expect:

my_data %>%
  gather(col, value, -row) %>%
  sample_n(10000) %>%    # 10k points are plenty and plot much faster than all 3 million
  ggplot(aes(col, value)) + 
  geom_point(alpha = 0.1, size = 0.5, color = "gray40") +
  geom_boxplot(fill = NA, outlier.shape = NA) +
  scale_y_continuous(breaks = 60 * (-2:2))   # breaks at multiples of 60

(Plot: one boxplot per column over a scatter of sampled points; both the center and the spread increase from c01 to c30, as constructed.)
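If you prefer a table to a plot, here's a numeric version of the same check, using the same tidyverse verbs:

my_data %>%
  gather(col, value, -row) %>%
  group_by(col) %>%
  summarize(mean = mean(value), sd = sd(value))   # both should run ~1 to ~30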

For this approach, I'll rank the rows based on how far they are from the average, and keep the ones that are closest. I'll define "closest to average" based on which row has the lowest RMS difference from the averages.

Here I'll calculate how far each value is from the average of its column, then for each row take the root-mean-square (RMS) of those differences across all 30 columns. I'll keep the 5 rows with the lowest RMS in a data frame called most_typical, and the 5 with the highest in least_typical.

my_data_how_typical <- my_data %>%
  gather(col, value, -row) %>%                # convert to long format
  group_by(col) %>%
  mutate(diff = value - mean(value)) %>%      # signed difference from that column's mean
  ungroup() %>%
  group_by(row) %>%
  summarize(rms_diff = sqrt(mean(diff ^ 2)))  # RMS difference across the 30 columns

most_typical  <- top_n(my_data_how_typical, 5, wt = -rms_diff)
least_typical <- top_n(my_data_how_typical, 5, wt = rms_diff)

Now I'll plot the five most and least typical rows against the whole. The green most_typical rows tend to stay near the averages, while the red least_typical rows have some extreme values. By keeping the rows with the smallest RMS difference from the averages (you could keep 1000 instead of just 5; see the sketch at the end of this answer), you'll end up with a subset whose averages are close to the overall averages. But that subset will necessarily also have much less variance than the original data, since it intentionally excludes rows with extreme values.

my_data %>%
  gather(col, value, -row) %>%
  sample_n(10000) %>%
  ggplot(aes(col, value)) + 
  geom_point(alpha = 0.1, size = 0.5, color = "gray40") +
  geom_line(data = least_typical %>% 
              select(row) %>%
              left_join(my_data, by = "row") %>%
              gather(col, value, -row),
            aes(col, value, group = row), color = "red") + 
  geom_line(data = most_typical %>% 
              select(row) %>%
              left_join(my_data, by = "row") %>%
              gather(col, value, -row),
            aes(col, value, group = row), color = "green")

(Plot: the same scatter, with the five least typical rows traced in red and the five most typical traced in green; the green lines hug the column means while the red lines swing to extremes.)

Your data will look different and your definition of "closest to average" may be different, but hopefully this will steer you in the right direction.
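If, as in the question, you have 30 specific target values rather than the data's own averages, the same ranking idea works with the targets swapped in. Here's a minimal sketch; the targets tibble below just uses the intended column means (1 through 30) as hypothetical stand-ins for your given numbers:

# Hypothetical targets: replace 1:columns with your 30 given numbers.
targets <- tibble(col = paste0("c", str_pad(1:columns, 2, pad = "0")),
                  target = 1:columns)

closest_1000 <- my_data %>%
  gather(col, value, -row) %>%
  left_join(targets, by = "col") %>%
  group_by(row) %>%
  summarize(rms_diff = sqrt(mean((value - target) ^ 2))) %>%  # RMS gap to targets
  top_n(1000, wt = -rms_diff) %>%    # keep the 1000 closest rows
  select(row) %>%
  left_join(my_data, by = "row")

# Check how close the subset's column means land:
closest_1000 %>%
  gather(col, value, -row) %>%
  group_by(col) %>%
  summarize(mean = mean(value)) %>%
  left_join(targets, by = "col")

The same caveat applies: the kept rows will have much less spread than the full data.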

Jon Spring

That's a really general question, but R is perfect for this.

Read in your file (I presume it's in .csv format?), e.g. my_data <- read.csv("folder/filename.csv", header = TRUE)

You can then subset the dataframe like so

my_data <- my_data[my_data$age >= 25 & my_data$age <= 30 & my_data$gender == "female" & my_data$income > 50000, ]

The above would only select females aged 25-30 with income > 50k

But if you want the subsetting to be based on averages, you can certainly do that too. E.g.

average_age <- mean(my_data$age)

my_data <- my_data[my_data$age >= (average_age - 2) & my_data$age <= (average_age + 2) & my_data$gender == "female" & my_data$income > 50000, ]

You can then write the dataframe back to csv format with

write.csv(my_data, "newfile.csv", row.names=FALSE)
stevec
  • So there's nothing I can do that will try to get the overall average for the data set close to these values as a whole though? The issue I have is that for 30 variables trying to manually reduce ranges that impact other values will be never ending. I didn't know if there was something similar to matching where I could tell it that I want the average for this dataset to be near this value for these variables. – paulblartmallfart Feb 09 '19 at 03:59
  • Yes, you certainly can. Simply find the average and apply whatever math you want to it (e.g. for age, you could take the average and subtract/add a couple of years). I have altered the answer to reflect. I do not know of a function that will do it for you though – stevec Feb 09 '19 at 04:04
  • I don't think I'm necessarily expressing what I want well enough, but I do appreciate you bearing with me. I think that what you've written will simply filter out values above or below the average by a certain amount. Because I'm using so many variables this won't necessarily shift the data to meet the specific averages. – paulblartmallfart Feb 09 '19 at 04:07
  • I do understand what you're asking, and agree it would be a useful function. Have you considered random sampling? That is, simply selecting a random 1000 entries from the dataset? e.g. `my_data[sample(1:nrow(my_data), 1000, replace = FALSE), ]` – stevec Feb 09 '19 at 04:13
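Building on that last comment: if a single random sample gets close but not close enough, one simple brute-force variant is to draw many random samples of 1000 and keep the one whose column means land nearest the targets. A sketch, assuming my_data has only numeric columns and targets holds your 30 given numbers:

set.seed(1)
targets <- colMeans(my_data)    # hypothetical stand-in; use your 30 given numbers

best_score  <- Inf
best_sample <- NULL
for (i in 1:500) {              # try 500 candidate samples
  candidate <- my_data[sample(nrow(my_data), 1000), ]
  score <- sqrt(mean((colMeans(candidate) - targets) ^ 2))  # RMS gap to targets
  if (score < best_score) {
    best_score  <- score
    best_sample <- candidate
  }
}
best_score    # RMS gap of the best sample's means from the targets

Unlike filtering for the most typical rows, this keeps the sample's variance close to the full data's, at the cost of a looser match on the means.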