I'm working with a mixed dataset (unique at the firm-year level) with related variables that look something like the following (but with many more variables of a similar nature), where:

  • "sec" is the sector the firm belongs to and doesn't change over the years;
  • "emp" is the total number of employees the firm employs;
  • "emp_deg" is the share of employees the firm employs that hold a degree;
  • "emp_nodeg" is the share of employees the firm employs that do not hold a degree;
  • "prof" is the amount of profits the firm earns for the year;
  • "x_ind" is an indicator equal to 1 if the firm exports goods or services and 0 otherwise;
  • "xgds_ind" is an indicator equal to 1 if the firm exports goods and 0 otherwise;
  • "xsvc_ind" is an indicator equal to 1 if the firm exports services and 0 otherwise;
  • "xgds_val" is the value of the firm's goods exports and is 0 if the firm does not export goods; and
  • "xsvc_val" is the value of the firm's services exports and is 0 if the firm does not export services.
id yr sec emp emp_deg emp_nodeg prof x_ind xgds_ind xsvc_ind xgds_val xsvc_val
1 19 a 10 0.5 0.5 305 1 0 1 0 200
1 20 a 20 0.6 0.4 400 0 0 0 0 0
1 21 a 15 0.4 0.6 105 1 1 1 20 230
2 19 b 10 0.5 0.5 349 1 0 1 0 200
2 20 b 9 0.3 0.7 293 0 1 0 54 0
2 21 b 83 0.8 0.2 243 0 0 0 0 0
3 19 c 103 0.6 0.4 125 0 0 0 0 0
3 20 c 50 0.5 0.5 234 0 0 0 0 0
3 21 c 25 0.1 0.9 392 0 0 0 0 0
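
For reproducibility, the sample rows above can be loaded into R as follows. This is a minimal sketch: the object name `data` is only chosen to match the snippet further down, and `txt` is a throwaway helper name.

```r
# Sample firm-year rows copied verbatim from the table above
txt <- "id yr sec emp emp_deg emp_nodeg prof x_ind xgds_ind xsvc_ind xgds_val xsvc_val
1 19 a 10 0.5 0.5 305 1 0 1 0 200
1 20 a 20 0.6 0.4 400 0 0 0 0 0
1 21 a 15 0.4 0.6 105 1 1 1 20 230
2 19 b 10 0.5 0.5 349 1 0 1 0 200
2 20 b 9 0.3 0.7 293 0 1 0 54 0
2 21 b 83 0.8 0.2 243 0 0 0 0 0
3 19 c 103 0.6 0.4 125 0 0 0 0 0
3 20 c 50 0.5 0.5 234 0 0 0 0 0
3 21 c 25 0.1 0.9 392 0 0 0 0 0"

# One row per firm-year; "sec" is read as character, everything else as numeric
data <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
str(data)
```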

I want to use clustering methods to identify groups of firms that are similar to one another, but I'm having some issues thinking about how to process the data/select the variables for clustering, and was hoping to get some advice on the following points:

  1. Is it alright to transform the data to a wide format before performing clustering (e.g., k-means, hierarchical, DBSCAN), or will there be issues with the clustering results since some of the variables could be correlated across years? If so, would running something like PCA on the transformed data help?
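library(dplyr)
# one row per firm; the time column is "yr", and the non-numeric "sec" columns are dropped before clustering
wide = data %>% reshape(direction = "wide", idvar = "id", timevar = "yr") %>% select(where(is.numeric), -id)
set.seed(123)
km = kmeans(wide, centers = 5, nstart = 50) # where 5 is selected from the elbow plot
hier = hclust(dist(wide))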
  2. Are there issues with PCA/clustering on the following types of variables?
  • Categorical variables that have been one hot encoded (e.g., "sec_a/b/c" = 0/1 for firm in sector a/b/c) [do I need to drop one of the categories when running PCA/clustering since a value of 0 for "sec_a" and "sec_b" implies that the value of "sec_c" is 1]
  • Variables that sum/can be derived from other variables (e.g., "emp_deg" and "emp_nodeg" sum to 1) [do I need to drop one of these variables? does the same issue apply if I have variables like "total_wagebill", "avg_deg_wage", "avg_nodeg_wage" on top of "emp", "emp_deg", "emp_nodeg"]
  • Variables that only take on a non-zero value for observations in a certain category (e.g., "xgds_val", "xsvc_val")
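
To make the second bullet concrete, here is a small sketch of what a perfectly dependent pair of variables does. It uses only the "emp_deg" values from the sample rows and rebuilds "emp_nodeg" as the complement. It illustrates the redundancy only; it is not meant to settle whether dropping one column is the right choice for a given method.

```r
# Degree shares from the sample rows; the no-degree share is the exact complement
emp_deg   <- c(0.5, 0.6, 0.4, 0.5, 0.3, 0.8, 0.6, 0.5, 0.1)
emp_nodeg <- 1 - emp_deg

# PCA on both shares: the second component has (numerically) zero variance,
# because the two columns carry the same information
pca <- prcomp(cbind(emp_deg, emp_nodeg))
print(pca$sdev)

# Euclidean distances computed on both shares are exactly sqrt(2) times the
# distances computed on one share, so keeping both just up-weights that dimension
d_both <- dist(cbind(emp_deg, emp_nodeg))
d_one  <- dist(emp_deg)
print(all.equal(as.numeric(d_both), sqrt(2) * as.numeric(d_one)))
```

The same linear dependence holds for a full set of one-hot columns, which sum to 1 in every row.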

For the first bullet under 2 (i.e., categorical variables), I understand that there are multiple ways to deal with this: (i) only work with continuous variables, (ii) run PCA/clustering normally after one-hot encoding, or (iii) use methods like FAMD to handle mixed data. Are there any other methods for handling such variables? For the other two bullets under 2, I would appreciate some advice on how to deal with such variables.

  3. For DBSCAN clustering, is there a widely used heuristic for choosing epsilon/minPts?
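
One commonly cited heuristic is the sorted k-distance plot: fix minPts, compute each point's distance to its k-th nearest neighbour, sort those distances, and read a candidate epsilon off the point where the curve bends sharply upward. The sketch below does this in base R on simulated data. The matrix `x`, the rule of thumb minPts = 2 × number of dimensions, and the names `min_pts` and `kth_dist` are assumptions made for illustration, not values derived from the firm data.

```r
set.seed(123)
# Hypothetical scaled feature matrix standing in for the firm-level data
x <- scale(matrix(rnorm(200), ncol = 2))

min_pts <- 2 * ncol(x)  # assumed rule of thumb: about twice the number of dimensions
k <- min_pts - 1        # neighbours counted excluding the point itself

# Distance from each point to its k-th nearest neighbour;
# position 1 of each sorted row is the point itself (distance 0)
d <- as.matrix(dist(x))
kth_dist <- apply(d, 1, function(row) sort(row)[k + 1])

# Sorted k-distance curve: a candidate eps is where the curve bends upward
plot(sort(kth_dist), type = "l",
     xlab = "points sorted by k-distance",
     ylab = paste0(k, "-NN distance"))
```

If the dbscan package is installed, its `kNNdistplot()` function draws the same kind of curve directly.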

Thanks in advance!

jess0192
  • Welcome to Stack Overflow [r], jess0192. Your question is quite broad, and it concerns statistics rather than their implementation in R specifically. Please consider moving your question to https://stats.stackexchange.com/ or another dedicated (ML) site (see also https://stackoverflow.com/help/how-to-ask). – I_O Jun 27 '23 at 11:25
