
In a large dataset of US stocks I have an integer variable containing SIC codes. https://www.sec.gov/info/edgar/siccodes.htm

I would like to create a dummy variable indicating major group 50, i.e. a variable that equals 1 for durable goods and 0 otherwise.

I tried the code:

data$durable <- as.integer(grepl(pattern = "50", x = data$sic))

But this, of course, does not take the hierarchical structure of SIC into account. I want to get the "50" only for the first two digits.

(New to R)

/Alex

2 Answers


Either use integer division by 100, or left-pad the code with zeros to four digits and check the first two characters.

code <- c(100, 102, 501, 5010)

# approach 1
as.integer(as.integer(code/100) == 50)

# approach 2
as.integer(substring(sprintf("%04d", code), 1, 2) == "50")
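To see why the `grepl` attempt from the question overmatches, and how the division approach carries over to a data frame column, here is a minimal sketch. The names `data` and `sic` follow the question; the sample values are made up for illustration, and the codes are assumed to be stored as integers with any leading zero dropped.

```r
# Hypothetical sample data; `data` and `sic` follow the naming in the question.
data <- data.frame(sic = c(100L, 1500L, 2050L, 5010L, 5099L, 5150L))

# grepl() matches "50" anywhere in the code, so 1500, 2050 and 5150
# are flagged even though they are not in major group 50.
grepl_flag <- as.integer(grepl(pattern = "50", x = data$sic))

# Integer division by 100 keeps only the major group,
# i.e. the first two digits of the four-digit code.
data$durable <- as.integer(data$sic %/% 100 == 50)

print(grepl_flag)
print(data$durable)
```

Only 5010 and 5099 end up flagged in `data$durable`, whereas `grepl_flag` marks every code except 100.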
Kota Mori
library(readxl)
library(dplyr)
library(stringi)

data_sic <- read_excel("./sic_example.xlsx")

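# Pad to four digits first so that codes stored without a leading zero
# (e.g. 501 for 0501) are not misread as major group 50.
data_sic$temp1 <- stri_sub(sprintf("%04d", as.integer(data_sic$SIC)), 1, 2)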

data_sic <- mutate(data_sic, durable_indicator =
                     ifelse(temp1 == "50", 1, 0))

str(data_sic)

Output:

str(data_sic)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   6 obs. of  4 variables:
 $ SIC              : num  4955 4961 4991 5000 5010 ...
 $ Industry Title   : chr  "HAZARDOUS WASTE MANAGEMENT" "STEAM & AIR-CONDITIONING SUPPLY" "COGENERATION SERVICES & SMALL POWER PRODUCERS" "WHOLESALE-DURABLE GOODS" ...
 $ temp1            : chr  "49" "49" "49" "50" ...
 $ durable_indicator: num  0 0 0 1 1 1

Addendum:

There are multiple ways to approach this problem.
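One such alternative, sketched here in base R, is a plain range check, since major group 50 spans the four-digit codes 5000 through 5099. The vector `sic` and its values are hypothetical, and the sketch assumes the codes are numeric with leading zeros dropped.

```r
# Hypothetical codes; assumes numeric SIC values with leading zeros dropped.
sic <- c(4955, 4961, 4991, 5000, 5010, 5099, 5100, 501)

# Major group 50 covers the four-digit codes 5000 through 5099.
durable_indicator <- as.integer(sic >= 5000 & sic <= 5099)

# The major group itself can be kept as a column for other dummies.
major_group <- sic %/% 100

print(durable_indicator)
print(major_group)
```

Keeping `major_group` around makes it easy to build indicators for other groups without repeating the string handling.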

I would suggest reviewing the stringi package documentation for string editing.

The caret package documentation is also worth a look for creating dummy variables and other statistical transformations.

Prometheus