3

end_result_tbl

This end_result_tbl is an example from a different voter file in ideal format.

ID     GEN_16  GEN_14  GEN_08  PP_16  PR_16  PR_15  PR_14
0001     1       1       1       1      0      0       0
0002     0       0       0       0      1      0       1
0003     1       1       1       0      0      0       0
0004     1       0       1       0      0      0       1
0005     1       0       1       1      1      0       1

raw_data_tbl

ID     Voter_History  
0001   GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
0002   2016 GENERAL ELECTION;2014 GENERAL ELECTION
0003   20121106 GE;20081104 GE;General Election 2006
0004   GE 20150910
0005   16 GENERAL ELECTION; 14 PRIMARY ELECTION

Looking to make variables for each election out of conditional string matches for each string of text.

Each election has about 9 iterations. if one iteration is matched for an election, then a "1" is placed to show a VOTE in that election, if none are matched, then a "0" for a NO VOTE.

Below are the iterations for the 2016 November General Election

GEN_16<-c("20161108 GE",
          "16 GENERAL ELECTION",
          "GENERAL 2016",
          "GENERAL ELECTION 2016", 
          "2016 GENERAL ELECTION", 
          "GENERAL ELECTION, 2016",
          "16 GENERAL ELECTION",
          "GE 20161108")

Here is what I have tried (attempting only 2016 General Election):

 raw_data_tbl$GEN_16<- 
 as.integer(stri_detect(raw_data_tbl$Voter_History,GEN_16))

 which(GEN_16%in%raw_data_tbl$Voter_History

require(dplyr)
Sequences <- GEN_16
Database <- raw_data_tabl$Voter_History

df=as.data.frame(sapply(Sequences, function(x) grep(x,Database)))
stats=df %>% summarise_all(funs(sum))
cbind(Sequences,as.numeric(stats))

this is actually a quite simple albeit super long code in sql but find it's equivalent in R hard to find.

raw_data_tabl has about 17 million voters in it.

any direction is super appreciated, thanks in advance.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Jorge
  • 83
  • 1
  • 5

1 Answers1

2

You can try this -

library(stringr)
library(tidyverse)

#read input file
txt <- readLines("test.txt")

#put delimiter between columns and transform it into a dataframe
txt <- gsub("\\s+(.*)", ",\\1", txt)
df <- read.table(textConnection(txt), 
                header = T, stringsAsFactors = F, sep = ",", colClasses = c("ID" = "character"))

Initial dataframe looks like

> df
#    ID                                                         Voter_History
#1 0001 GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
#2 0002                           2016 GENERAL ELECTION;2014 GENERAL ELECTION
#3 0003                         20121106 GE;20081104 GE;General Election 2006
#4 0004                                                           GE 20150910
#5 0005                              16 GENERAL ELECTION; 14 PRIMARY ELECTION

Clean Voter_History column's data to extract useful information

election_func <- function(x){
  #extract year
  yr <- gsub("20", "", substr(str_extract_all(strsplit(x, split=";")[[1]], "[0-9]+"), 1, 4))
  #extract election type
  elec_type <- toupper(substr(str_extract(strsplit(x, split=";")[[1]], '[A-Za-z]+'), 1, 2))

  return(paste(sort(paste(elec_type, yr, sep="_")), collapse = ";"))
  }

df$Voter_History <- do.call(rbind, lapply(df$Voter_History, function(x) election_func(x)))

Cleaned data is

> df
#    ID                 Voter_History
#1 0001 GE_04;GE_08;GE_12;GE_16;PP_08
#2 0002                   GE_14;GE_16
#3 0003             GE_06;GE_08;GE_12
#4 0004                         GE_15
#5 0005                   GE_16;PR_14

Finally transform this data in the desired format

df1 <- df %>%
  separate_rows("Voter_History", sep= ";") %>%
  distinct(ID, Voter_History) %>%
  mutate(value = 1) %>%
  spread(Voter_History, value, fill = 0)

df1
#    ID GE_04 GE_06 GE_08 GE_12 GE_14 GE_15 GE_16 PP_08 PR_14
#1 0001     1     0     1     1     0     0     1     1     0
#2 0002     0     0     0     0     1     0     1     0     0
#3 0003     0     1     1     1     0     0     0     0     0
#4 0004     0     0     0     0     0     1     0     0     0
#5 0005     0     0     0     0     0     0     1     0     1


Sample data: test.txt contains

ID     Voter_History  
0001   GE 20161108;20121106 GE;20081104 GE;20080205 PP;General Election 2004
0002   2016 GENERAL ELECTION;2014 GENERAL ELECTION
0003   20121106 GE;20081104 GE;General Election 2006
0004   GE 20150910
0005   16 GENERAL ELECTION; 14 PRIMARY ELECTION


(Update - added a logic to resolve Error: Duplicate identifiers for rows.... This was happening because of duplicate ID & Voter_History combination in spread call)

Prem
  • 11,775
  • 1
  • 19
  • 33
  • thanks for your answer Prem. it is almost there i believe. I keep getting error duplicate identifiers in the ouput. for example: "Error: Duplicate identifiers for rows (117, 118, 119, 120, 121), (207, 208, 209), (294, 295), (312, 313), (360, 361)..." Was thinking of going about using gsub() make the initial dataset more uniform – Jorge May 20 '18 at 16:42