I would like to run a regression on a training data frame that I have put into tidy text format. The original data include participants with a noted developmental disability and participants who may or may not have one. From a larger tidy text data frame, I created a data frame that picks out key words in my text files and records how many times each word occurs in each document. Participants with a noted disability have "D" in front of their first name. It looks like this:

Name of Text File     Word       n
    DAdam            autism      3
    DAdam             adhd       2
    DJane            autism      1
     Mark             adhd       4
     Joey             add        3

I then added binary variables to denote whether each word occurred, with 1 for yes and 0 for no:

df$autism <- if_else(df$Word == "autism", 1, 0)

So now the data frame looks like this:

Name of Text File     Word       n   autism  adhd   add
    DAdam            autism      3      1     0      0 
    DAdam             adhd       2      0     1      0
    DJane            autism      1      1     0      0
     Mark             adhd       4      0     1      0 
     Joey             add        3      0     0      1

I would like it to look like this:

   Name of Text File    autism  adhd   add
    DAdam                  1     1      0 
    DJane                  1     0      0
     Mark                  0     1      0 
     Joey                  0     0      1

Then I would like to run a regression to try to predict whether a particular participant is likely to have a developmental disability.

Thank you!

  • In terms of regression, you should not throw away data but use the frequencies as weights and look for some additional features to obtain better outcomes. – OzanStats Aug 08 '18 at 15:00
  • Correct me if I'm wrong, but this question is more about reshaping data *in preparation for* regression, not the regression itself – camille Aug 08 '18 at 16:11
  • camille: It's both, but I do need to prepare the data before I can run the regression. – Danielle Strauss Aug 08 '18 at 16:20

4 Answers

A combination of tidyr and dplyr can get you there. Starting from your tidytext data.frame, spread the data into wide format, then mutate every column after the first so that a missing value becomes 0 and any count becomes 1.

library(dplyr)
library(tidyr)

df1 %>% 
  spread(Word, n) %>% 
  mutate_at(-1, function(x) ifelse(is.na(x), 0, 1))

  Name_of_Text_File add adhd autism
1             DAdam   0    1      1
2             DJane   0    0      1
3              Joey   1    0      0
4              Mark   0    1      0

data:

df1 <- structure(list(Name_of_Text_File = c("DAdam", "DAdam", "DJane", 
"Mark", "Joey"), Word = c("autism", "adhd", "autism", "adhd", 
"add"), n = c(3L, 2L, 1L, 4L, 3L)), class = "data.frame", row.names = c(NA, 
-5L))
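
In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the same reshaping with the newer function, assuming the df1 defined above; the helper column present and the result name wide are illustrative and not from the question:

```r
library(dplyr)
library(tidyr)

# Same toy data as df1 above
df1 <- data.frame(
  Name_of_Text_File = c("DAdam", "DAdam", "DJane", "Mark", "Joey"),
  Word = c("autism", "adhd", "autism", "adhd", "add"),
  n = c(3L, 2L, 1L, 4L, 3L),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  mutate(present = 1L) %>%   # every existing row means the word occurred
  select(-n) %>%             # drop the counts; only presence is kept
  pivot_wider(
    names_from  = Word,
    values_from = present,
    values_fill = list(present = 0L)  # absent name/word pairs become 0
  )

wide
```

To keep the counts rather than 0/1 flags, one option would be to skip the mutate() and select() steps and use values_from = n with values_fill = list(n = 0L).
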
phiver

Similar to the other answer, but set n to 1 first and let spread() fill the missing combinations with 0:

library(dplyr)
library(tidyr)

df1 %>% 
    mutate(n = 1) %>%  
    spread(Word, n, fill = 0)
#   Name_of_Text_File add adhd autism
# 1             DAdam   0    1      1
# 2             DJane   0    0      1
# 3              Joey   1    0      0
# 4              Mark   0    1      0
AndS.

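You could also use summarise() to get the desired output, starting from the data frame that already has the binary columns added:

library(dplyr)
# Assumes df1 already contains the 0/1 columns autism, add and adhd from the question
df2 <- df1 %>% group_by(Name_of_Text_File) %>%
  summarise(autism = sum(autism), add = sum(add), adhd = sum(adhd))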
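
The summarise() call above needs the 0/1 columns to exist already. Here is a self-contained sketch under that assumption, building the indicators from the Word column first. The name indicators is made up for illustration, and max() is used here instead of sum() so the result stays 0/1 even if a word were listed twice for one file:

```r
library(dplyr)

# Toy data in the shape shown in the question
df1 <- data.frame(
  Name_of_Text_File = c("DAdam", "DAdam", "DJane", "Mark", "Joey"),
  Word = c("autism", "adhd", "autism", "adhd", "add"),
  n = c(3L, 2L, 1L, 4L, 3L),
  stringsAsFactors = FALSE
)

indicators <- df1 %>%
  # one 0/1 flag per key word, one row per (file, word) pair
  mutate(
    autism = as.integer(Word == "autism"),
    adhd   = as.integer(Word == "adhd"),
    add    = as.integer(Word == "add")
  ) %>%
  # collapse to one row per file: 1 if the word occurred in any row
  group_by(Name_of_Text_File) %>%
  summarise(autism = max(autism), adhd = max(adhd), add = max(add))

indicators
```
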

SmitM

If you have text in a tidy format and you want it in a format suitable for modeling, you typically want to cast it with one of tidytext's cast_ functions. I often use cast_sparse(), especially if I want to do glmnet modeling.

You would start out like so:

library(tidyverse)
library(tidytext)

df <- tribble(~name,  ~disability, ~word, ~count,
              "Adam",  TRUE,   "autism", 3,
              "Adam",  TRUE,   "adhd",   2,
              "Jane",  TRUE,   "autism", 1,
              "Mark",  FALSE,  "adhd",   4,
              "Joey",  FALSE,  "add",    3)

sparse_words <- df %>%
  cast_sparse(name, word, count)

sparse_words
#> 4 x 3 sparse Matrix of class "dgCMatrix"
#>      autism adhd add
#> Adam      3    2   .
#> Jane      1    .   .
#> Mark      .    4   .
#> Joey      .    .   3
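
Since the question asks for 0/1 indicators rather than counts, the same cast can be done on a presence flag. This is a sketch assuming the toy data from this answer, rebuilt here as a plain data frame; the column present and the object sparse_binary are illustrative names:

```r
library(dplyr)
library(tidytext)

# Same toy data as df above
df <- data.frame(
  name = c("Adam", "Adam", "Jane", "Mark", "Joey"),
  disability = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  word = c("autism", "adhd", "autism", "adhd", "add"),
  count = c(3, 2, 1, 4, 3),
  stringsAsFactors = FALSE
)

sparse_binary <- df %>%
  mutate(present = 1) %>%            # flag every observed name/word pair
  cast_sparse(name, word, present)   # unobserved pairs stay empty (zero)

sparse_binary
```
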

Then you could go on to use this sparse matrix in any kind of machine learning model that likes matrix input (that's most of them!). Here, let's walk through how to just make a simple data frame and fit a toy regression.

df_model <- sparse_words %>% 
  as.matrix() %>% 
  as_tibble() %>% 
  bind_cols(df %>% 
              distinct(name, disability) %>%
              select(disability))

df_model
#> # A tibble: 4 x 4
#>   autism  adhd   add disability
#>    <dbl> <dbl> <dbl> <lgl>     
#> 1      3     2     0 TRUE      
#> 2      1     0     0 TRUE      
#> 3      0     4     0 FALSE     
#> 4      0     0     3 FALSE

lm(disability ~ ., data = df_model)
#> 
#> Call:
#> lm(formula = disability ~ ., data = df_model)
#> 
#> Coefficients:
#> (Intercept)       autism         adhd          add  
#>      0.8000       0.2000      -0.2000      -0.2667

Created on 2018-08-14 by the reprex package (v0.2.0).
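
For a yes/no outcome, logistic regression is a more usual choice than lm(). Below is a rough sketch under two assumptions that are not part of the answer above: the outcome is derived from the "D" prefix described in the question (which would mislabel any ordinary name that starts with D, such as a hypothetical "Daniel"), and the names model_df and fit are illustrative. With only four rows the classes are perfectly separated, so glm() is expected to emit warnings and the coefficients are not meaningful; the point is only the shape of the workflow.

```r
library(dplyr)
library(tidyr)

# Toy data in the shape shown in the question
df1 <- data.frame(
  Name_of_Text_File = c("DAdam", "DAdam", "DJane", "Mark", "Joey"),
  Word = c("autism", "adhd", "autism", "adhd", "add"),
  n = c(3L, 2L, 1L, 4L, 3L),
  stringsAsFactors = FALSE
)

model_df <- df1 %>%
  mutate(n = 1L) %>%
  spread(Word, n, fill = 0L) %>%   # one row per file, 0/1 per word
  # Assumption: a leading "D" marks a noted disability
  mutate(disability = as.integer(startsWith(Name_of_Text_File, "D"))) %>%
  select(-Name_of_Text_File)       # keep only predictors and outcome

# Logistic regression of the 0/1 outcome on the word indicators
fit <- glm(disability ~ ., data = model_df, family = binomial)
```

On a realistically sized data set, predict(fit, newdata = ..., type = "response") would then return a predicted probability for each new participant.
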

Julia Silge