1

I am trying to determine the number of verbs (n) in each tense and aspect (TA) for each text (filename) in my dataset. I have my data saved as a tibble (see below), but the values in the n column are not yet accurate because some categories subsume others. For instance, to get an accurate count of past_simple, I need to subtract the number (n) for past_perfect and present_progressive in that same text. Essentially, I am looking to do something like this:

  1. Subtract the value (n) of past_perfect and past_progressive from the value of past_simple for filename BIO.GO.01.1

  2. Repeat this process for each individual file

    tib <- structure(
  list(
    TA = c(
      "past_perfect",
      "past_progressive",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_perfect",
      "past_progressive"
    ),
    tense = c(
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past"
    ),
    aspect = c(
      "perfect",
      "progressive",
      "simple",
      "simple",
      "simple",
      "simple",
      "simple",
      "simple",
      "perfect",
      "progressive"
    ),
    filename = c(
      "BIO.G0.01.1",
      "BIO.G0.01.1",
      "BIO.G0.01.1",
      "BIO.G0.02.1",
      "BIO.G0.02.2",
      "BIO.G0.02.4",
      "BIO.G0.02.5",
      "BIO.G0.02.6",
      "BIO.G0.03.1",
      "BIO.G0.03.1"
    ),
    discipline = c(
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO"
    ),
    nativeness = c("NS", "NS", "NS",
                   "NS", "NS", "NS", "NS", "NS", "NS", "NS"),
    year = c("G0",
             "G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0"),
    gender = c("F",
               "F", "F", "M", "M", "M", "M", "M", "F", "F"),
    n = c(2L, 2L,
          57L, 39L, 3L, 4L, 49L, 103L, 1L, 1L)
  ),
  class = c("grouped_df",
            "tbl_df", "tbl", "data.frame"),
  row.names = c(NA,-10L),
  groups = structure(
    list(
      filename = c(
        "BIO.G0.01.1",
        "BIO.G0.02.1",
        "BIO.G0.02.2",
        "BIO.G0.02.4",
        "BIO.G0.02.5",
        "BIO.G0.02.6",
        "BIO.G0.03.1"
      ),
      discipline = c("BIO", "BIO", "BIO", "BIO", "BIO", "BIO",
                     "BIO"),
      nativeness = c("NS", "NS", "NS", "NS", "NS", "NS",
                     "NS"),
      year = c("G0", "G0", "G0", "G0", "G0", "G0", "G0"),
      gender = c("F", "M", "M", "M", "M", "M", "F"),
      .rows = structure(
        list(1:3, 4L, 5L, 6L, 7L, 8L, 9:10),
        ptype = integer(0),
        class = c("vctrs_list_of",
                  "vctrs_vctr", "list")
      )
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA,-7L),
    .drop = TRUE
  )
)

I know how to subtract rows from other rows based on their position, like this:

tib[3, 9] <- tib[3, 9] - tib[2, 9] - tib[1, 9]

But the rows do not always appear in this predictable order because not all TA options are present in each text (filename). I'm also not sure how to write the code to restart this process again each time it comes across a new filename.

I am still learning how to manipulate data in R. Any suggestions would be very much appreciated!

LilyLew
  • 25
  • 5
  • Try `library(dplyr);tib %>% group_by(filename) %>% mutate(n1 = n[TA == 'past_simple'][1] - n[TA == 'past_progressive'][1] - n[TA == 'past_perfect'][1]) %>% ungroup` – akrun Feb 09 '23 at 20:41
  • @akrun Thank you for responding. I managed to add the code in my question above. I tried the code you suggested but it did not quite return the results I was looking for. I'm not sure how to show the results here. Am I supposed to add this to my question? – LilyLew Feb 09 '23 at 21:01
  • Sorry, didn't see your updated post. Would this work for you? `tib %>% mutate(n = replace(n, TA == 'past_simple', n[TA == 'past_simple'][1] - n[TA == 'past_progressive'][1] - n[TA == 'past_perfect'][1])) %>% ungroup`. There are some NAs because those doesn't have some of the values in 'TA' – akrun Feb 09 '23 at 21:08
  • I guess, you wanted to update the 'n' column where the 'TA' value is 'past_simple' for each grouping? If that is the case, the solution I posted should work – akrun Feb 09 '23 at 21:19

1 Answers1

0

Based on the logic showed, we may need a group by operation. It may be better to reshape to wide before doing a join if there are some elements in 'TA' missing

library(dplyr)
library(tidyr)
tib %>%
   ungroup %>% 
   select(-tense, -aspect) %>%
   pivot_wider(names_from = TA, values_from = n, values_fill = 0) %>% 
   mutate(n1 = past_simple - past_progressive - past_perfect,  
       TA = 'past_simple', .keep = 'unused') %>% 
   left_join(tib %>% ungroup, .) %>% 
   mutate(n = coalesce(n1, n), .keep = 'unused')

-output

# A tibble: 10 × 9
   TA               tense aspect      filename    discipline nativeness year  gender     n
   <chr>            <chr> <chr>       <chr>       <chr>      <chr>      <chr> <chr>  <int>
 1 past_perfect     past  perfect     BIO.G0.01.1 BIO        NS         G0    F          2
 2 past_progressive past  progressive BIO.G0.01.1 BIO        NS         G0    F          2
 3 past_simple      past  simple      BIO.G0.01.1 BIO        NS         G0    F         53
 4 past_simple      past  simple      BIO.G0.02.1 BIO        NS         G0    M         39
 5 past_simple      past  simple      BIO.G0.02.2 BIO        NS         G0    M          3
 6 past_simple      past  simple      BIO.G0.02.4 BIO        NS         G0    M          4
 7 past_simple      past  simple      BIO.G0.02.5 BIO        NS         G0    M         49
 8 past_simple      past  simple      BIO.G0.02.6 BIO        NS         G0    M        103
 9 past_perfect     past  perfect     BIO.G0.03.1 BIO        NS         G0    F          1
10 past_progressive past  progressive BIO.G0.03.1 BIO        NS         G0    F          1
akrun
  • 874,273
  • 37
  • 540
  • 662