0

I'm trying to change a data frame with several thousands of rows that each look like one of the following variations:

table, th, td {
  border: 1px solid black
}
<table>
  <tr>
    <th> a </th>
    <th> b </th>
    <th> c </th>
  </tr>
  <tr>
    <td> $10 x and $7 y </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
  <tr>
    <td> $70 a; $95 b </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
  <tr>
    <td> $6 j </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
</table>

and change it into this:

table, th, td {
  border: 1px solid black
}
    <table>
      <tr>
        <th> a </th>
        <th> b </th>
        <th> c </th>
      </tr>
      <tr>
        <td> $10 x and $7 y </td>
        <td> $10 x </td>
        <td> $7 y </td>
      </tr>
      <tr>
        <td> $70 a; $95 b </td>
        <td> $70 a </td>
        <td> $95 b </td>
      </tr>
      <tr>
        <td> $6 j </td>
        <td> $6 j </td>
        <td> NA </td>
      </tr>
    </table>

This is my current code to accomplish this (I use the number of dollar signs because that is the only consistent value to determine number of transactions):

(This is formatted as a data.table, just in case that makes a difference)

  df$b[(str_count(df$a, pattern = "\\$") == 2)] = unlist(strsplit(df$a, " and "))[1]
  df$c[(str_count(df$a, pattern = "\\$") == 2)] = unlist(strsplit(df$a, " and "))[2]
  df$b[str_count(df$a, pattern = "\\$") < 2] = df$a 

Now, instead of the desired result, I get the following:

table, th, td {
  border: 1px solid black
}
<table>
  <tr>
    <th> a </th>
    <th> b </th>
    <th> c </th>
  </tr>
  <tr>
    <td> $10 x and $7 y </td>
    <td> $10 x </td>
    <td> $7 y </td>
  </tr>
  <tr>
    <td> $70 a; $95 b </td>
    <td> $10 x</td>
    <td> $7 y</td>
  </tr>
  <tr>
    <td> $6 j </td>
    <td> $6 j </td>
    <td> NA </td>
  </tr>
</table>

Does anyone know how to fix this issue? I think it has to do with the fact that the strsplit() is taking the first subsetted row and applying it to every row in the subset, but I don't know how to change it to work properly.

2 Answers2

2

Don't try and write code to parse HTML, just call an HTML parser:

library(rvest)
library(tidyverse)

stage1 <- 
  "<table>
  <tr>
    <th> a </th>
    <th> b </th>
    <th> c </th>
  </tr>
  <tr>
    <td> $10 x and $7 y </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
  <tr>
    <td> $70 a; $95 b </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
  <tr>
    <td> $6 j </td>
    <td> NA </td>
    <td> NA </td>
  </tr>
</table>" %>% 
  rvest::minimal_html() %>% 
  rvest::html_node("table") %>% 
  rvest::html_table() %>% 
  as_tibble()

stage1

# A tibble: 3 x 3
  a              b     c    
  <chr>          <lgl> <lgl>
1 $10 x and $7 y NA    NA   
2 $70 a; $95 b   NA    NA   
3 $6 j           NA    NA   

Now clean up stage1 using separate and a regex

stage1 %>% 
  select(a) %>% 
  separate(col = "a", into = c("b", "c"), 
           sep = "(?ix) \\s* (and|;) \\s*",   # Perl stye regex, cases insensitive.
           remove = FALSE, 
           fill= "right")


  a              b     c    
  <chr>          <chr> <chr>
1 $10 x and $7 y $10 x $7 y 
2 $70 a; $95 b   $70 a $95 b
3 $6 j           $6 j  NA   
David T
  • 1,993
  • 10
  • 18
  • Sorry for the late reply - thanks for your excellent solution! I've got to get better with regex, I think! – Beck Addison Jun 15 '20 at 18:34
  • Lots of good material on regex out there. In the present context, look at https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions Although he doesn't cover Perl-style, or look-ahead/behind . But it's pretty readable, and the exercises are worth practicing. – David T Jun 15 '20 at 21:19
0

You can use str_split_fixed from stringr

stringr::str_split_fixed(df$a, '\\s*(;|and)\\s*', 2)

#       [,1]    [,2]   
#[1,] "$10 x" "$7 y" 
#[2,] "$70 a" "$95 b"
#[3,] "$6 j"  ""     
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213