What am I doing wrong with super basic REGEX code?

Question

Suppose I have a CSV file (titled "Substance Related Deaths of Females in 2014") of a bunch of data that looks like (keep in mind this is a CSV file, and just a small sample, and it's also made-up data so the numbers aren't real):

Substance Related Deaths
of Females
by country             
2014
Country                 pregnant status     alcohol    opiates    heroin
USA                     pregnant            1,230      4,844      893
                        not pregnant        23,440     12,773     2,005
CANADA                  pregnant            1,094      735        804
                        not pregnant        18,661     5,787      1,050
GERMANY                 pregnant            444        97         203
                        not pregnant        1,007      388        1,375
MEXICO                  pregnant            786        1,456      1,532
                        not pregnant        20,562     2,645      7,594

The original CSV file contains 30 rows (including stuff we don't want at the top and bottom) and 8 columns.

Now suppose I want to ONLY keep all the rows where each row starts with a country with capitalized letters (in other words, I only want the rows that list the country first, and only the "pregnant" data). Here's what I did:

df <- readLines("substancedeaths.csv")
linesTOkeep <- grep("^[A-Z]",df)
mydata <- df[linesTOkeep]
finaltable <- read.table(textConnection(mydata),sep=",")

The original data has 10 countries, with 8 columns (first column is "State", rest are substances). The end goal is to have a data frame with 10 rows and 8 columns. But after running my code, I end up with only 8 rows and 8 columns, it's omitting the USA and CANADA rows, looking like this:

GERMANY                 pregnant            444        97         203
MEXICO                  pregnant            786        1,456      1,532

And so forth. Germany is at the top but USA and CANADA should be. Any ideas what may be happening?

Can you paste the output of `dput(df)` so that we can replicate the problem? — Weihuang Wong, Apr 20 '18 at 21:07
It looks like your error is in the `grep(...)` where you want to be running it over `df$X1` or whatever the first column is. Then you need to subset with `df[linesTOkeep, ]`. Otherwise, it's grepping over the column NAMES, not down the rows. — Brian, Apr 20 '18 at 21:39
Thanks guys, but I figured it out, turns out I just needed grep("^[A-Z]{2}",df) and that solved it. — ROCKaholic, Apr 20 '18 at 22:01

panman · Answer 1 · 2018-04-20T22:13:12.670

0

How about the following:

linesTOkeep <- grep("^[[:upper:]]{3}", df)

mydata <- df[linesTOkeep]

finaltable <- as.data.frame(do.call(rbind, strsplit(mydata, split=" {2,10}")), stringsAsFactors=FALSE)

names(finaltable) <- c("Country", "pregnant_status", "alcohol", "opiates", "heroin")

The third line does the heavy lifting. You can look on the accepted answer in this post.

edited Apr 20 '18 at 22:13

answered Apr 20 '18 at 21:43

panman

1,179
1
13
33

`readLines` returns a vector. Hence `df` is a vector above. That confirms `grep` has been used correctly. – MKR Apr 20 '18 at 21:50
@MKR: Yes, you are right, I overlooked it. Now I changed the post. – panman Apr 20 '18 at 22:15

What am I doing wrong with super basic REGEX code?

1 Answers1