Keep the first 4 words in a column

Question

I'm trying to keep only the first 4 words of a column in my data, while still keeping the observations that have fewer than 4 words.

This is a sample of what some of the data looks like.

State	Company	Number of workers
X	FAIRFIELD NURSING AND REHABILITATION CENTER,	99
Y	ATHENAHEALTH	24
Z	DRS TEST & ENERGY MANAGEMENT,	1009
W	AMERICAN APPAREL	376
C	BERRY PLASTICSPANY -ALENCE SPECIALTY ADHES	67
A	TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE	456

I've used the following code:

library(stringr)

df$Company1 <- word(df$Company, 1, 4)

While this produces a column of 4-word company names, it does not work for me because it drops the companies that have fewer than 4 words, returning NA for those instead.

So I'm hoping to find a solution that keeps every observation that has 1 to 4 words.

score 1 · Accepted Answer · answered Apr 27 '23 at 21:49

You can do this in three steps:

Split Company into words with str_split() from stringr and keep the first four.
Paste the words of each row back together with apply().
Trim the trailing whitespace left on the right side.

library(stringr)

df <- data.frame(
  State = c("X","Y","Z","W","C","A"),
  Company = c("FAIRFIELD NURSING AND REHABILITATION CENTER",
              "ATHENAHEALTH",
              "DRS TEST & ENERGY MANAGEMENT",
              "AMERICAN APPAREL",
              "BERRY PLASTICSPANY -ALENCE SPECIALTY ADHES",
              "TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE"),
  number_of_workers = c(99, 24, 1009, 376, 67, 456))

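# Split on spaces into a padded character matrix and keep the first 4 columns
df$Company1 <- str_split(df$Company, " ", simplify = TRUE)[, 1:4] |>
  apply(1, paste, collapse = " ") |>  # paste each row back into one string
  trimws(which = "right")             # drop the padding left by short names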

output

[1] "FAIRFIELD NURSING AND REHABILITATION"
[2] "ATHENAHEALTH"                        
[3] "DRS TEST & ENERGY"                   
[4] "AMERICAN APPAREL"                    
[5] "BERRY PLASTICSPANY -ALENCE SPECIALTY"
[6] "TUSCALOOSA RESOURCES , SWANN'S"

Created on 2023-04-28 with reprex v2.0.2
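
Note that the `[, 1:4]` indexing above relies on at least one company name having four or more words, since the matrix returned by `str_split(..., simplify = TRUE)` only has as many columns as the longest name. As a hedged alternative, here is a minimal base R sketch that avoids the padded matrix altogether. The helper name `first_n_words` and the shortened `company` vector are made up for illustration and are not part of stringr or of the original answer.

```r
# Hypothetical helper (not from stringr): keep at most `n` space-separated words.
# Assumes words are separated by single spaces, as in the sample data.
first_n_words <- function(x, n = 4) {
  vapply(
    strsplit(as.character(x), " ", fixed = TRUE),
    # head() returns all the words when there are fewer than n of them
    function(words) paste(head(words, n), collapse = " "),
    character(1)
  )
}

# Illustrative subset of the company names from the question
company <- c("FAIRFIELD NURSING AND REHABILITATION CENTER",
             "ATHENAHEALTH",
             "AMERICAN APPAREL")

first_n_words(company)
# [1] "FAIRFIELD NURSING AND REHABILITATION" "ATHENAHEALTH"
# [3] "AMERICAN APPAREL"
```

With the data frame from the answer this would be used as `df$Company1 <- first_n_words(df$Company)`. One caveat: a missing value in the column comes back as the string "NA" rather than `NA`, so missing entries would need to be handled separately.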
