
I'm trying to keep only the first 4 words of a column in my data, while still keeping the observations that have fewer than 4 words.

This is a sample of what some of the data looks like.

State  Company                                        Number of workers
X      FAIRFIELD NURSING AND REHABILITATION CENTER,   99
Y      ATHENAHEALTH                                   24
Z      DRS TEST & ENERGY MANAGEMENT,                  1009
W      AMERICAN APPAREL                               376
C      BERRY PLASTICSPANY -ALENCE SPECIALTY ADHES     67
A      TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE   456

I've used the following code

library(stringr)

df$Company1 <- word(df$Company, 1, 4)

While this gives me a column of 4-word company names, it doesn't work for me because the companies with fewer than 4 words are lost: word() returns NA for those instead.
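A minimal reproduction of the problem, using two made-up strings (the vector `x` is only for illustration and is not part of my real data):

```r
library(stringr)

# Two illustrative strings: one with five words, one with two.
x <- c("alpha beta gamma delta epsilon", "alpha beta")

# Ask for words 1 through 4 of each string.
res <- word(x, 1, 4)

# The five-word string is truncated as intended,
# but the two-word string comes back as NA instead of being kept.
print(res)
```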

So I'm hoping to find a solution that keeps every observation, including those that have only 1 to 4 words.

Phil
bear_525

1 Answer


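You can do this with the following steps.

  1. Split Company into words using str_split() from stringr.
  2. Paste the first four words of each row back together with apply().
  3. Remove the trailing whitespace left behind for rows with fewer than four words.
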
library(stringr)

df <- data.frame(
  State = c("X","Y","Z","W","C","A"),
  Company = c("FAIRFIELD NURSING AND REHABILITATION CENTER",    
  "ATHENAHEALTH",   
  "DRS TEST & ENERGY MANAGEMENT",   
  "AMERICAN APPAREL",   
  "BERRY PLASTICSPANY -ALENCE SPECIALTY ADHES",
  "TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE"),
  number_of_workers = c(99,24,1009,376,67, 456))

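# Split on spaces into a matrix padded with "", and keep the first 4 columns
df$Company1 <- str_split(df$Company, " ", simplify = TRUE)[, 1:4] |>
  apply(1, paste, collapse = " ") |>   # rejoin the words of each row
  trimws(which = "right")              # drop the padding left by short names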

output

[1] "FAIRFIELD NURSING AND REHABILITATION"
[2] "ATHENAHEALTH"                        
[3] "DRS TEST & ENERGY"                   
[4] "AMERICAN APPAREL"                    
[5] "BERRY PLASTICSPANY -ALENCE SPECIALTY"
[6] "TUSCALOOSA RESOURCES , SWANN'S"
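One caveat: indexing the split matrix with `[, 1:4]` assumes at least one company has four or more words, otherwise the matrix has fewer than four columns and the subscript fails. A possible base R variant that avoids this is sketched below. The helper name `first_n_words` is hypothetical (it is not a stringr or base R function), and the sketch assumes words are separated by single spaces and that the input has no missing values.

```r
# Hypothetical helper: keep at most n space-separated words of each string.
first_n_words <- function(x, n = 4) {
  # strsplit() returns a list with one character vector of words per string.
  pieces <- strsplit(x, " ", fixed = TRUE)
  # head() returns the whole vector when it has fewer than n elements,
  # so short names are kept as they are and no padding needs trimming.
  vapply(pieces, function(w) paste(head(w, n), collapse = " "), character(1))
}

company <- c("FAIRFIELD NURSING AND REHABILITATION CENTER",
             "ATHENAHEALTH",
             "AMERICAN APPAREL")
short <- first_n_words(company)
print(short)
```

With the data frame above, this would be used as `df$Company1 <- first_n_words(df$Company)`.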

Created on 2023-04-28 with reprex v2.0.2
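If you prefer not to split at all, a regular expression may also work. This is only a sketch of a different technique (regex capture with base R `sub()`), not the approach used above. It assumes each string starts with a non-space character, and the vector `company` is just illustrative data.

```r
company <- c("DRS TEST & ENERGY MANAGEMENT",
             "ATHENAHEALTH",
             "TUSCALOOSA RESOURCES , SWANN'S CROSSING MINE")

# Capture up to three "word + whitespace" groups followed by one more word,
# then discard everything after the capture.
pattern <- "^((\\S+\\s+){0,3}\\S+).*$"
short <- sub(pattern, "\\1", company, perl = TRUE)
print(short)
```

Strings with four or fewer words match the capture group in full, so they are returned unchanged rather than as NA.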

YH Jang