
I am reading in a data frame from an online CSV file, but the person who created the file accidentally entered some numbers into a column that should contain only city names. Sample of the cities.data table:

City        Population   Foo   Bar
Seattle     10           foo1  bar1
98125       20           foo2  bar2
Kent 98042  30           foo3  bar3
98042 Kent  30           foo4  bar4

Desired output after removing rows with only numbers in the city column:

City        Population   Foo   Bar
Seattle     10           foo1  bar1
Kent 98042  30           foo3  bar3
98042 Kent  30           foo4  bar4

I want to remove the rows with ONLY numbers in the city column. "Kent 98042" and "98042 Kent" are both okay since they contain a city name, but since "98125" is not a city name, I remove that row.

I can't use is.numeric because the numbers are being read as strings from the csv file. I tried using regex:

cities.data <- cities.data[which(grepl("[0-9]+", cities.data$City) == FALSE), ]

But this deletes rows containing any numbers rather than just the ones containing only numbers, e.g.

City        Population   Foo   Bar
Seattle     10           foo1  bar1

"Kent 98042" was deleted even though I wanted to keep that row. Suggestions? Please and thanks!

sushi
  • Do you want to keep those numbers after the text? For example, in your desired output does the third observation have to be "Kent 98042" or is "Kent" fine? – cparmstrong Dec 01 '17 at 22:23
  • Try adding the mandatory start of the string `grepl("^[0-9]+", cities.data)` – storaged Dec 01 '17 at 22:25
  • I don't need the number after the city; just "Kent" would be fine, but I thought doing that might be harder. Oh, and sorry, I forgot to include an example where the number comes before the city, e.g. "98042 Kent", in which case the "^" would not work because it eliminates that row. – sushi Dec 01 '17 at 22:31
  • @sushi I just added a solution below that solves it while dropping all numbers, and it should also handle the before-text case as well. – cparmstrong Dec 01 '17 at 22:34

3 Answers

df = read.table(text = "
City        Population   Foo   Bar
Seattle     10           foo1  bar1
98125       20           foo2  bar2
Kent98042   30           foo3  bar3
", header = TRUE, stringsAsFactors = FALSE)

library(dplyr)

df %>% filter(is.na(as.numeric(City)))

#        City Population  Foo  Bar
# 1   Seattle         10 foo1 bar1
# 2 Kent98042         30 foo3 bar3

The idea is that when we apply as.numeric to a character value, it returns NA unless the value consists only of a number.

If you want to use base R you can use this: df[is.na(as.numeric(df$City)),]
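
Note that as.numeric emits an "NAs introduced by coercion" warning for the strings it cannot convert; the warning is harmless here, but it can be silenced with suppressWarnings. A minimal sketch of the coercion behaviour:

# as.numeric returns NA for anything that is not purely numeric
as.numeric(c("Seattle", "98125", "Kent98042"))
# Warning: NAs introduced by coercion
# [1]    NA 98125    NA

# base R version with the warning suppressed
df[suppressWarnings(is.na(as.numeric(df$City))), ]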

AntoniosK

If you don't need the numbers in the city column at all:

# replace all numbers with empty string
cities.data$City <- gsub("[0-9]+", "", cities.data$City) 
# drop observations that are only empty strings
cities.data <- cities.data[cities.data$City!="",]  

edit: This should handle all cases in your updated example where numbers can be anywhere in the string.
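
One caveat with this approach: after the gsub step, values like "Kent 98042" and "98042 Kent" keep a stray trailing or leading space ("Kent " / " Kent"). If that matters, trimws can clean it up; a minimal sketch building on the lines above:

# strip the digits, then trim any leftover leading/trailing whitespace
cities.data$City <- trimws(gsub("[0-9]+", "", cities.data$City))
# drop observations that are now empty strings
cities.data <- cities.data[cities.data$City != "", ]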

cparmstrong

With plain R:

df <- data.frame(City = c('Seattle', '98125', 'Kent 98042'),
                 Population = c(10, 20, 30),
                 Foo = c('foo1', 'foo2', 'foo3'))
df2 <- df[-grep('^\\d+$', df$City),]
df2

This yields

        City Population  Foo
1    Seattle         10 foo1
3 Kent 98042         30 foo3


The idea is to look for ^\d+$ (only numbers) and delete these from the set. Mind the anchors on both sides.
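
One edge case worth knowing about the negative-index form: if no city happens to be digits only, grep returns integer(0), and df[-integer(0), ] drops every row. Negating grepl instead is safe even when nothing matches; a minimal sketch:

# keep rows whose City is NOT digits only; works even with zero matches
df2 <- df[!grepl('^\\d+$', df$City), ]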
Jan
  • Thanks! I'm not very good at regex and didn't know I could anchor the ends of the expression. – sushi Dec 01 '17 at 22:40