1

I banged my head for the last couple of hours but still unable to resolve this ...

I am trying to write a R function which takes a dataframe name and a column name as variables and tries to return a dataframe with all distinct values for the column specified, minus any NA or "N/A" values.

Here is my function,

getDistinctColValues <- function(dataset, colname, removeNA = FALSE) {

  colname <- as.name(colname)
  retVector <- dataset %>% distinct_(colname)

  # Not working!
  if (removeNA == TRUE)
  {
    retVector <- filter_(retVector, colname !=  "N/A" | !is.null(colname))
  }
  return(retVector)
}

Here is a sample output (see the N/A):

> getDistinctColValues(dataTY, "SomeColumn", TRUE)
  SomeColumn
1          BR
2          ET
3          SG
4          BV
5         N/A
6          MN
7          SP

This filter is not working. na.omit won't work because there are "N/A" strings. I am not clear about opting out of NSE. I am using the lazyeval package but without understanding it in-depth.

Any help will be greatly appreciated.

Solution (as guided by @aosmith):

getDistinctColValues <- function(dataset, colname, removeNA = FALSE) {

  colname <- as.name(colname)
  retVector <- dataset %>% distinct_(colname)
  if (removeNA == TRUE)
  {
    filter_criteria <- interp(~v!="N/A", v=as.name(colname))
    print(filter_criteria)
    retVector <- retVector %>% filter_(filter_criteria)
  }
  return(retVector)
}
KalC
  • 1,530
  • 3
  • 22
  • 33
  • 2
    I believe you'll need `lazyeval::interp`. See the answer [here](http://stackoverflow.com/questions/26492280/non-standard-evaluation-nse-in-dplyrs-filter-pulling-data-from-mysql) – aosmith Nov 23 '16 at 22:17

1 Answers1

1

You should indicate "N/A" to be an actual NA value when reading in your data using the read.csv(your_data, na.strings = c("N/A")) or readr::read_csv(your_data, na = c("N/A"))

You could also use gsub(pattern = "N/A", replacement = NA, your_data)

Then you can just use filter(your_data, is.na(SomeColumn)==F)

For example:

# Dummy data
your_data <- data_frame(var1 = c('A','B','C','D'),
                    SomeColumn = c('ET','AB','N/A', 'TC'))

# Replace 'N/A' with NA
your_data$SomeColumn <- gsub(pattern = 'N/A',replacement = NA, your_data$SomeColumn)

# Filter out NAs in selected column
your_data %>%
select(SomeColumn) %>%
filter(is.na(SomeColumn) == F)

# A tibble: 3 × 1
  SomeColumn
       <chr>
1         ET
2         AB
3         TC

If for some reason you need to keep the value as "N/A", you can change your filter command to filter(SomeColumn != "N/A")

TClavelle
  • 578
  • 4
  • 12
  • The challenge was to identify the column name (which is a variable). The solution pointed to by @aosmith worked. But I totally like your idea of replacing "N/A" with NA. Will use it in the future. Thanks! – KalC Nov 23 '16 at 22:35
  • Ah that makes more sense and is a much better question. I misread the question to be about filtering out NAs in a desired column and was pointing out you don't need to write your own function for that. – TClavelle Nov 23 '16 at 22:43