
I am currently trying to count the number of NAs found in each of my dataset's columns.

I am running the following code:

  function(x, df1, df2, ncp, log = FALSE)

apply(Total_HousingData, 2, function(x) {sum(is.na(x))})

Here is my output:

        Id    MSSubClass      MSZoning   LotFrontage       LotArea        Street 
            0             0             0             0             0             0 
        Alley      LotShape   LandContour     Utilities     LotConfig     LandSlope 
            0             0             0             0             0             0 
 Neighborhood    Condition1    Condition2      BldgType    HouseStyle   OverallQual 
            0             0             0             0             0             0 
  OverallCond     YearBuilt  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st 
            0             0             0             0             0             0 
  Exterior2nd    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
            0             0             0             0             0             0 
     BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1  BsmtFinType2 
            0             0             0             0             1             0 
   BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating     HeatingQC    CentralAir 
            1             1             1             0             0             0 
   Electrical      1stFlrSF      2ndFlrSF  LowQualFinSF     GrLivArea  BsmtFullBath 
            0             0             0             0             0             2 
 BsmtHalfBath      FullBath      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual 
            2             0             0             0             0             0 
 TotRmsAbvGrd    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
            0             0             0             0             0             0 
 GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond    PavedDrive 
            0             1             1             0             0             0 
   WoodDeckSF   OpenPorchSF EnclosedPorch     3SsnPorch   ScreenPorch      PoolArea 
            0             0             0             0             0             0 
       PoolQC         Fence   MiscFeature       MiscVal        MoSold        YrSold 
            0             0             0             0             0             0 
     SaleType SaleCondition     SalePrice 
            0             0          1459

For some reason, NAs are only being counted for the SalePrice variable. When I look at the other variables, they contain plenty of NAs. I tried converting the appropriate variables to factors, but that hasn't fixed the issue.

"Alley" for instance should read 1, but its NA is not being picked up.

Here is a sample of the data:

 Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities
  <dbl>      <dbl> <chr>    <chr>         <dbl> <chr>  <chr> <chr>    <chr>       <chr>    
1     1         60 RL       65             8450 Pave   NA    Reg      Lvl         AllPub   
2     2         20 RL       80             9600 Pave   NA    Reg      Lvl         AllPub   
3     3         60 RL       68            11250 Pave   NA    IR1      Lvl         AllPub   
4     4         70 RL       60             9550 Pave   NA    IR1      Lvl         AllPub   
5     5         60 RL       84            14260 Pave   NA    IR1      Lvl         AllPub   
6     6         50 RL       85            14115 Pave   NA    IR1      Lvl         AllPub   
  • Try `sapply(df, function(colValues) sum(is.na(colValues)))` and interchange `df` with your dataframe. The `sapply`-function automatically loops over the columns if you put in a `data.frame`. – Jonas Feb 19 '21 at 21:58
  • Can you give us at least a sample of your data? I cannot replicate your problem with a mocked data frame. Use `dput(head(Total_HousingData))`. – Jan Feb 19 '21 at 22:45
  • @Jan I just added the head of this data. As you can see, "Alley" has plenty of NAs, but they aren't registering on the is.na search. – Jamie Warren Feb 19 '21 at 23:01
  • @Jonas Unfortunately this produces the same output. I posted some of the data I'm using if that helps at all. – Jamie Warren Feb 19 '21 at 23:02
  • Why do you accept an answer that does not work? ... anyway, could it be, by any chance, that the `character` column `Alley` contains `"NA"` as a string and not `NA`? – Jan Feb 20 '21 at 01:13
  • @Jan that fixed it. I had "NA" instead of NA. The answer below fixed the primary issue I was having, but the string issue was preventing it from executing. – Jamie Warren Feb 20 '21 at 21:29
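
The distinction raised in the comments above, between the string `"NA"` and a real missing value `NA`, can be sketched as follows. This is a minimal illustration and not the asker's actual fix: the `housing` data frame and its values are made up, and only the column names are borrowed from the question.

```r
# Hypothetical miniature of the data: Alley holds the text "NA",
# while SalePrice holds real missing values.
housing <- data.frame(
  Id = 1:4,
  Alley = c("NA", "Pave", "NA", "Grvl"),
  SalePrice = c(208500, NA, 223500, NA),
  stringsAsFactors = FALSE
)

# is.na() does not treat the string "NA" as missing,
# so Alley is counted as having no NAs here.
na_before <- colSums(is.na(housing))

# Replace the literal string "NA" with a real missing value,
# touching character columns only.
is_chr <- vapply(housing, is.character, logical(1))
housing[is_chr] <- lapply(housing[is_chr], function(col) {
  col[!is.na(col) & col == "NA"] <- NA_character_
  col
})

# After the conversion, the Alley entries are counted as missing.
na_after <- colSums(is.na(housing))

print(na_before)
print(na_after)
```

Checking a suspect column with something like `sum(housing$Alley == "NA", na.rm = TRUE)` before converting is a quick way to confirm whether string `"NA"` values are the cause.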

2 Answers


Try sapply. This is the one-liner I use, where df is your data frame.

sapply(df, function(x) sum(is.na(x)))
Colin H

Another solution uses colSums(). is.na(df) returns a logical matrix with the same dimensions as the data frame, where each cell is TRUE if the corresponding value is NA. colSums() then adds up the TRUE values in each column.

Total_HousingData <- data.frame(A = c(1, 2, NA, NA, NA), B = c(1, NA, 3, 4, 5), C = c(NA, 2, 3, NA, 5))

colSums(is.na(Total_HousingData))
#> A B C 
#> 3 1 2

Created on 2021-02-20 by the reprex package (v1.0.0)

Jan