
I was starting to run some statistics on a data frame that I read in with the read_xls function from the readxl package (version 1.1.0) when I realized R wasn't reading one of the columns the way I wanted. That column has a great many blank cells in the Excel spreadsheet, which, after some research, I believe is the issue. After the long run of blank rows, the column does contain the numeric values I need for my analysis in R. However, when I read it in with read_xls, the column comes back with class logical and all NAs. After going over the readxl documentation it seems clear that the problem is caused by the blanks in the column, but I am still unsure how to solve it, since only one of the columns has blanks at the beginning of the dataset. The column that is giving me issues is Rep_Val_Quantity_Avg. I would appreciate any help or guidance! Thanks!

dput of data:

 dput(head(df_trib,10))
structure(list(NJPDES = c("NJ0020206", "NJ0020532", "NJ0021326", 
"NJ0022021", "NJ0022985", "NJ0023361", "NJ0023736", "NJ0024015", 
"NJ0024031", "NJ0024040"), Facility_Name = c("ALLENTOWN BORO WWTP", 
"HARRISON TWP MULLICA HILL WWTP", "MEDFORD LAKES BOROUGH STP", 
"SWEDESBORO WTP", "WRIGHTSTOWN BOROUGH STP", "WILLINGBORO WATER POLLUTION CONTROL PLANT", 
"PINELANDS WASTEWATER CO", "MOUNT HOLLY WPCF", "ELMWOOD WTP", 
"WOODSTREAM STP"), `Monitored Location Designator` = c("001A", 
"001A", "001A", "001A", "001A", "001A", "001A", "001A", "001A", 
"001A"), Date = structure(c(1372550400, 1372550400, 1372550400, 
1372550400, 1372550400, 1372550400, 1372550400, 1372550400, 1372550400, 
1372550400), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Parameter_Number_DMR = c("00300", "00300", "00300", "00300", 
    "00300", "00300", "00300", "00300", "00300", "00300"), Parameter = c("Oxygen, Dissolved (DO)", 
    "Oxygen, Dissolved (DO)", "Oxygen, Dissolved (DO)", "Oxygen, Dissolved (DO)", 
    "Oxygen, Dissolved (DO)", "Oxygen, Dissolved (DO)", "Oxygen, Dissolved (DO)", 
    "Oxygen, Dissolved (DO)", "Oxygen, Dissolved (DO)", "Oxygen, Dissolved (DO)"
    ), Sample_Point_Desc = c("Effluent Gross Value", "Effluent Gross Value", 
    "Effluent Gross Value", "Effluent Gross Value", "Effluent Gross Value", 
    "Effluent Gross Value", "Effluent Gross Value", "Effluent Gross Value", 
    "Effluent Gross Value", "Effluent Gross Value"), Rep_Val_Quantity_Avg = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA), X__1 = c(NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA), `Reported Value Quantity Maximum` = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA), `Quantity Units Description` = c(NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA), Rep_Val_Con_Min = c("7.2", 
    NA, "7.65", "6.79", NA, NA, "6", NA, "6.6", NA), Val_Con_AVG = c("7.3", 
    "8.8", NA, "7.58", "7.5", "7.100", "5", "7.8", "6.6", "7.4"
    ), Rep_Val_Con_Max = c(NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, NA_character_), valunit = c("MILLIGRAMS PER LITER", 
    "MILLIGRAMS PER LITER", "MILLIGRAMS PER LITER", "MILLIGRAMS PER LITER", 
    "MILLIGRAMS PER LITER", "MILLIGRAMS PER LITER", "MILLIGRAMS PER LITER", 
    "MILLIGRAMS PER LITER", "MILLIGRAMS PER LITER", "MILLIGRAMS PER LITER"
    )), .Names = c("NJPDES", "Facility_Name", "Monitored Location Designator", 
"Date", "Parameter_Number_DMR", "Parameter", "Sample_Point_Desc", 
"Rep_Val_Quantity_Avg", "X__1", "Reported Value Quantity Maximum", 
"Quantity Units Description", "Rep_Val_Con_Min", "Val_Con_AVG", 
"Rep_Val_Con_Max", "valunit"), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Code Used:

df_trib <- read_xls("4_Del_Tribs_ DMR data all pull for certain params.xls",
                    sheet = "NJEMS DATA", col_names = TRUE,
                    col_types = c("text", "text", "text", "date", "text", "text", "text",
                                  "numeric", "numeric", "text", "numeric",
                                  "numeric", "numeric", "text", "text"))
NBE
  • Have you tried using the `col_types =` option in read_xls to specify the variable type of each column? – Dave2e Jul 31 '18 at 20:02
  • Specifically, it is usually easiest to use the catch-all character type to read in messy columns (a sketch of this approach appears after these comments). – Calum You Jul 31 '18 at 20:18
  • @Dave2e Yes, I tried the `col_types =` option, but that doesn't solve the issue. As stated in the question, I think it has to do with the large number of blanks before the first data point in that one column. Do you have any suggestions on how to deal with that? – NBE Aug 01 '18 at 13:24
  • 2
    Maybe changing the `guess_max =` option. Without a better description of the Excel file and the code you used, it is difficult to reproduce the problem you are experiencing and provide any meaningful help. – Dave2e Aug 01 '18 at 13:49
  • @Dave2e I added the code I'm using. I'm not sure what better description I can give of the Excel file. The column that is giving me a problem is Rep_Val_Quantity_Avg, because it has blank cells for the first 1000 rows. I just don't understand how to address the issue in R, being fairly new to the software. – NBE Aug 01 '18 at 13:53
  • @Dave2e Since it has blank cells for the first 1000 rows, the entire column is read as NA, which I don't want, because there are valuable data points needed for the analysis after row 1000. – NBE Aug 01 '18 at 13:55
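
A minimal sketch of the "read everything as text, then convert" approach suggested in the comments above, reusing the file, sheet, and column names from the question:

library(readxl)

# Force every column to text so nothing is guessed from the blank cells.
# (The Date column will also come in as text and need its own conversion.)
df_trib <- read_xls("4_Del_Tribs_ DMR data all pull for certain params.xls",
                    sheet = "NJEMS DATA", col_names = TRUE,
                    col_types = "text")

# Convert the problem column back to numeric afterwards; blank cells stay NA.
df_trib$Rep_Val_Quantity_Avg <- as.numeric(df_trib$Rep_Val_Quantity_Avg)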

1 Answer


In the function read_excel (and read_xls), the guess_max argument defaults to at most 1000 rows, so column types are guessed from only the first 1000 rows. You can force read_excel to look beyond the first 1000 rows and get to the actual data points by making guess_max larger.
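
A minimal sketch based on the call from the question; 5000 is just an illustrative value of guess_max large enough to reach the first non-blank cell, so pick whatever covers your blank rows:

library(readxl)

# Look at up to the first 5000 rows when guessing column types, so the
# numeric values that appear after the blank rows are taken into account.
df_trib <- read_xls("4_Del_Tribs_ DMR data all pull for certain params.xls",
                    sheet = "NJEMS DATA", col_names = TRUE,
                    guess_max = 5000)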

Lambda Moses
  • Not sure why I would use guess_max when I already supply the col_types argument with the variable type for each column? Can you explain? – NBE Aug 06 '18 at 14:17
  • Technically you don't really need guess_max if you specify col_types; I just wondered whether using guess_max might help. I tried to reproduce your problem with a toy example (all NAs except one number in the 1200th row, sketched below), but couldn't; that column was read correctly as double in my case, though a double NA is printed like a logical NA in tibbles. Probably there's some other problem. – Lambda Moses Aug 06 '18 at 18:46
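
A minimal sketch of the kind of toy reproduction described in the last comment, assuming the writexl package is available to write a temporary file; the file name and the value 7.2 are made up for illustration:

library(writexl)
library(readxl)

# One numeric value in row 1200, everything before it blank (NA).
toy <- data.frame(Rep_Val_Quantity_Avg = c(rep(NA_real_, 1199), 7.2))
write_xlsx(toy, "toy.xlsx")

# Default guess_max looks at only the first 1000 rows, which are all blank.
read_excel("toy.xlsx")

# Forcing the guess past the blank rows, or specifying col_types = "numeric",
# should make the column come back as a double.
read_excel("toy.xlsx", guess_max = 1200)
read_excel("toy.xlsx", col_types = "numeric")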