1

So I have been using a excel for manual data entry for a CSV file which I am reading into R Studio using read_csv.

It is reading in fine, my date is in this format 31/07/2013

so I used this to convert it

patients$Date <- as.Date(patients$Date, "%d/%m/%Y")

This all seems straightforward. However, for the second data set running it doesn't convert all values and converts some to NA.

Part of this stems from Excel's in built number format. Which a run date in three ways

  1. General : Days counted 1900/01/01
  2. Short : 31/07/2013
  3. Long : July 31 2013

Now thing is, this time I left it as it was. So I expecting that I should haven an issues. My Data set is a few thousands rows long and as.Date()appears to have non-systematically let some values out. Manually, the changes in excel are not manifesting themselves in R. How can I find these values and manually change them within R.

I have also put two columns of Year and Month next to it for reference (as I thought I may have this problem this time).

   Year Month   Date    Site    Category    Services    Count
2013    July    31/07/2013  K   Test    Laboratory  1642
2013    July    31/07/2013  K   Test    X-Ray   16

I do not require in the day in my current format of 31/07/2013 so could work with 07/2013

So I use this to filter my data

patients %>%
  group_by(Date)%>%
  filter(Category == "Registration", Site == "K")

So I get the following output

Year Month  Date       Site  Category     Services                Count
   <int> <chr>  <chr>      <chr> <chr>        <chr>                   <int>
 1  2013 July   31/07/2013 K   Registration Old Registration Cld       25
 2  2013 July   31/07/2013 K   Registration Old Registration Male     155
 3  2013 July   31/07/2013 K   Registration Old Registration Female   445
 4  2013 July   31/07/2013 K   Registration New Registration Child     24
 5  2013 July   31/07/2013 K   Registration New Registration Male     106

And once I Apply patients$Date <- as.Date(patients$Date, "%d/%m/%Y") some of they show as below...BUT not all of them as the column type in the tibble has been changed to Date So it has worked to some extent.

2017 June  NA         K   Registration Old Registration Female   351
2  2017 June  NA      K   Registration New Registration Child     20
3  2017 June  NA      K   Registration New Registration Male      66
4  2017 June  NA      K   Registration New Registration Female   117
5  2017 June  NA      K   Registration Staff Registration         37
6  2017 June  NA      K   Registration Free  Registration          7
> 

Is there a way of sorting this?

And what software should I use in future for managing CSV's Open Office's calc seems to be no better. Do I need just a CSV manager to avoid Excel causing problems?

Help would be much appreciated

Cheers


Asif LalDin
  • 111
  • 1
  • 8
  • Hey Asif, welcome to StackOverflow. Please provide a reproducible example so we can see the problem. You can subset the dataframe where the dates are causing problems and use dput() to create code for reproducing the data.frame. – Stanislaus Stadlmann Oct 04 '18 at 13:32

1 Answers1

0

Hi All(the two people who have read this ha!)

So I have figured out the issue. Very simple mistake.

However, first I'll start off with a short poem

Thirty days hath September, April, November, and June; All the rest have thirty-one, Excepting February alone, And that has twenty-eight days clear And twenty-nine in each leap year.

For some reason Excel was putting all my dates to be the 31st of the month. What I should have realised was that the dates converted to NA were in a systematic manner pertaining to months that do not have 31 days.

Asif LalDin
  • 111
  • 1
  • 8