I'm writing a script to get information from Baseball Reference web pages. The first time I wrote the code it worked fine, and all the dates stored as factors were correctly parsed to dates with the as.Date() function. However, a day later I ran the same script and I'm now getting NAs for some of the dates in one variable, while others are converted correctly. In another factor variable, all of the values are returned as NA.

I've googled it, but I could only find issues about NAs caused by a missing day in the value (only month and year).

I've also tried changing the locale from Portugal to US with Sys.setlocale("LC_ALL","English"), but I get the same result.

The script I used is below. Do you have any hint about what's missing?

Thanks.

library(XML)
Sys.setlocale("LC_ALL","English") # Used after first attempt


# Web page with players
url = "http://www.baseball-reference.com/bio/Venezuela_born.shtml"

# Create a list of the data frames found in the web page and define the column types
url_Tables = readHTMLTable(url
                          ,stringAsFactors = FALSE
                          ,colClasses=c("integer","character",rep("integer",17)
                                        ,rep("numeric", 4),"factor","factor"
                                        , "character", "character")
                          )

# Assign the first table of the web page to a data frame
batting = url_Tables[[1]]

summary(batting)

# Change the type of some columns
batting$Birthdate = as.Date(batting$Birthdate, "%b %d, %Y")    # Some values in this column are parsed correctly; others become NA
batting$Debut = as.Date(batting$Debut, "%b %d, %Y")     # All values in this column are converted to NA
darh78
  • Why are you representing `Date` objects as `factor`s in the first place? – nrussell May 13 '16 at 18:59
  • I can't reproduce your code. Please try adding the result of `dput(batting)` to your question. – Raphael K May 13 '16 at 19:22
  • I can reproduce the code, but not the error. All dates, `batting$Birthdate` and `batting$Debut` are fine on my system. There is no `NA` value. – RHertel May 13 '16 at 20:27
  • You didn't mention the OS. Maybe try `Sys.setlocale("LC_TIME", "en_US.UTF-8")` if you're working on Linux, or `Sys.setlocale("LC_TIME","English")` on Windows. The syntax of `LC_ALL` can be system-specific. See `?Sys.setlocale` for more information. – RHertel May 14 '16 at 00:17
  • Thanks @nrussell for your comments. I did pass the "stringAsFactors = FALSE" argument, but some variables were still treated as factors. – darh78 May 14 '16 at 09:24
  • Hey @RHertel, thanks for the tip. Indeed, the lubridate function "mdy" worked, but I ran into the same problem again when I reopened the script another day. It's now fixed if I run Sys.setlocale("LC_TIME", "English"). My OS is Windows 7. – darh78 May 14 '16 at 17:35
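
The locale fix discussed in the comments can be sketched as follows. This is only a minimal illustration, not the asker's actual data: the `debut` values are hypothetical samples in the same "%b %d, %Y" layout. The point is that `%b` is matched against the month abbreviations of the current `LC_TIME` locale, so English abbreviations such as "Apr" or "Sep" may fail to parse in a non-English session. The sketch assumes that the portable "C" locale (which uses English month names) is acceptable for parsing, and it restores the previous setting afterwards.

```r
# Hypothetical sample values in the "%b %d, %Y" format used in the question
debut <- factor(c("Apr 23, 1939", "Sep 2, 2008", "May 13, 2016"))

# Remember the current LC_TIME setting so it can be restored afterwards
old_locale <- Sys.getlocale("LC_TIME")

# The "C" locale uses English month abbreviations on every platform
Sys.setlocale("LC_TIME", "C")
parsed <- as.Date(as.character(debut), format = "%b %d, %Y")

# Restore the previous locale
Sys.setlocale("LC_TIME", old_locale)

print(parsed)
```

Changing only `LC_TIME` rather than `LC_ALL` leaves number formatting, collation and character handling untouched, which is why it is the narrower choice here.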

1 Answer

Try installing the lubridate package; it is very useful for all date-time operations:

library(lubridate)
mdy(batting$Debut)
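
Since `mdy()` takes its default `locale` from the session's `LC_TIME` setting, it can be affected by the same locale issue. A hedged sketch, assuming lubridate is installed and that passing `locale = "C"` is acceptable for matching English month names; the `debut` values are hypothetical samples, not the asker's data:

```r
# Hypothetical sample values in the same "%b %d, %Y" layout as the question
debut <- factor(c("Apr 23, 1939", "Sep 2, 2008"))

if (requireNamespace("lubridate", quietly = TRUE)) {
  # locale = "C" asks lubridate to match English month names
  # instead of those of the session's LC_TIME setting
  parsed <- lubridate::mdy(as.character(debut), locale = "C")
  print(parsed)
} else {
  message("Package 'lubridate' is not installed")
}
```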
Eric Lecoutre