I have a fixed-width .txt file and am using RStudio. The first few lines look like this:

200416657210340 1665721 20040608 20090930 20060910 20070910 20080827 20090804
200416657210345 1665721 20040907 20090203 20070331 20080719                  
200416657210347 1665721 20040914 20091026 20070213 20080114 20090302         
200416657210352 1665721 20041111 20100315 20070123 20071205          20081202

I am trying to read in the .txt file using read.fwf :

gripalisti <- read.fwf(file = "gripalisti.txt",
                         widths = c(15,8,9,9,9,9,9,9),
                         header = FALSE,
                         #stringsAsFactors = FALSE, 
                       col.names = c("einst","bu","faeding","forgun","burdur1",
                                     "burdur2","burdur3","burdur4"))

This works and the columns are the correct length. However, "einst" and "bu" are supposed to be integer values and the rest are supposed to be dates.

When imported all the values in the first column (ID variables) look like this:

2.003140e+14

I have been searching for a way to change the imported column to integer (or character?) values, and I have not found anything that does not result in an error. Here is one example that I tried after a Google search:

gripalisti <- read.fwf(file = "gripalisti.txt",
                         widths = c(15,8,9,9,9,9,9,9),
                         header = FALSE,
                         #stringsAsFactors = FALSE, 
                       col.names = c("einst","bu","faeding","forgun","burdur1",
                                     "burdur2","burdur3","burdur4"),
                       colclasses = c("integer", "integer", "Date", "Date",
                                      "Date", "Date", "Date", "Date"))

results in the error:

Error in read.table(file = FILE, header = header, sep = sep, row.names = row.names,  : 
  unused argument (colclasses = c("integer", "integer", "Date", "Date", "Date", "Date", "Date", "Date"))

The dataset has over 100,000 lines and many missing values, so other ways of importing have not worked for me. The dataset is NOT tab delimited.

Sorry if this is obvious, I am a very new R user.

edit:

Thanks for the help, I changed it to:

 colClasses = c("character", 

And now it looks good.

Thordis
  • 4
    use `colClasses` instead of `colclasses` ie the `C` is capitalized – Onyambu May 30 '21 at 17:47
  • 2
    A value like 2.003140e+14 cannot be represented as a 32bit integer. See help file `?integer` that tells us: "that current implementations of R use 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9: doubles can hold much larger integers exactly." – tpetzoldt May 30 '21 at 17:49
  • 2
    Continuing on @tpetzoldt's comment, the largest integer that R currently supports natively is `.Machine$integer.max`, which on my machine resolves to `2,147,483,647` (commas added for scale); compared with your `200,416,657,210,340`, your numbers are *five* orders of magnitude too large, so R has to resort to using floating-point to store/represent them. The [`bit64`](https://cran.r-project.org/web/packages/bit64/index.html) package can be useful, though it is not supported in every R transaction. – r2evans May 30 '21 at 17:53
  • 2
    Perhaps this may be moot: many FWF files I've seen use numbers like that for IDs or similar, not for literal numbers, so it might be better to read them in as `character` instead of `numeric` (and `integer` will fail, as stated). So I suggest you use `colClasses` (capital `C`, as Onyambu suggested), and force the cast of that first field to `"character"`. – r2evans May 30 '21 at 17:56

2 Answers

1

As suggested in the comments:

  1. it is colClasses=, not colclasses=, typo;
  2. that first field cannot be stored as "integer", it must either be "numeric" or "character";
  3. (additionally) those dates are not in the default format of %Y-%m-%d, you will need to convert them after reading in the data.
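To make point 2 concrete, here is a minimal sketch in base R of what happens to one ID from the sample data under each storage class. The variable names (`int_max`, `id_text`, `id_int`, `id_num`) are only illustrative and are not part of the question.

```r
# Largest value an R integer vector can hold (32-bit signed)
int_max <- .Machine$integer.max

id_text <- "200416657210340"   # first ID from the sample data

# Coercing to integer is out of range and yields NA
# (R also emits a warning, suppressed here)
id_int <- suppressWarnings(as.integer(id_text))

# A double can hold this value exactly because it is below 2^53
id_num <- as.numeric(id_text)

print(int_max)
print(is.na(id_int))
print(id_num > int_max)
```

If the IDs are only ever used as labels and never in arithmetic, keeping them as "character" avoids both the range limit and any printing surprises.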

Prep:

writeLines("200416657210340 1665721 20040608 20090930 20060910 20070910 20080827 20090804\n200416657210345 1665721 20040907 20090203 20070331 20080719                  \n200416657210347 1665721 20040914 20091026 20070213 20080114 20090302         \n200416657210352 1665721 20041111 20100315 20070123 20071205          20081202",
           con = "gripalisti.txt")

Execution:

dat <- read.fwf("gripalisti.txt", widths = c(15,8,9,9,9,9,9,9), header = FALSE,
                col.names = c("einst","bu","faeding","forgun","burdur1", "burdur2","burdur3","burdur4"),
                colClasses = c("character", "integer", "character", "character", "character", "character", "character", "character"))
str(dat)
# 'data.frame': 4 obs. of  8 variables:
#  $ einst  : chr  "200416657210340" "200416657210345" "200416657210347" "200416657210352"
#  $ bu     : int  1665721 1665721 1665721 1665721
#  $ faeding: chr  " 20040608" " 20040907" " 20040914" " 20041111"
#  $ forgun : chr  " 20090930" " 20090203" " 20091026" " 20100315"
#  $ burdur1: chr  " 20060910" " 20070331" " 20070213" " 20070123"
#  $ burdur2: chr  " 20070910" " 20080719" " 20080114" " 20071205"
#  $ burdur3: chr  " 20080827" "         " " 20090302" "         "
#  $ burdur4: chr  " 20090804" "         " "         " " 20081202"

dat[,3:8] <- lapply(dat[,3:8], as.Date, format = "%Y%m%d")
dat
#             einst      bu    faeding     forgun    burdur1    burdur2    burdur3    burdur4
# 1 200416657210340 1665721 2004-06-08 2009-09-30 2006-09-10 2007-09-10 2008-08-27 2009-08-04
# 2 200416657210345 1665721 2004-09-07 2009-02-03 2007-03-31 2008-07-19       <NA>       <NA>
# 3 200416657210347 1665721 2004-09-14 2009-10-26 2007-02-13 2008-01-14 2009-03-02       <NA>
# 4 200416657210352 1665721 2004-11-11 2010-03-15 2007-01-23 2007-12-05       <NA> 2008-12-02

str(dat)
# 'data.frame': 4 obs. of  8 variables:
#  $ einst  : chr  "200416657210340" "200416657210345" "200416657210347" "200416657210352"
#  $ bu     : int  1665721 1665721 1665721 1665721
#  $ faeding: Date, format: "2004-06-08" "2004-09-07" "2004-09-14" "2004-11-11"
#  $ forgun : Date, format: "2009-09-30" "2009-02-03" "2009-10-26" "2010-03-15"
#  $ burdur1: Date, format: "2006-09-10" "2007-03-31" "2007-02-13" "2007-01-23"
#  $ burdur2: Date, format: "2007-09-10" "2008-07-19" "2008-01-14" "2007-12-05"
#  $ burdur3: Date, format: "2008-08-27" NA "2009-03-02" NA
#  $ burdur4: Date, format: "2009-08-04" NA NA "2008-12-02"
r2evans
0

The number in the first column is very large, so if you import it as numeric it will automatically be shown in exponent (scientific) notation. One way to resolve this is to set scipen before reading the file. Use the code below:

options(scipen = 999)

enter image description here

I think this should resolve your problem.

Below is the code I ran. Of course, the date columns still need some work; for that you can use a simple command like as.Date(gripalisti$burdur1, format = "%Y%m%d")

enter image description here
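Since the screenshots are not available as text, here is a small hedged sketch of what `scipen` does and does not change. It assumes R's default `digits = 7` (set explicitly below), and the names `x`, `shown_default`, and `shown_fixed` are hypothetical. The option only influences how a number is formatted for display; the storage class of the value stays "numeric".

```r
# Save current options and start from R's defaults for a predictable comparison
old_opts <- options(scipen = 0, digits = 7)

x <- 200416657210340          # an ID read in as numeric (double)

shown_default <- format(x)    # formatting with the default penalty
options(scipen = 999)         # strongly prefer fixed notation
shown_fixed <- format(x)      # same value, different rendering

options(old_opts)             # restore the user's settings

print(shown_default)
print(shown_fixed)
print(class(x))               # storage class is unchanged by scipen
```

So this addresses the display concern only; if the column needs to be stored as something other than a double, the class has to be set at import time, for example through colClasses.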

Anup Tirpude
  • 1
    The OP asked about casting it to `integer` (or `"character"`), so `"scipen"` does nothing. – r2evans May 30 '21 at 17:58
  • I tried the code, and it's working. I am adding more information to the answer to clarify the topic. – Anup Tirpude May 30 '21 at 18:15
  • Yes, but please realize that the `: num` indicates that the `einst` field is `numeric`, and the OP is expecting `"integer"` or `"character"`. (Plus, please don't include pictures of code, data, or results; it's better to include the raw text [for several reasons](https://meta.stackoverflow.com/a/285557).) – r2evans May 30 '21 at 18:26
  • Yeah, but once you have it as numeric, converting to integer is not a big deal; it resolved the exponent problem. Now just use as.integer() and you will have it. Without setting scipen, this column will always be shown in exponent notation. Thordis, let me know whether this answers your question or if you need any more help. – Anup Tirpude May 30 '21 at 18:32
  • The use of `scipen=` does no harm, but it does nothing to change the storage of the object: that column is still `"numeric"`, which is specifically something the OP requested to change. I agree that `scipen=` addresses concerns about scientific notation, but the underlying problem is not about rendering/presentation on the console, it's about storage class. – r2evans May 30 '21 at 18:38