I could really use some help here with my RStudio.

I am trying out this analysis and seem to have a problem converting the data type of certain variables.

library(tidyverse)
library(lubridate)
library(ggplot2)
library(magrittr)

Nov2020 <- read_csv("202011-divvy-tripdata.csv")
str(Nov2020)

The output is as follows:

spec_tbl_df [259,716 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
 $ rideable_type     : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" "2020-11-01 00:34:05" "2020-11-01 00:45:16" ...
 $ ended_at          : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" "2020-11-01 01:03:06" "2020-11-01 00:54:31" ...
 $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
 $ start_station_id  : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
 $ end_station_name  : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
 $ end_station_id    : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
 $ start_lat         : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
 $ end_lat           : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
 $ member_casual     : chr [1:259716] "casual" "casual" "casual" "casual" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_double(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_double(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

As you can see, 'start_station_id' and 'end_station_id' were both parsed as double (col_double()). I need to convert them to character type so that I can stack this dataset with the other months' data.

Nov2020 %>%
  mutate(start_station_id=as.character(start_station_id),
         end_station_id=as.character(end_station_id))

After applying that step, the output is as follows:

# A tibble: 259,716 x 13
   ride_id   rideable_type started_at          ended_at            start_station_na~ start_station_id end_station_name 
   <chr>     <chr>         <dttm>              <dttm>              <chr>             <chr>            <chr>            
 1 BD0A6FF6~ electric_bike 2020-11-01 13:36:00 2020-11-01 13:45:40 Dearborn St & Er~ 110              St. Clair St & E~
 2 96A7A7A4~ electric_bike 2020-11-01 10:03:26 2020-11-01 10:14:45 Franklin St & Il~ 672              Noble St & Milwa~
 3 C61526D0~ electric_bike 2020-11-01 00:34:05 2020-11-01 01:03:06 Lake Shore Dr & ~ 76               Federal St & Pol~
 4 E533E89C~ electric_bike 2020-11-01 00:45:16 2020-11-01 00:54:31 Leavitt St & Chi~ 659              Stave St & Armit~
 5 1C9F4EF1~ electric_bike 2020-11-01 15:43:25 2020-11-01 16:16:52 Buckingham Fount~ 2                Buckingham Fount~
 6 7259585D~ electric_bike 2020-11-14 15:55:17 2020-11-14 16:44:38 Wabash Ave & 16t~ 72               Lake Shore Dr & ~
 7 91FE5C8F~ electric_bike 2020-11-14 16:47:29 2020-11-14 17:03:03 Lake Shore Dr & ~ 76               Wabash Ave & 16t~
 8 9E7A79AD~ electric_bike 2020-11-14 16:04:15 2020-11-14 16:19:33 NA                NA               NA               
 9 A5B02C0D~ electric_bike 2020-11-14 16:24:09 2020-11-14 16:51:34 Marshfield Ave &~ 58               Larrabee St & Ar~
10 8234407C~ electric_bike 2020-11-14 01:24:22 2020-11-14 01:31:42 Clark St & 9th S~ 394              Michigan Ave & 1~
# ... with 259,706 more rows, and 6 more variables: end_station_id <chr>, start_lat <dbl>, start_lng <dbl>,
#   end_lat <dbl>, end_lng <dbl>, member_casual <chr>

You can see that both fields are now of character type (<chr>), which is what I want.

However, when I run str() again, the data type is unchanged from the original: both columns are still num, with col_double() in the spec.

str(Nov2020)
spec_tbl_df [259,716 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ride_id           : chr [1:259716] "BD0A6FF6FFF9B921" "96A7A7A4BDE4F82D" "C61526D06582BDC5" "E533E89C32080B9E" ...
 $ rideable_type     : chr [1:259716] "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : POSIXct[1:259716], format: "2020-11-01 13:36:00" "2020-11-01 10:03:26" "2020-11-01 00:34:05" "2020-11-01 00:45:16" ...
 $ ended_at          : POSIXct[1:259716], format: "2020-11-01 13:45:40" "2020-11-01 10:14:45" "2020-11-01 01:03:06" "2020-11-01 00:54:31" ...
 $ start_station_name: chr [1:259716] "Dearborn St & Erie St" "Franklin St & Illinois St" "Lake Shore Dr & Monroe St" "Leavitt St & Chicago Ave" ...
 $ start_station_id  : num [1:259716] 110 672 76 659 2 72 76 NA 58 394 ...
 $ end_station_name  : chr [1:259716] "St. Clair St & Erie St" "Noble St & Milwaukee Ave" "Federal St & Polk St" "Stave St & Armitage Ave" ...
 $ end_station_id    : num [1:259716] 211 29 41 185 2 76 72 NA 288 273 ...
 $ start_lat         : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num [1:259716] -87.6 -87.6 -87.6 -87.7 -87.6 ...
 $ end_lat           : num [1:259716] 41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num [1:259716] -87.6 -87.7 -87.6 -87.7 -87.6 ...
 $ member_casual     : chr [1:259716] "casual" "casual" "casual" "casual" ...
 - attr(*, "spec")=
  .. cols(
  ..   ride_id = col_character(),
  ..   rideable_type = col_character(),
  ..   started_at = col_datetime(format = ""),
  ..   ended_at = col_datetime(format = ""),
  ..   start_station_name = col_character(),
  ..   start_station_id = col_double(),
  ..   end_station_name = col_character(),
  ..   end_station_id = col_double(),
  ..   start_lat = col_double(),
  ..   start_lng = col_double(),
  ..   end_lat = col_double(),
  ..   end_lng = col_double(),
  ..   member_casual = col_character()
  .. )
 - attr(*, "problems")=<externalptr>

Am I missing something here? I tried renaming the dataset to a new name after mutating: 'Nov2020_v2' for example, but the result is the same.

Because of this issue, I can't proceed with my analysis and stack this dataset with the other months' data, where these 2 variables are of character type.

Any help will be greatly appreciated! Thanks!

Jiawei
  • Running a chain of `dplyr` commands produces output but does not on its own change the input. You probably want `Nov2020 <- Nov2020 %>% mutate(start_station_id= ...` to assign the output to the input table. – Jon Spring Nov 16 '21 at 05:29
  • Hi Jon, thanks for your prompt reply. I tried that and noticed something: after re-running `str(Nov2020)`, the column listing shows `$ start_station_id : chr [1:259716] "110" "672" "76" "659" ...`, while the attr(*, "spec")= section still shows `start_station_id = col_double(),`. Any idea why the two sections show different variable types? Nonetheless, after applying that step I can now stack this with the other months' datasets, so thanks for your help! Just the above question remains in my mind. – Jiawei Nov 16 '21 at 05:41
  • I believe the "spec" part is metadata that records what column type was assumed by `read_csv` when you first brought it in. Can ignore. – Jon Spring Nov 16 '21 at 15:24
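
The point made in the comments above can be sketched on a small in-memory table. This is a minimal illustration, not the original dataset: the inline CSV text and the name `trips` are made up for the example, and reading literal data with `I()` assumes readr 2.0 or later.

```r
library(readr)
library(dplyr)

# Hypothetical three-row stand-in for the trip data (values are illustrative)
csv_text <- "ride_id,start_station_id,end_station_id\nA1,110,211\nB2,672,29\nC3,NA,NA\n"
trips <- read_csv(I(csv_text), show_col_types = FALSE)

# Without assignment, the converted tibble is only printed; `trips` is unchanged
trips %>%
  mutate(start_station_id = as.character(start_station_id))
is.numeric(trips$start_station_id)

# Assigning the result back makes the conversion stick
trips <- trips %>%
  mutate(across(c(start_station_id, end_station_id), as.character))
is.character(trips$start_station_id)

# As reported in the thread, str() may still list col_double() under
# attr(*, "spec"); that records how the file was parsed, not the current types
str(trips)
```

The column listing at the top of the `str()` output is what reflects the data as it currently stands.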

1 Answer

You can also specify the variable type when the data is read using read_csv.

Something like this:

Nov2020 <- read_csv("202011-divvy-tripdata.csv",
                    col_types = cols(start_station_id = col_character(),
                                     end_station_id = col_character()))

See ?read_csv for more information.
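
Since the goal in the question is to stack several months, one option is to define the column types once and reuse them for every file. The sketch below assumes readr 2.0 or later (for `I()` literal data) and uses made-up inline CSV text in place of the real monthly files; the names `id_types`, `nov`, `dec` and `all_trips`, and the station ID values, are hypothetical.

```r
library(readr)
library(dplyr)

# Shared specification: only the two ID columns are forced to character,
# every other column is still guessed by read_csv
id_types <- cols(
  start_station_id = col_character(),
  end_station_id   = col_character()
)

# Hypothetical stand-ins for two monthly files; the second one has a
# non-numeric station ID, which would otherwise be guessed as character
nov_text <- "ride_id,start_station_id,end_station_id\nA1,110,211\nB2,672,29\n"
dec_text <- "ride_id,start_station_id,end_station_id\nC3,TA1309000007,13022\n"

nov <- read_csv(I(nov_text), col_types = id_types)
dec <- read_csv(I(dec_text), col_types = id_types)

# The ID columns now have the same type in both tables, so they stack cleanly
all_trips <- bind_rows(nov, dec)
all_trips
```

With real files, the `I(...)` arguments would be replaced by the file paths, keeping the same `col_types` for each call.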

Another approach is to import the CSV file via the RStudio interface and alter the column types in the preview.

neilfws
  • Hi, Thanks for your suggestion. You mentioned in your last statement we can alter the column types in the preview, how do we go about doing that? Any links you can share with me please? Thanks! – Jiawei Nov 16 '21 at 06:03
  • You just choose "import dataset" in RStudio, or navigate to the file in the File pane and click on it. When the preview opens, click in the column header to change the type. – neilfws Nov 16 '21 at 09:27
  • Thank you so much! I've not used this function before and it is really useful! Cheers! – Jiawei Nov 19 '21 at 07:38