Using dcast (reshape2) to cast data to wide format populates cells with counts instead of values

Question

I have a dataframe in long format (https://docs.google.com/spreadsheets/d/15jDW1pCYs7VD6MAH9GHmYrP-zf8WjsGj9F3qVXGijDM/edit?usp=sharing) that looks like this:

      objectid  timestamp code_bbch_surveyed
   1:   702509 2018-03-23                 NA
   2:   702509 2018-03-23                 NA
   3:   702509 2018-03-23                 NA
   4:   702509 2018-03-23                 NA
   5:   702509 2018-03-23                 NA
  ---                                       
5581:   293171 2018-10-17               GMA3
5582:   293171 2018-10-17               GMA3
5583:   293171 2018-10-17               GMA3
5584:   293171 2018-10-17               GMA3
5585:   293171 2018-10-17               GMA3

I want to cast it to wide format so that every row is a unique objectid, every column is a unique timestamp and the cells are populated by the respective code_bbch_surveyed.

I've tried what seems the most logical implementation of dcast like so:

dcast(setDT(df_scr), objectid ~ timestamp, value.var = 'code_bbch_surveyed')

but this produces an output data.frame/data.table where each cell is populated by the COUNT (number of instances). I do NOT want to count the instances; I simply want to populate each cell with the value in code_bbch_surveyed.

So instead of output like this (row 1):

        objectid 2018-03-23 2018-04-23 2018-05-21 2018-06-20 2018-07-09 2018-08-15 2018-09-20 2018-10-17
 1:     8100         27         22         16         14         15         14         12         15

I would like to see output like this (row 1):

        objectid 2018-03-23 2018-04-23 2018-05-21 2018-06-20 2018-07-09 2018-08-15 2018-09-20 2018-10-17
 1:     8100         SCR2         WWH3    [null]     [null]     [null]    [null]     [null]     [null]

The problem is that each combination of objectid and timestamp has more than one value and dcast() defaults to aggregate these with length(). If you get rid of the duplicate rows, your code should work. — Gilean0709, Apr 11 '19 at 13:55
Suspecting this I checked at random for some combinations and they all checked out as single value. I will dig deeper! — Momchill, Apr 11 '19 at 14:00
But in your example you can see for the two combinations there, that at least 5 values exist for each combination. — Gilean0709, Apr 11 '19 at 14:04
I'm not following, sorry. According to my intuition there should be only one value (`code_bbch_surveyed`) for each `objectid ~ timestamp` combination. So for 2018-03-23 and objectid 8100 the value is SCR2, for 2018-04-23 it is WWH3, etc, etc — Momchill, Apr 11 '19 at 14:09
Just by looking at your example data, you can see that for 2018-10-17 and objectid 293171 you have the value 'GMA3' at least 5 times. The value might be unique, but its occurrence is not. It still counts as 5 values. Therefore if you use `df_scr[!duplicated(df_scr), ]` in dcast() it should work. — Gilean0709, Apr 11 '19 at 14:18
Aaaaah, clarity! So this works only for single value in the sense of occurrence and not just variety in general. Alright. Yes, using `duplicated` and then removing some pesky `NA`s that were lurking around my data did the trick. Thank you! — Momchill, Apr 11 '19 at 14:23
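
The fix discussed in the comments can be sketched as follows. This is a minimal illustration, not the asker's actual data: the toy `df_scr` below is hypothetical, and the names `dup_counts`, `df_unique` and `wide` are made up for the example. It assumes the data.table version of `dcast` (as implied by `setDT` in the question) and uses `unique()`, which for a data.table should be equivalent to the `df_scr[!duplicated(df_scr), ]` suggested above.

```r
library(data.table)

# Hypothetical toy data mimicking the question: repeated rows and a stray NA
df_scr <- data.table(
  objectid  = c(8100, 8100, 8100, 8100, 8100, 293171, 293171),
  timestamp = as.Date(c("2018-03-23", "2018-03-23", "2018-04-23",
                        "2018-04-23", "2018-05-21", "2018-10-17", "2018-10-17")),
  code_bbch_surveyed = c("SCR2", "SCR2", "WWH3", "WWH3", NA, "GMA3", "GMA3")
)

# Diagnose: any objectid/timestamp pair with more than one row makes
# dcast() fall back to length() as the aggregate function
dup_counts <- df_scr[, .N, by = .(objectid, timestamp)][N > 1]
print(dup_counts)

# Drop rows with a missing code, then drop fully duplicated rows
df_unique <- unique(df_scr[!is.na(code_bbch_surveyed)])

# With one row per combination, the cells hold the value instead of a count
wide <- dcast(df_unique, objectid ~ timestamp, value.var = "code_bbch_surveyed")
print(wide)
```

If a combination could still hold several distinct codes after de-duplication, `dcast` would again default to `length()`; in that case an explicit `fun.aggregate` would be needed, and which one is appropriate depends on the data.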
