Long to wide format conversion gives duplicate time identifier rows

Question

See the chat transcript before voting to close please

I have the following data:

> dput(head(q,10))
structure(list(Date = structure(c(1471424400, 1471424400, 1471424400, 
1471424401, 1471424401, 1471424406, 1471424407, 1471424415, 1471424417, 
1471424514), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Type = c("ASK", "BID", "ASK", "BID", "ASK", "ASK", "BID", 
    "BID", "BID", "ASK"), Price = c(1749.95, 1611, 1683.9, 1653, 
    1672, 1683.9, 1653, 1654.2, 1663, 1682)), .Names = c("Date", 
"Type", "Price"), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Then I used the following code:

data.new <- group_by(head(q,10), Date, Type) %>% summarize(price=ifelse(Type[1] == 'ASK', min(Price)

Then used

dputed<-dput(ungroup(data.new))


dput(dputed)
structure(list(Date = structure(c(1471424400, 1471424400, 1471424401, 
1471424401, 1471424406, 1471424407, 1471424415, 1471424417, 1471424514
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Type = c("ASK", 
"BID", "ASK", "BID", "ASK", "BID", "BID", "BID", "ASK"), price = c(1683.9, 
1611, 1672, 1653, 1683.9, 1653, 1654.2, 1663, 1682)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -9L), .Names = c("Date", 
"Type", "price"))

I want to convert it into wide format, where Date, Ask, and Bid are three columns and rows contain both ask and bid vlaues for a particular time stamp. This is what I have tried:

m.q<-dcast(dputed,Date ~ Type, value.var = "price")

But the result have duplicate time stamps. Please notice the time stamp (2016-08-17 09:00:06) in two adjacent rows. Due to which bid and aks are not in the same row:

> dput(m.q)
structure(list(Date = structure(c(1471424400, 1471424401, 1471424406, 
1471424407, 1471424415, 1471424417, 1471424514), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), ASK = c(1683.9, 1672, 1683.9, NA, 
NA, NA, 1682), BID = c(1611, 1653, NA, 1653, 1654.2, 1663, NA
)), .Names = c("Date", "ASK", "BID"), row.names = c(NA, -7L), class = "data.frame")

Please notice when I use complete.cases() I get only 2 rows instead of 3 as 2016-08-17 09:00:06 get deleted because bid and ask values get bifurcated into two rows with the same time stamp.

johny<- m.q[complete.cases(m.q),]
> dput(johny)
structure(list(Date = structure(c(1471424400, 1471424401), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), ASK = c(1683.9, 1672), BID = c(1611, 
1653)), .Names = c("Date", "ASK", "BID"), row.names = 1:2, class = "data.frame")

I have no idea about it. I am also getting error. Please use the link to r data file — Polar Bear, Aug 24 '16 at 12:52
Same question, same user, will you ignore the correct answer again? — NJBurgo, Aug 24 '16 at 12:55
@NJBurgo I am really sorry if you felt bad but the data you generate worked with your solution and not for my data. I humbly request you to use my data. Again I understand that dput is throwing error but help me solve this issue rather than cursing me. I hope you will help me through this. Thank you and respect for the effort you did. — Polar Bear, Aug 24 '16 at 13:00
@Sotos Do you suggest to ask a question why my data is giving error for dput()? — Polar Bear, Aug 24 '16 at 13:07
Why it gives the error is obvious: `vars = list(Date)`. I'm just not sure it's there because you are using one of Hadley's data classes or because you doctored with it. — Roland, Aug 24 '16 at 13:09
@Roland I did not doctored it. This is how I got the data.new data.new <- group_by(head(q,10), Date, Type) %>% summarize(price = ifelse(Type[1] == 'ASK', min(Price), max(Price))) — Polar Bear, Aug 24 '16 at 13:12
Look again, in the m.q dataframe the rows all have a unique timestamp: c(1471424400, 1471424401, 1471424406, 1471424407, 1471424415, 1471424417, 1471424514) — Dave2e, Aug 24 '16 at 13:33
Okay. I have checked dput intake and now its working after using ungroup(data.new). Please hve again used the dcast() but gain view() in Rstudio shows two rows with the same 2016-08-17 09:00:06 timestamp. — Polar Bear, Aug 24 '16 at 13:37
I don't get any duplicates... I get this for Date: `1471424400, 1471424401, 1471424406, 1471424407, 1471424415, 1471424417, 1471424514` — Sotos, Aug 24 '16 at 13:46
I think I get what you're saying now. The best Ask `1683.9` is repeated in row 1 and 3. It should be `1672.0` or lower. And you have a duplicate best bid of `1653.0` at rows 2 and 4. Is this correct? — Pierre L, Aug 24 '16 at 14:02
the screen capture is not the `dput` you gave and not the result of `dcast(UG,Date ~ Type, value.var = "price")` with the `UG` you gave (those 2 are the same though) — Cath, Aug 24 '16 at 14:02
@PierreLafortune please notice 3 and 4 rows they have exact date and time. So Instead of two rows it should be one and bid and ask should should be in that row with a single time stamp. — Polar Bear, Aug 24 '16 at 14:07
restart and try the code again. We are obviously not getting that. If you copy and paste your own code that you posted, the result is different than what you are showing in the picture — Pierre L, Aug 24 '16 at 14:08
@PierreLafortune I have restarted and run it again. But same result again. I can also that 3 and 4 row have different time stamps but view() shows result. — Polar Bear, Aug 24 '16 at 14:37
I have used `View(m.q)` and I cannot reproduce the behavior you are claiming. — Pierre L, Aug 24 '16 at 14:40
I don't understand why you cannot see why we are not getting that result. It is right there in the dput `c(1471424400, 1471424401, 1471424406, 1471424407,..)`. Do you see **1471424406** and **1471424407**?? They are DIFFERENT. — Pierre L, Aug 24 '16 at 14:41
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/121757/discussion-between-pierre-lafortune-and-polar-bear). — Pierre L, Aug 24 '16 at 15:09

Pierre L · Answer 1 · 2016-08-24T16:59:29.133

See edit below:

The code that you have provided does not exhibit the behavior you are claiming:

m.q
                 Date    ASK    BID
1 2016-08-17 09:00:00 1683.9 1611.0
2 2016-08-17 09:00:01 1672.0 1653.0
3 2016-08-17 09:00:06 1683.9     NA
4 2016-08-17 09:00:07     NA 1653.0
5 2016-08-17 09:00:15     NA 1654.2
6 2016-08-17 09:00:17     NA 1663.0
7 2016-08-17 09:01:54 1682.0     NA

Using the view command:

View(m.q)

Let's check that column only:

m.q$Date
[1] "2016-08-17 09:00:00 UTC" "2016-08-17 09:00:01 UTC" "2016-08-17 09:00:06 UTC"
[4] "2016-08-17 09:00:07 UTC" "2016-08-17 09:00:15 UTC" "2016-08-17 09:00:17 UTC"
[7] "2016-08-17 09:01:54 UTC"

To check the duplicates also:

duplicated(m.q$Date)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

There is no other way to tell you that the data you provided is NOT duplicated.

Edit

Hold the presses! Don't delete yet. There is something wrong with the display. After receiving your raw data:

library(readxl)
q <- read_excel("~/Data/3_day_1_stock.xlsx", sheet=1, skip=1)
dput(head(q, 10))
# # A tibble: 10 x 3
#                   Date  Type   Price
#                 <time> <chr>   <dbl>
# 1  2016-08-17 09:00:00   ASK 1749.95
# 2  2016-08-17 09:00:00   BID 1611.00
# 3  2016-08-17 09:00:00   ASK 1683.90
# 4  2016-08-17 09:00:01   BID 1653.00
# 5  2016-08-17 09:00:01   ASK 1672.00
# 6  2016-08-17 09:00:06   ASK 1683.90
# 7  2016-08-17 09:00:06   BID 1653.00
# 8  2016-08-17 09:00:14   BID 1654.20
# 9  2016-08-17 09:00:17   BID 1663.00
# 10 2016-08-17 09:01:54   ASK 1682.00

It appears that the dates in rows 6 and 7 are repeated. But upon further inspection they are different:

dput(head(q, 10))
structure(list(Date = structure(c(1471424400, 1471424400, 1471424400, 
1471424401, 1471424401, 1471424406, **1471424407**__, 1471424415, 1471424417, 
1471424514), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
    Type = c("ASK", "BID", "ASK", "BID", "ASK", "ASK", "BID", 
    "BID", "BID", "ASK"), Price = c(1749.95, 1611, 1683.9, 1653, 
    1672, 1683.9, 1653, 1654.2, 1663, 1682)), .Names = c("Date", 
"Type", "Price"), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

We may have to write an issue report.

Would you please recheck from begining using dput(head(q,10)) portion — Polar Bear, Aug 24 '16 at 14:55
Sadly, I swear by Almighty that whatever I have posted in result part of the question is true. I also accpet what ever you are showing is also true. Please help what are my possible options now. For some background you may check my earlier question: http://stackoverflow.com/questions/39086065/how-to-take-minimum-value-out-of-several-ask-quotes-and-maximum-value-out-of-sev — Polar Bear, Aug 24 '16 at 15:05

score 1 · Accepted Answer · answered Aug 25 '16 at 13:52

Here are the results of observations after the chat session. It is important to notice the problem arise because of an excel import.

First of all it is important to understand how excel store date/time values internally, they're (if you open the sheet xml file) looking like 42599.3750694444 where the integer part is the number of days since January 1st 1900 and the decimal part is the faction (percentage) of the 24h of the day (86400 seconds).

What happens when we import this into R is that you need a conversion to be done from this format to the unix format (number of seconds since January 1st 1970).

Obvisouly there's the usual Floating Point approximation which is visible if you take the values with sprintf:

> sprintf("%.10f",t$Date[5:8])
[1] "1471424401.0000002384" "1471424406.0000002384" "1471424406.9999997616" "1471424414.9999997616"

What happens next is when you 'print' a POSIXct value, the underlying code use time.h from the C library, where time_t is supposed to be a number of seconds. The problem arise here I think, the value is truncated to it's integer part, hence showing a false 9:00:06 time instead of 9:00:07.

POSIX classes in R are supposed to handle fractionnal seconds, but you can only see them with a custom format to strftime:

> strftime(t$Date[5:8],"%Y/%m/%d %H:%M:%OS6")
[1] "2016/08/17 11:00:01.000000" "2016/08/17 11:00:06.000000" "2016/08/17 11:00:06.999999" "2016/08/17 11:00:14.999999"

Your main concern is the printing of the same second for different values, the rest of the functions works as expected using different values for rows 6 and 7 as they are indeed different.

I see two possible Workarounds:

Convert the column to numeric, round it, convert back to POSIXct:

t$Date <- as.POSIXct(round(as.numeric(t$Date)),origin='1970-01-01',tz='UTC')

Export your excel sheet to csv, import it with read_csv and then cast the Date column to date:

z <- read.csv2("c:/Downloads/3_day_1_stock.csv", skip=1, stringsAsFactors=FALSE)
z$Date <- as.POSIXct(z$Date,format="%m/%d/%Y %H:%M:%S",tz='UTC')

For all cases, instead of your dput hack, use skip=1 as parameter to read_* to skip your first line, then the functions will detect proper columns.

Long to wide format conversion gives duplicate time identifier rows

See the chat transcript before voting to close please

2 Answers2