
I am reading in an ndjson file (~1 GB) with large IDs. The IDs are around 19 digits long and lose precision when streamed in: the last 4-5 digits differ. How can I avoid this? Thank you!

library(jsonlite)
data_out <- data.frame(userID = c(1123581321345589000, 3141592653589793000, 2718281828459045000),
                       variable = c("a", "b", "c"))

con_out <- file("test_output.json", open = "wb")
jsonlite::stream_out(data_out, con_out, auto_unbox = T)
close(con_out)

con_in <- file("test_output.json")
data_in <- jsonlite::stream_in(con_in)

> format(data_in$userID, scientific = F)
[1] "1123581321345590016" "3141592653589790208" "2718281828459039744"

Edit: I have no control over the input file or its format. If I open the input file in an editor, the IDs are correct. The "error" happens when streaming in.

qwertzuiop
  • These values are beyond what even 64-bit floating point can represent exactly: doubles can only store consecutive integers up to 9,007,199,254,740,992 (2^53) without losing precision, as the quick check below illustrates. Beyond that point you need to treat IDs this large differently: either code them as strings, if that is sufficient for sorting/arranging, or explore packages like bignum on CRAN. – John Garland Jun 01 '22 at 09:07
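
A quick way to see the 2^53 limit described in that comment is to compare neighbouring integers in the R console:

# Doubles represent consecutive integers exactly only up to 2^53
2^53 == 2^53 + 1   # TRUE  : 2^53 + 1 rounds back down to 2^53
2^53 - 1 == 2^53   # FALSE : below the limit, integers are still distinct

# A 19-digit ID is far above that limit, so two distinct IDs can
# collapse to the same double before jsonlite ever sees them
1123581321345589000 == 1123581321345589001  # TRUE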

1 Answer


You could convert userID to character:

library(jsonlite)
data_out <- data.frame(userID = c(1123581321345589000, 3141592653589793000, 2718281828459045000),
                       variable = c("a", "b", "c"))

# Convert to character
data_out$userID <- as.character(data_out$userID)

con_out <- file("test_output.json", open = "wb")
jsonlite::stream_out(data_out, con_out, auto_unbox = T)
#> Complete! Processed total of 3 rows.
close(con_out)

con_in <- file("test_output.json")
data_in <- jsonlite::stream_in(con_in)
#> opening file input connection.
#>  Found 3 records... Imported 3 records. Simplifying...
#> closing file input connection.

identical(data_in, data_out)
#> [1] TRUE
Waldi
  • True, but unfortunately I have no control over the input file or its format. If I open the input file in an editor, the IDs are correct. The "error" happens when streaming in. – qwertzuiop Jun 01 '22 at 09:02
  • If reading them as strings doesn't give you the power you need (and it might, if you examine your actual needs for these identifiers), try looking at the functions in either the bignum package or the newer gmp package on CRAN, which deal with arbitrarily large integers and very-high-precision floats. – John Garland Jun 01 '22 at 09:19
  • The IDs are not saved as characters and I have no control over that. But reading them as strings would be perfect, I just don't know how to force stream_in to do that (a sketch of one way to do this follows after these comments). I'm also going to have a look at the packages you recommended, thank you! – qwertzuiop Jun 01 '22 at 09:25
  • You might get by with an int64-capable package like vroom in the tidyverse, which gives true 64-bit integers in R. That would get you up to 20 places (2^64 = 18,446,744,073,709,551,616). It is supposed to be well integrated into the whole tidyverse. – John Garland Jun 01 '22 at 09:29
  • If big integers don't work, direct parsing might be a solution, see https://stackoverflow.com/questions/65009389/parse-json-with-big-integer-in-r – Waldi Jun 01 '22 at 10:07
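
Building on the comments above, here is a minimal sketch of forcing stream_in to see the IDs as strings. It assumes the input is newline-delimited JSON and that the ID field is literally named "userID" (both taken from the example above; adapt the regex and file path to the real data). The unquoted 19-digit numbers are wrapped in quotes in the raw text before jsonlite parses them, so they never pass through a double:

library(jsonlite)

# Quote the big integers in the raw ndjson text before parsing, so
# jsonlite reads them as character instead of double
raw_lines <- readLines("test_output.json")
quoted    <- gsub('("userID"\\s*:\\s*)(-?\\d{16,})', '\\1"\\2"', raw_lines, perl = TRUE)

data_in <- jsonlite::stream_in(textConnection(quoted))
str(data_in)  # userID is now character, with all 19 digits intact

For a ~1 GB file the substitution could be applied chunk-wise rather than via a single readLines() call, and the resulting character IDs could later be converted with bit64::as.integer64() if integer arithmetic (rather than just matching and sorting) is needed.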