0

I am trying to extract my data from a csv file into R. The data is currently formatted in the following:

,"[{time=2014-01-01T00:00:00, NDVI=0.3793765496776215}, {time=2014-02-01T00:00:00, NDVI=...

,"[{time=2014-01-01T00:00:00, NDVI=0.4071076986818826}, {time=2014-02-01T00:00:00, ...

,"[{time=2014-01-01T00:00:00, NDVI=0.3412131556625801}, {time=2014-02-01T00:00:00, NDVI=...

Each of these lines is data for a different region. I want to have it in this format:

Region [1]

Time       NDVI

[1]        [1]

[2]        [2]

[3]        [3]

[...]      [...]


Region [2]

Time       NDVI

[1]        [1]

[2]        [2]

[3]        [3]

[...]      [...]


Region [3]

Time       NDVI

[1]        [1]

[2]        [2]

[3]        [3]

[...]      [...]

How can I do this?

Nakx
  • 1,460
  • 1
  • 23
  • 32
eviep
  • 17
  • 5
  • What efforts have you made so far and where do you get stuck? What you have there may be in JSON format and R has packages that can parse that. – Roman Luštrik Mar 24 '20 at 06:52

1 Answers1

0

Maybe there is a package that can parse this. However, you could do some data transformation using the tidyverse package.

You can read in your data with readLines():

dat <- readLines("test.txt")

Which in this case looks like this:

dat <- c(",\"[{time=2014-01-01T00:00:00, NDVI=0.3793765496776215}, {time=2014-02-01T00:00:00, NDVI=2}]\"", 
"", ",\"[{time=2014-01-01T00:00:00, NDVI=0.4071076986818826}, {time=2014-02-01T00:00:00, NDVI=3}]\"", 
"", ",\"[{time=2014-01-01T00:00:00, NDVI=0.3412131556625801}, {time=2014-02-01T00:00:00, NDVI=4}]\""
)

Then you can do some data transformations with a for loop and store the result in a list.

library(tidyverse)
dat <- c(",\"[{time=2014-01-01T00:00:00, NDVI=0.3793765496776215}, {time=2014-02-01T00:00:00, NDVI=2}]\"", 
         "", ",\"[{time=2014-01-01T00:00:00, NDVI=0.4071076986818826}, {time=2014-02-01T00:00:00, NDVI=3}]\"", 
         "", ",\"[{time=2014-01-01T00:00:00, NDVI=0.3412131556625801}, {time=2014-02-01T00:00:00, NDVI=4}]\""
)

l <- list()
counter <- 1
for (line in dat){
  if(nchar(line)>0){
    line <- as.data.frame(line) %>%
        # We need to remove some unwanted strings
      mutate(line = str_replace_all(line, 
                                    c("\\\""="", ",\\["= "", "\\]" = ""))) %>% 
        # The lines can be separated into rows where the string "}, {" occurs
               separate_rows(line, line, sep = "\\}, \\{") %>% 
        # again removing some unwanted strings
      mutate(line = str_replace_all(line, c("\\{"="", "\\}"=""))) %>% 
        # add a unique identifier for each observation
      mutate(observation = row_number()) %>% 
        # separete the rows where a "," occurs
      separate_rows(line, line, sep =",") %>% 
      separate(., line, into = c("category", "value"), sep = "=") %>% 
        # put it into the long format
      pivot_wider(names_from = category, values_from = value)
    l[[counter]] <- line
    counter <- counter+1
  }
}

l
#> [[1]]
#> # A tibble: 2 x 3
#>   observation time                ` NDVI`           
#>         <int> <chr>               <chr>             
#> 1           1 2014-01-01T00:00:00 0.3793765496776215
#> 2           2 2014-02-01T00:00:00 2                 
#> 
#> [[2]]
#> # A tibble: 2 x 3
#>   observation time                ` NDVI`           
#>         <int> <chr>               <chr>             
#> 1           1 2014-01-01T00:00:00 0.4071076986818826
#> 2           2 2014-02-01T00:00:00 3                 
#> 
#> [[3]]
#> # A tibble: 2 x 3
#>   observation time                ` NDVI`           
#>         <int> <chr>               <chr>             
#> 1           1 2014-01-01T00:00:00 0.3412131556625801
#> 2           2 2014-02-01T00:00:00 4

Created on 2020-03-24 by the reprex package (v0.3.0)

MKR
  • 1,620
  • 7
  • 20
  • Thank you. This worked but now I am struggling to edit the time column in this data. Is this a list style? How to I isolate and edit specific columns in different layers in this data class? I have tried 'data[2][2]' and 'data[2[2]]' but I do not seem to be able to print the time column for any layer. Thanks for your help. – eviep Mar 26 '20 at 03:30
  • sorry for the late reply. You now have tibbles nested in a list. Thus, you can either access the tibbles with the indexing operator `l[[1]]`. Then you can access the time variable by the same way you would access it in a tibble like `l[[1]][,"time"]`. Alternatively, you could also just bind the tibble to on e big tibble with `bind_rows()`. – MKR Apr 03 '20 at 09:36