1

I am working with R and the package 'elastic' to query an elastic search db containing twitter data in JSON format. The query works fine and I get the output content (out) as I expect.

class(out) 
[1] "list"

and out$hits$hits returns

> out$hits$hits
[[1]]
[[1]]$`_index`
[1] "twitter_all_geo-2014-11-01"

[[1]]$`_type`
[1] "ctweet"

[[1]]$`_id`
[1] "ubicity-twitter-160f0964-6fc7-43ef-af2a-0e1b8c8184c7"

[[1]]$`_version`
[1] 1

[[1]]$`_score`
[1] 2.10757

[[1]]$`_source`
[[1]]$`_source`$id
[1] "528330489049120770"

[[1]]$`_source`$created_at
[1] "2014-10-31T23:39:39+0000"

[[1]]$`_source`$user
[[1]]$`_source`$user$name
[1] "afterlifetemis"


[[1]]$`_source`$place
[[1]]$`_source`$place$geo_point 
[[1]]$`_source`$place$geo_point[[1]]
[1] 30.4529

[[1]]$`_source`$place$geo_point[[2]]
[1] 50.61104


[[1]]$`_source`$place$city
[1] "Ukraine"

[[1]]$`_source`$place$country
[1] "Ukraine"

[[1]]$`_source`$place$country_code
[1] "UA"

[[1]]$`_source`$msg
[[1]]$`_source`$msg$text
[1] "u had one job artemis\none"

[[1]]$`_source`$msg$lang
[1] "EN"

[[1]]$`_source`$msg$hash_tags
list()

[[2]]
[[2]]$`_index`
[1] "twitter_all_geo-2014-11-01"

[[2]]$`_type`
[1] "ctweet"
...
...

Basically I wanted to save the data as .csv file, so I entered

> write.csv(out$hits$hits,'out.csv')
Error in data.frame(text = "u had one job artemis\none", lang = "EN",   : arguments imply differing number of rows: 1, 0

I assumed that it is necessary to convert it to an data.frame, so I tried:

> df <- ldply (out, data.frame)

Error in data.frame(text = "u had one job artemis\none", lang = "EN", : arguments imply differing number of rows: 1, 0

(I tried several other, optimistc, attempts too like this one:)

> t(sapply(out$hits$hits, '[', 1:max(sapply(out$hits$hits, length))))
  _index                       _type    _id                                                        _version _score  _source
[1,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-160f0964-6fc7-43ef-af2a-0e1b8c8184c7" 1        2.10757 List,5 
[2,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-ba071fff-cafb-4d3f-947d-13c934905c1b" 1        2.10757 List,5 
[3,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-dd64af32-4d59-4008-a3db-74471ad269d1" 1        2.10757 List,5 
[4,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-4ba0d3d0-642d-4f9f-aaf9-c55929c35dc4" 1        2.10757 List,5 
[5,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-d7b8cbbc-87b3-44b5-8c9c-91c7b62f1458" 1        2.10757 List,5 
[6,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-76353a7c-44c9-4863-a59d-adb16716ca18" 1        2.10757 List,5 
[7,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-2aec0798-9918-4b66-9b2a-ef5a4d1f3711" 1        2.10757 List,5 
[8,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-c9e7637d-358a-40ee-a06c-85af04c22191" 1        2.10757 List,5 
[9,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-8928c1ef-f46a-4682-99c4-4dbc55270b03" 1        2.10757 List,5 
[10,] "twitter_all_geo-2014-11-01" "ctweet" "ubicity-twitter-d6b19975-b310-46c4-af11-af56971b7c4b" 1        2.10757 List,5 

And in the beginning it looked good, but the actual tweet message isn't anymore in the matrix

I was optimistic and thought maybe convert it first (back) to JSON (using RJSON)

toJSON(out) Error in toJSON(out) : unable to escape string. String is not utf8

At the end I have a list and can not save, can not convert to JSON, data.frame or data.table (because it is not uniform). Does anyone can give me an hint on a) convert it to JSON or on how to save the list to a .csv file or to put it in a data.frame?

Thanks a lot, I think I don't understand it.

-Tobias

2 Answers2

2

I think unlist() and matrix() can do the job.

An example converting the Search()-return out into data frame:

# get the first 3 hits from elasticsearch store
out <- Search(index="shakespeare", size=3)

# (optional) verify that all hits expand to the same length
# (should be true for data intended to be in a table format)
stopifnot(
    sapply(
        out$hits$hits, 
        function(x) {!(length(unlist(x)) - length(unlist(out$hits$hits[[1]])))}
    )
)

# count number of columns, use unlist() to convert 
# nested lists to a vector, use the first hit as proxy
nColumns <- length(unlist(out$hits$hits[[1]]))

# fetch column names ... as above
nNames <- names(unlist(out$hits$hits[[1]]))

# unlist all hits and convert to matrix with ncol Columns, don't forget byrow=TRUE!
df <- data.frame(matrix(unlist(out$hits$hits), ncol=nColumns, byrow=TRUE))

# setting the column names
names(df) <- nNames

# do whatever you want with df
print(df)

Cheers!

dtrv
  • 693
  • 1
  • 6
  • 14
  • Thanks! I have to try this approach. What I did in between is to filter out the messages, put them in a list and then it is quite easy. But your way actually saves the complete result and was what I wanted in the beginning. – tobiokanobi Apr 29 '15 at 18:16
1

you can use "jqr" package in R. For eg:-

datacsv<-jq(out,".hits.hits[] | @csv")   

It will save your data into csv format and with the help of "jqr" you can also grep the fields that you want.

Tunaki
  • 132,869
  • 46
  • 340
  • 423
vikas chib
  • 83
  • 8