
I have the event logs loaded into Elasticsearch and I visualise them using Kibana. My event logs are actually stored in a Google BigQuery table. Currently, I dump the JSON files to a Google Cloud Storage bucket and download them to a local drive. Then, using Logstash, I move the JSON files from the local drive to Elasticsearch.
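For reference, the export step of this process can itself be scripted rather than done by hand. A minimal sketch using the google-cloud-bigquery Java client (the project, dataset, table and bucket names below are placeholders):

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.ExtractJobConfiguration;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.TableId;

public class ExportEventLogs {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder names: substitute your own project, dataset, table and bucket.
    TableId table = TableId.of("my-project", "my_dataset", "event_logs");
    String destination = "gs://my-bucket/event_logs/export-*.json";

    // Ask BigQuery to export the table to GCS as newline-delimited JSON.
    ExtractJobConfiguration config =
        ExtractJobConfiguration.newBuilder(table, destination)
            .setFormat("NEWLINE_DELIMITED_JSON")
            .build();

    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    if (job == null || job.getStatus().getError() != null) {
      throw new RuntimeException("BigQuery export failed: "
          + (job == null ? "job no longer exists" : job.getStatus().getError()));
    }
  }
}
```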

Now I am trying to automate the process by establishing a connection between Google BigQuery and Elasticsearch. From what I have read, I understand that there is an output connector which sends data from Elasticsearch to Google BigQuery, but not vice versa. I am wondering whether I should upload the JSON files to a Kubernetes cluster and then establish a connection between the cluster and Elasticsearch.

Any help in this regard would be appreciated.

user3447653
  • I currently do this with a convoluted process where I export JSON files that I then transfer to AWS S3 and then use Logstash with the S3 input to index into Elastic. It's less than ideal for a few reasons, one being that the indexing is quite slow. I'm working on a better solution. Keen to hear how you get on. I'll share my progress. – Raoot Sep 01 '16 at 15:38
  • As an alternative, have you tried deploying an Elasticsearch [solution](https://console.cloud.google.com/launcher/details/click-to-deploy-images/elasticsearch) on Google Compute Engine? – Alex Mar 14 '17 at 19:40
  • In fact, as described on this third-party [site](https://docs.bitnami.com/google/faq/#how-to-upload-files-to-the-server-with-sftp), it is possible to upload files directly to an Elasticsearch GCE solution using SFTP. Therefore, given your current method, you may be able to write a script that gathers the JSON files from Cloud Storage and sends them via SFTP to your Elasticsearch server. – Alex Mar 14 '17 at 19:41
  • @Macarse I'm curious what you want to do in ES that you can't do in GBQ? – Val Feb 26 '19 at 07:02
  • @Val my company is already using ES and I have my data in GBQ. – Macarse Feb 26 '19 at 13:51
  • @Macarse ok but that doesn't really explain the use case, does it? – Val Feb 26 '19 at 13:52
  • @Val I want to query all the data on the same place. – Macarse Feb 27 '19 at 14:16
  • @Macarse is that a one time job or will you do this frequently? – Val Feb 27 '19 at 14:18
  • GBQ can give you very complex reports on large datasets. I'm not sure you would even get reports that complex through Elasticsearch. Also, depending on the data size, you will spend a lot of money/resources on hosting Elasticsearch. You can also connect your GBQ to Data Studio and generate visualisations. I'm not aware of your exact use case but, personally, if I had to query data in one place and there was already enough data stored in GBQ, I'd look at pushing the Elastic data to GBQ. – Archit Saxena Feb 28 '19 at 12:29
  • Good point @ArchitSaxena! That's probably the reason why you can find many connectors that go from ES to GBQ but not so many in the opposite direction. – Val Feb 28 '19 at 14:27

4 Answers


Although this may be a little complex, I suggest a solution that uses the Google Cloud Storage Connector together with ES-Hadoop. Both are very mature and used in production by many great companies.
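A minimal sketch of that combination, assuming the GCS connector is on the Spark classpath so that gs:// paths resolve, and with placeholder bucket, host and index names:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class GcsJsonToElasticsearch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("gcs-json-to-es")
        // Placeholder Elasticsearch endpoint.
        .set("es.nodes", "my-es-host:9200");

    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // The GCS connector lets Spark read gs:// paths like any Hadoop filesystem.
      JavaRDD<String> jsonDocs = sc.textFile("gs://my-bucket/event_logs/*.json");

      // ES-Hadoop indexes the raw JSON strings directly, one document per line.
      JavaEsSpark.saveJsonToEs(jsonDocs, "event-logs/doc");
    }
  }
}
```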

Running Logstash across a lot of pods on Kubernetes would be very expensive and, I think, not a very resilient or scalable approach.

Allan Sene

Apache Beam has connectors for both BigQuery and Elasticsearch, so I would definitely do this with Dataflow; that way you don't need to implement complex ETL and staging storage. You can read the data from BigQuery using BigQueryIO.Read.from (take a look at BigQueryIO Read vs fromQuery if performance is important) and load it into Elasticsearch using ElasticsearchIO.write().
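A minimal sketch of such a pipeline, assuming the Beam GCP and Elasticsearch IO modules are on the classpath (the table reference, host and index names are placeholders):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.elasticsearch.ElasticsearchIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BigQueryToElasticsearch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadFromBigQuery",
            // Placeholder table reference.
            BigQueryIO.readTableRows().from("my-project:my_dataset.event_logs"))
     // TableRow extends GenericJson, so toString() renders the row as JSON.
     .apply("ToJson",
            MapElements.into(TypeDescriptors.strings())
                .via((TableRow row) -> row.toString()))
     .apply("IndexIntoEs",
            ElasticsearchIO.write().withConnectionConfiguration(
                ElasticsearchIO.ConnectionConfiguration.create(
                    new String[] {"http://my-es-host:9200"}, "event-logs", "doc")));

    p.run().waitUntilFinish();
  }
}
```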

Refer to this example of how to read data from BigQuery in Dataflow:

https://github.com/GoogleCloudPlatform/professional-services/blob/master/examples/dataflow-bigquery-transpose/src/main/java/com/google/cloud/pso/pipeline/Pivot.java

Elasticsearch indexing:

https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/dataflow-elasticsearch-indexer

UPDATED 2019-06-24

Earlier this year, the BigQuery Storage API was released, which improves the parallelism when extracting data from BigQuery and is natively supported by Dataflow. Refer to https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-api for more details.

From the documentation:

The BigQuery Storage API allows you to directly access tables in BigQuery storage. As a result, your pipeline can read from BigQuery storage faster than previously possible.
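In Beam terms, switching the sketch above to the Storage API is a single extra call on the read step; a hedged fragment (placeholder table, and it stands in for the "ReadFromBigQuery" step of the earlier pipeline):

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TypedRead.Method;

// Drop-in replacement for the "ReadFromBigQuery" step above:
// DIRECT_READ tells Beam to pull rows through the BigQuery Storage API.
BigQueryIO.readTableRows()
    .from("my-project:my_dataset.event_logs")
    .withMethod(Method.DIRECT_READ);
```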

hlagos

I have recently worked on a similar pipeline. The workflow I would suggest either uses the Google Cloud Storage connector mentioned above, or some other method, to read your JSON files into a Spark job. You should be able to transform your data quickly and easily, and then use the elasticsearch-spark plugin to load that data into your Elasticsearch cluster.

You can use Google Cloud Dataproc or Cloud Dataflow to run and schedule your job.
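For example, a minimal sketch of such a Spark job (bucket, host and index names are placeholders; assumes the GCS connector and the elasticsearch-spark jar are available to the cluster):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TransformAndIndex {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("transform-and-index")
        // Placeholder Elasticsearch endpoint.
        .config("es.nodes", "my-es-host:9200")
        .getOrCreate();

    // Read the exported JSON; the schema is inferred. Transformations go here.
    Dataset<Row> events = spark.read().json("gs://my-bucket/event_logs/*.json");

    // The elasticsearch-spark data source writes each Row as a document.
    events.write()
        .format("org.elasticsearch.spark.sql")
        .save("event-logs/doc");

    spark.stop();
  }
}
```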

Cory Grinstead

As of 2021, there is a Dataflow template that allows a "GCP native" connection between BigQuery and Elasticsearch. More information can be found in this blog post by elastic.co, with further documentation and a step-by-step process from Google.

Ismail H