
I wanted to use reactive Spring to index data from a SQL Server database (using R2DBC) to Elasticsearch (using reactive Elasticsearch). I have an entity class corresponding to the table in SQL Server and a model class corresponding to the documents which will be indexed by Elasticsearch.

For both databases (SQL Server & Elasticsearch), I have created repositories:

Repository for SQL Server:

@Repository
public interface ProductRepository extends ReactiveCrudRepository<ProductTbl, Long> {

}

Repository for Elasticsearch:

@Repository
public interface ProductElasticRepository extends ReactiveElasticsearchRepository<Product, String> {

}

Shouldn't I be able to index all documents by calling productElasticRepository.saveAll(productRepository.findAll())?

It doesn't quite work: either it exceeds the DataBufferLimit and throws an exception, or it fails with a ReadTimeoutException. While it runs, I can see R2DBC creating RowTokens for each row of the SQL Server table, but the POST to the Elasticsearch client only happens after all rows have been fetched, which doesn't seem to be how reactive code should work. I am definitely missing something here and hope someone can help.

marc_s
AFUEU

2 Answers


I can't pinpoint exactly what the problem is in your case, but I did something similar in the past and remember a few problems I encountered.

It depends on how much data you need to migrate. If there are millions of rows, a single unbounded stream will almost certainly fail. In that case you can use an algorithm with a window of, say, 5,000 rows: read one window, write it to Elasticsearch with a bulk insert, and repeat until there are no more rows to read.
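The windowed approach above can be sketched in plain Java. `ProductSource` and the `bulkInsert` callback are hypothetical stand-ins for the R2DBC repository and the Elasticsearch bulk call, just to show the control flow:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class WindowedMigration {

    interface ProductSource {
        // Returns up to `limit` rows starting at `offset`; empty list when exhausted.
        List<String> fetchWindow(long offset, int limit);
    }

    static void migrate(ProductSource source, int windowSize, Consumer<List<String>> bulkInsert) {
        long offset = 0;
        while (true) {
            List<String> window = source.fetchWindow(offset, windowSize);
            if (window.isEmpty()) {
                break;                     // no more rows to read
            }
            bulkInsert.accept(window);     // one bulk request per window
            offset += window.size();
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the SQL table, just to demonstrate the loop.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 12; i++) rows.add("product-" + i);

        ProductSource source = (offset, limit) ->
                rows.subList((int) Math.min(offset, rows.size()),
                             (int) Math.min(offset + limit, rows.size()));

        List<List<String>> batches = new ArrayList<>();
        migrate(source, 5, batches::add);

        System.out.println(batches.size());        // 3 windows: 5 + 5 + 2
        System.out.println(batches.get(2).size()); // last window holds the 2 leftover rows
    }
}
```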

Another problem I encountered was that the Elasticsearch WebClient wasn't configured to accept the amount of data I was sending in the request body.
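If that's what you're hitting, the reactive client's in-memory buffer limit can be raised; a sketch, assuming Spring Boot's reactive Elasticsearch client properties (names may vary by version, and 10MB is just an example value):

```yaml
spring:
  data:
    elasticsearch:
      client:
        reactive:
          # default is 1MB; raise it to fit your largest bulk payload
          max-in-memory-size: 10MB
```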

Another: Elasticsearch's write queue has a capacity of 200 by default; if you exceed it, requests are rejected with an error. This happens if you try to insert the data with too much parallelism.
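One way to avoid flooding that queue is to cap how many bulk requests are in flight at once (in Reactor you would pass a concurrency argument to `flatMap`). A plain-Java sketch of the idea, using a `Semaphore` in place of the real bulk calls:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedConcurrency {

    public static void main(String[] args) throws Exception {
        int maxInFlight = 4;                        // cap on concurrent bulk inserts
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < 100; i++) {
            pool.submit(() -> {
                try {
                    permits.acquire();              // wait for a free slot
                    int now = inFlight.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(5);            // stand-in for the bulk request
                    } finally {
                        inFlight.decrementAndGet();
                        permits.release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);

        // Never more than maxInFlight "requests" ran at the same time.
        System.out.println(peak.get() <= maxInFlight);
    }
}
```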

Another: the connection to the relational database may be interrupted at some point if it is kept open for a very long time.

Remember that Elasticsearch is not reactive by default. There is a reactive driver at this point, but it is not official.

Another: when doing the migration, try to write to a single node with not too many shards.

tzortzik
  • I thought by using reactive programming, there wouldn't be any need to implement batch processing. Looks like I am wrong with this assumption. – AFUEU Feb 09 '21 at 09:58

You should do something like

productRepository.findAll()
        .buffer(1000)
        .onBackpressureBuffer()
        .flatMap(productElasticRepository::saveAll);

Also, if you're getting a ReadTimeoutException, increase the socket timeout:

spring:
  data:
    elasticsearch:
      client:
        reactive:
          socket-timeout: 5m