
What is saved search?

Saved search is the mechanism for users who don't find their desired results in advanced search: they press the "Save My Search Criteria" button, we store the search criteria, and when matching data is later posted to the website we notify the user: "Hey, the item(s) you were looking for exist now, come and take a look."

Saved searches are useful for sites with complex search options, or sites where users may want to revisit or share dynamic sets of search results.

We already have advanced search and don't need to implement a new one; what we need is a high-performance way to implement the saved-search mechanism.

We run a website where users create about 120,000 posts per day, and we are going to implement a saved-search feature (similar to what https://www.gumtree.com/ does): users run an advanced search, don't find the content they want, and save the search criteria; when matching results later appear on the website, we notify them.

We use Elasticsearch and MySQL on our website. We haven't implemented anything yet; we are still thinking it through to find a good solution that can handle a high rate of data. The problem is the scale of the work: we have a lot of posts per day, and we expect users to use this feature heavily, so we are looking for a scenario that can handle this scale easily and with high performance.

Suggested solutions (but not the best)

  • One quick solution: save the saved searches in a saved-search index in Elasticsearch, then run a cron job that, for every saved search, runs its query against the posts index and, if there are any results, pushes a record into RabbitMQ to notify the corresponding user.

  • When a user posts an item to the website, check it against the existing saved searches in the saved-search index in Elasticsearch, and if it matches, push a record into RabbitMQ. (The main problem with this method is that every post inserted into the website could match a huge number of saved searches.)
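The first option could be sketched roughly as follows; the `created_at` field, the stored-query shape, and the index layout are assumptions for illustration, not our actual schema:

```python
def build_check_query(saved_search, last_checked_ts):
    """Wrap a stored user query with a recency filter so the periodic pass
    only matches posts created since the last check.
    `created_at` and the saved-search document shape are assumed here."""
    return {
        "query": {
            "bool": {
                "must": saved_search["query"],  # the user's saved ES query body
                "filter": [{"range": {"created_at": {"gt": last_checked_ts}}}],
            }
        },
        "size": 1,  # we only need to know whether anything new matches
    }
```

If the search returns at least one hit, the job would push a `{user_id, saved_search_id}` record onto RabbitMQ for the notifier.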

My big concern is scale and performance; I'd appreciate it if you shared your experiences and ideas about this problem with me.

My estimate of the scale

  • Saved searches expire after three months
  • At least 200,000 saved searches per day
  • So we have 9,000,000 active records

I'd appreciate it if you shared your thoughts with me.

Just FYI: we also have RabbitMQ for our queue jobs, and our ES servers are well provisioned, with 64 GB of RAM each.

Yuseferi
  • Are you saving the search results in the database? – shashi Oct 04 '17 at 09:18
  • @shashi We are in the planning stage, but yes, saving the saved-search query string in MySQL is part of the solution. – Yuseferi Oct 04 '17 at 09:28
  • Don't you think it's almost the same as an actual search on the database? – shashi Oct 04 '17 at 09:30
  • Instead of lazy loading, an actual search is quicker and more efficient. – shashi Oct 04 '17 at 09:31
  • @shashi Do you know the concept of a saved search? Read this to understand it: https://pages.ebay.com/help/buy/searches-follow.html – Yuseferi Oct 04 '17 at 09:43
  • You can filter the queries so you don't have to execute your estimated 6 million queries for every new post (or every several thousand new posts depending on your desired search frequency). – Rei Oct 08 '17 at 21:46
  • @Rei Filter according to what? It seems you don't know the saved-search scenario. – Yuseferi Oct 09 '17 at 05:33
  • From your description, it sounds like Google Alerts, only the alert is on new content that entered your database. Am I wrong? – Rei Oct 09 '17 at 17:50
  • @Rei , yes exactly. – Yuseferi Oct 09 '17 at 18:14
  • Is that a confirmation that I am wrong? – Rei Oct 09 '17 at 18:21
  • @Rei No, you're right; what we want is exactly something like Google Alerts: a user saves their advanced-search criteria, and when new matching data is available for them, we notify them. – Yuseferi Oct 09 '17 at 18:45
  • "at least 200,000 Saved-search Per day" -- 200K new ones per day? 200K need to be performed per day? What? – Rick James Oct 11 '17 at 12:46
  • @RickJames Yes and yes. We have 2M unique visits daily with 15 pages per session; consider 6M active saved searches. – Yuseferi Oct 11 '17 at 21:09
  • 42 million added to the list per week? And how often do you want to re-evaluate each of them? How many ES machines are you running? – Rick James Oct 11 '17 at 23:41
  • @RickJames 4.2 million per week. We have 2 clusters, 8 data nodes, and 4 master nodes; each machine has 64 GB of RAM and a powerful CPU. – Yuseferi Oct 12 '17 at 09:30
  • With your current server configuration, how many queries can you execute per second? 1000? 2000 per second? – Rei Oct 30 '17 at 18:09

4 Answers

1

Cron job - No. Continual job - yes.

Why? As things scale, or as activity spikes, cron jobs become problematic. If the cron job for 09:00 runs too long, it will compete for resources with the 10:00 instance; this can cascade into a disaster.

On the other hand, if a cron job finishes 'early', then the activity oscillates between "busy" (the cron job is doing stuff) and "not busy" (cron has finished, and it is not yet time for the next invocation).

So, instead, I suggest a job that continually runs through all the "stored queries", doing them one at a time. When it finishes the list, it simply starts over. This completely eliminates my complaints about cron, and provides an automatic "elasticity" to handle busy/not-busy times -- the scan will slow down or speed up accordingly.

When the job finishes the list, it starts over on the list. That is, it runs 'forever'. (You could use a simple cron job as a 'keep-alive' monitor that restarts it if it crashes.)

OK, "one job" re-searching "one at a time" is probably not best. But I disagree with using a queuing mechanism. Instead, I would have a small number of processes, each acting on some chunk of the stored queries. There are many ways: grab-and-lock; gimme a hundred to work on; modulo N; etc. Each has pros and cons.
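A minimal sketch of the modulo-N variant; the callables and field names here are placeholders standing in for the real ES query and notification steps, not an actual implementation:

```python
def scan_once(worker_id, num_workers, saved_searches, run_search, notify):
    """One pass over this worker's slice of the saved searches (the
    "modulo N" chunking mentioned above). run_search and notify are
    injected so the loop logic stays independent of ES and RabbitMQ."""
    notified = 0
    for search in saved_searches:
        if search["id"] % num_workers != worker_id:
            continue  # a different worker owns this saved search
        if run_search(search):  # truthy when the stored query has new hits
            notify(search["user_id"], search["id"])
            notified += 1
    return notified
```

Wrapping `scan_once` in a `while True:` loop, restarted by a keep-alive, gives the continual job described above.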

Rick James
  • Is it feasible? At minimum 3M records will be stored in MySQL; a cron job runs each record's query string against Elasticsearch and, if there are any results, notifies the user. – Yuseferi Oct 08 '17 at 18:25
  • No cron jobs. Just one (or more) continually running jobs. (OK, you could use cron simply as a "keep-alive" to restart the jobs if they die.) I added a paragraph to that effect. – Rick James Oct 08 '17 at 18:27
  • At minimum 6M records, meaning 6 million queries to ES; does that make sense? I'm looking for another solution with better performance. – Yuseferi Oct 09 '17 at 05:35
  • @zhilevan For this amount you have to queue and process the queries as quickly as possible. As I wrote in another comment: write a service that accepts the query, asks ES, writes the result into the database, and notifies the user. And separate it from the web server. Choose a language that supports asynchronous communication very well. – bato3 Oct 09 '17 at 13:22
  • 6M queries per fortnight? per second? per year? I am suggesting optimizing the processing, then accepting the rate you can get. – Rick James Oct 09 '17 at 21:29
  • OK, a large number of ES servers to handle a large number of searches. I still think "continual", not "cron", is the way to go. But all I can say about speed is "as fast as possible". – Rick James Oct 12 '17 at 23:50
1

Because you are already using Elasticsearch and you have confirmed that you are creating something like Google Alerts, the most straightforward solution would be Elasticsearch Percolator.

From the official documentation, Percolator is useful when:

You run a price alerting platform which allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget and I want to be notified if the price of gadget falls below $X from any vendor within the next month". In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.

I can't say much about performance, because you did not provide any examples of your queries, but mostly because my findings are inconsistent.

According to this post (https://www.elastic.co/blog/elasticsearch-queries-or-term-queries-are-really-fast), Elasticsearch queries should be capable of reaching 30,000 queries/second. However, this unanswered question (Elasticsearch percolate performance) reported a painfully slow 200 queries/second on a 16 CPU server.

With no additional information I can only guess that the cause is configuration problems, so I think you'll have to try a bunch of different configurations to get the best possible performance. Good luck!
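For reference, the Percolator flow can be sketched as plain request bodies. The index layout, field names, and ES-7-style typeless mapping below are assumptions for illustration:

```python
# Mapping for a saved-search index: stored queries live in a "percolator"
# field alongside the post fields those queries are allowed to reference.
saved_search_mapping = {
    "mappings": {
        "properties": {
            "query": {"type": "percolator"},
            "title": {"type": "text"},    # post fields referenced by queries
            "price": {"type": "float"},
        }
    }
}

# Registering one user's saved search means indexing the query as a document.
example_saved_search = {
    "user_id": 42,
    "query": {
        "bool": {
            "must": [{"match": {"title": "iphone"}}],
            "filter": [{"range": {"price": {"lte": 500}}}],
        }
    },
}

def build_percolate_request(post):
    """Reverse search: ask ES which stored queries match this new post,
    so only those users get a RabbitMQ notification."""
    return {"query": {"percolate": {"field": "query", "document": post}}}
```

Each new post then triggers one percolate search instead of millions of forward searches, which is exactly the shape of the second suggested solution in the question.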

patrick
0

This answer was written without a true understanding of the implications of a "saved search". I leave it here as discussion of a related problem, but not as a "saved search" solution. -- Rick James

If you are saving only the "query", I don't see a problem. I will assume you are saving both the query and the "resultset"...

One "saved search" per second? 2.4M rows? Simply rerun the search when needed. The system should be able to handle that small a load.

Since the data is changing, the resultset will become outdated soon. How soon? That is, a saved resultset needs to be purged rather quickly. Surely the data is not so static that you can wait a month. Maybe an hour?

Actually saving the resultset and being able to replay it involves (1) complexity in your code, (2) overhead in caching, I/O, etc, etc.

What is the average number of times that the user will look at the same search? Because of the overhead I just mentioned, I suspect the average number of times needs to be more than 2 to justify the overhead.

Bottomline... This smells like "premature optimization". I recommend

  1. Build the site without saving resultsets.
  2. Stress test it to see when it will break.
  3. Work on optimizing the slow parts.

As for RabbitMQ -- "Don't queue it, just do it". The cost of queuing and dequeuing is (1) increased latency for the user and (2) increased overhead on system. The benefit (at your medium scale) is minimal.

If you do hit scaling problems, consider

  • Move clients off to another server -- away from the database. This will give you some scaling, but not 2x. To go farther...
  • Use replication: One Master + many readonly Slaves -- and do the queries on the Slaves. This gives you virtually unlimited scaling in the database.
  • Have multiple web servers -- virtually unlimited scaling in this part.
Rick James
  • Thanks for your attention and time. We have a website, we have advanced search; the only thing we want to implement is the saved-search mechanism that lets users save their search criteria so we can inform them when a new matching item is available on our site. I think you didn't get what I mean by saved-search. – Yuseferi Oct 08 '17 at 17:44
  • True, I did not understand the question. Now it gets a lot messier -- either rerun the query repeatedly, or somehow drill into the data to discover which query results will change due to the addition of an item. Either way, quite costly. – Rick James Oct 08 '17 at 18:07
  • Yes, exactly. Our bottleneck is running big queries periodically and notifying users when there is data matching their saved-search criteria. – Yuseferi Oct 08 '17 at 18:17
-2

I don't understand why you want to use saved search... First: you should optimize the service so that saved search is needed as little as possible.

What have you done with the ES server so far? (What can you afford?) So:

  1. Have you optimized the Elasticsearch server? By default it uses 1 GB of RAM. The best approach is to give it half the machine's RAM, but no more than 16 GB (if I remember correctly; check the docs).
  2. How powerful is the ES machine? It prefers more cores to higher clock speeds.
  3. How many ES nodes do you have? You can always add another machine to get the results faster.
  4. In my case (ES 2.4), the server slows down after a few days, so I restart it once a day.

And next:

  1. Why do you want to fire up tasks every half hour? If you already use cron, fire them every minute and flag that the query is running. The other post here gives a better solution and an explanation.
  2. Why do you separate the result from the query?
  3. Remember to normalize the query, so that a change in parameter order does not force a new query.
  4. Why do you want to use MySQL to store results? A document database is better suited for this, like Elasticsearch xD.
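Point 3 (query normalization) might look like this sketch: canonicalize the criteria before hashing, so two saves of the same filters in a different order collapse into one stored query (the field names are illustrative):

```python
import hashlib
import json

def saved_search_key(criteria):
    """Canonical fingerprint of a search: sort dict keys and any
    list-valued filters so parameter order never creates a 'new' query."""
    canonical = {
        key: sorted(value) if isinstance(value, list) else value
        for key, value in criteria.items()
    }
    blob = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Deduplicating on this key also shrinks the number of distinct queries the background job has to run.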

I propose you:

  1. Optimize the ES structure - choose the right tokenizers for your fields.
  2. Use asynchronous loading of results - eg WebSocket + Node.js
bato3
  • Thanks for your attention. By the way, I'm looking for a solution, a scenario that can handle this high scale. Our Elastic servers are good (64 GB RAM); can you provide a strategy that handles the situation I described? – Yuseferi Oct 09 '17 at 12:18
  • @zhilevan But you insisted on these saved queries. My answer is: do everything to avoid this. ES is designed to work in a cluster; its recommended configuration is 2 data nodes, and if you want better performance you add another machine. And I did not ask about the machine, only whether you have done everything possible with the ES configuration: how much do you have in `ES_HEAP_SIZE`? I cannot imagine a query on your machine (and that amount of data) lasting longer than 5 sec. And that deserves the message "Preparing the results", loading them with WS. – bato3 Oct 09 '17 at 12:55
  • @zhilevan The use of saved search is only an additional problem. You need to handle the lookup of saved results (use a query hash, or store them in ES), a query queue (preferably a service that receives the query and saves the results in the DB), user notification (preferably integrated with the previous service), and a garbage collector for results. And convince users that they should use it instead of looking for pages where results are available immediately. I suggest using several data nodes and 1 master without data, on a machine other than the one where the web server is running. (The binary protocol is faster than REST.) – bato3 Oct 09 '17 at 13:12