We have n documents. When a user submits a new document, our goal is to warn them about possible duplicates among the existing documents (much like Stack Overflow suggests that a question may already have an answer).
In our system a new document is uploaded roughly every minute, and most documents are about the same topic (so the chance of duplication is high).
Our current implementation uses a gensim doc2vec model trained on the existing documents (each tagged with its unique document id). For each new document we infer a vector and look up the most_similar documents (ids). We chose doc2vec because we wanted to take advantage of semantics to improve the results. A rough sketch of the pipeline is shown below. As far as we know, doc2vec does not support online training, so we would have to schedule a cron job or something similar to periodically retrain the model. But a scheduled cron is problematic because documents arrive in bursts: a user may upload a duplicate while the model has not yet been retrained on the new data. Also, given the large amount of data, training time will be long.
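For reference, the current pipeline looks roughly like this (`documents` and `new_text` are placeholders for our real data, and the hyperparameters are illustrative; `model.dv` is the gensim 4.x name, `model.docvecs` in 3.x):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# documents: iterable of (doc_id, text) pairs -- placeholder for our corpus
corpus = [TaggedDocument(simple_preprocess(text), [doc_id])
          for doc_id, text in documents]

# train once, offline, on the full corpus
model = Doc2Vec(vector_size=100, min_count=2, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# on each new upload: infer a vector and fetch the nearest existing doc ids
new_vec = model.infer_vector(simple_preprocess(new_text))
candidates = model.dv.most_similar([new_vec], topn=10)  # [(doc_id, cosine_sim), ...]
```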
So I would like to know how such cases are handled at big companies. Is there a better alternative, or a better algorithm for this kind of problem?