
I have been an Apache Solr user for about a year. I have used Solr for simple search tools, but now I want to use it with 5 TB of data. I estimate that the 5 TB will grow to about 7 TB once Solr indexes it, given the filters I use. After that, I will add roughly 50 MB of data per hour to the same index.

1- Are there any problems with using a single Solr server for 5 TB of data (without shards)?

  • a- Can the Solr server answer queries in an acceptable time?

  • b- What is the expected time for committing 50 MB of data to a 7 TB index?

  • c- Is there an upper limit on index size?

2- What suggestions do you have?

  • a- How many shards should I use?

  • b- Should I use Solr cores?

  • c- What commit frequency do you recommend? (Is 1 hour OK? See the config sketch below.)

3- Are there any test results for this kind of large data set?


The 5 TB of data isn't available yet; I just want to estimate what the result will be.

Note: You can assume that hardware resources are not a problem.
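
For context, this is roughly the autoCommit setup I have in mind for solrconfig.xml (a sketch; the one-hour interval is just my current guess, not a tested value):

    <!-- solrconfig.xml (sketch): let Solr hard-commit on a schedule
         instead of committing explicitly from the indexing client -->
    <autoCommit>
      <!-- at most one hard commit per hour (3,600,000 ms) -->
      <maxTime>3600000</maxTime>
      <!-- flush to disk without reopening a searcher on every hard commit -->
      <openSearcher>false</openSearcher>
    </autoCommit>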

– Mustafa
A question for you. Assuming you are indexing 5TB of raw data, why do you think it will grow to 7TB? Should I take this to mean that you are storing the full document content in the index as well, as opposed to just storing the search fields? If so, I would suggest only storing what you need for searching in Solr. The raw documents themselves belong elsewhere. – rfeak Jan 14 '12 at 04:10

1 Answer


If your sizes are for text, rather than binary files (whose extracted text is usually much smaller), then I don't think you can expect to do this on a single machine.

This sounds a lot like Loggly, and they use SolrCloud to handle that amount of data.
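
As a sketch, creating a sharded collection through the SolrCloud Collections API looks something like this (the host, collection name, and shard/replica counts here are placeholders; you'd have to size them from tests on your real data):

    # create a collection split across 8 shards, with 2 copies of each shard
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=bigdata&numShards=8&replicationFactor=2"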

OK, if these are all rich documents, then the total text size to index will be much smaller (for me it's about 7% of the starting size). Even with that reduced amount, though, I think you still have too much data for a single instance.
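
If you go that route, one way to keep the index small (as the comment above also suggests) is to index the extracted text without storing it, and keep the originals elsewhere. A schema.xml sketch, with illustrative field names and types:

    <!-- schema.xml (sketch): make the body text searchable, but don't store it -->
    <field name="content" type="text_general" indexed="true" stored="false"/>
    <!-- store only a lightweight pointer back to the original document -->
    <field name="doc_url" type="string" indexed="false" stored="true"/>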

– Persimmonium