We'd like to use PostgreSQL to store documents. Since some of them may be up to 2 GB in size, we have to use large objects, which are stored in a separate table (pg_largeobject) and referenced by an OID, as described in the documentation on large objects.

It is very important to us that these documents (.pdf, .doc, ...) can be searched. PostgreSQL's built-in full-text search can search tables and columns, but is it possible to search the large objects in the pg_largeobject table as well?

If not, we have to use Oracle.

vera
  • Have you considered using only Solr instead of another RDBMS? Solr supports indexing PDF and DOC files + faceted search. You can use Solr as a NoSQL datastore, that is, store the actual PDF/DOC files directly in Solr, as far as I've read. I have no idea about 2 GB files though. – KajMagnus Jul 13 '13 at 09:01
  • You could use the Apache Tika library to get the plain text of the PDF or Word documents and put it in another column. Heck, you could probably do it right in the database using PL/Java – Neil McGuigan Jan 17 '14 at 23:58

2 Answers

There are at least two issues here.

  1. Full-text search doesn't work on large objects referenced via lob/oid columns. You cannot full-text index the contents of pg_largeobject.

  2. Full-text search is an indexing system for text. It cannot index PDF, Microsoft Word documents, or other random binary files. It does not have provision for text-extraction tool callbacks, etc.

You can either:

  • Create a table that contains text extracted from those files using external tools, along with an oid that refers to the file itself, then full-text index that table of extracted text (see the sketch after this list); or

  • Use a more powerful, full-featured external search system like Solr (based on Lucene) that's designed to cope with varying formats, do its own text extraction, etc.
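
A minimal sketch of the first option in plain SQL (table and column names are illustrative, and the text is assumed to have been extracted by an external tool such as pdftotext or Apache Tika):

    -- One row per document: the original file as a large object reference,
    -- plus the plain text extracted from it outside the database.
    CREATE TABLE doc_text (
        doc_id    serial PRIMARY KEY,
        file_oid  oid  NOT NULL,   -- points at the blob in pg_largeobject
        extracted text NOT NULL    -- output of the external extraction tool
    );

    -- Full-text index over the extracted text only; the binary file is untouched.
    CREATE INDEX doc_text_fts_idx ON doc_text
        USING gin (to_tsvector('english', extracted));

    -- Searches must use the same expression as the index so it can be used.
    SELECT doc_id, file_oid
      FROM doc_text
     WHERE to_tsvector('english', extracted) @@ to_tsquery('english', 'invoice & total');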

Craig Ringer

In PostgreSQL, full-text search is mostly based on one very important function, to_tsvector(). This function converts a text document into the tsvector data type. Typically, a tsvector's physical size is about 1% of the original text.
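
For example (output reformatted for readability; stop words are dropped and the remaining words are reduced to lexemes with their positions):

    SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dog');

    --                      to_tsvector
    -- ------------------------------------------------------
    --  'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2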

All further text-search methods are built on indexes over computed or physically stored tsvector values. With this in mind, you can compute the necessary tsvector values anywhere (even on a client!) and save them in specially created column(s) for indexing purposes. In other words, you don't really have to store your original texts in the database: a tsvector is all you need to tell which document contained the text you are searching for (though it won't tell you the location within a single document). What's cool about this is that the indexed document can be anything, even Word, Excel, or PDF, as long as you can extract its textual content and compute the tsvector to be stored and indexed.
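
A sketch of that approach (all names here are illustrative): the table stores only the search vector, which the client computed from text it extracted itself.

    -- Only the precomputed search vector is stored; the original file can
    -- live anywhere (filesystem, large object, another system entirely).
    CREATE TABLE documents (
        id   serial   PRIMARY KEY,
        name text     NOT NULL,
        tsv  tsvector NOT NULL    -- computed client-side from the extracted text
    );

    CREATE INDEX documents_tsv_idx ON documents USING gin (tsv);

    -- The client ships the vector; it is computed in SQL here only for brevity.
    INSERT INTO documents (name, tsv)
    VALUES ('report.pdf', to_tsvector('english', 'text extracted from the PDF'));

    -- Search against the stored vector.
    SELECT id, name
      FROM documents
     WHERE tsv @@ plainto_tsquery('english', 'extracted text');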

One small downside of recomputing tsvectors (especially on the client side) is that you have to recompute them every time you change the original document.
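
If the extracted text is itself kept in a column, that recomputation can instead be pushed into the database with the built-in trigger function tsvector_update_trigger(), so the vector can never go stale. A sketch, reusing the hypothetical documents table from above:

    -- Keep the extracted text alongside the vector so the server can
    -- recompute tsv automatically on every INSERT or UPDATE.
    ALTER TABLE documents ADD COLUMN body text;

    CREATE TRIGGER documents_tsv_update
        BEFORE INSERT OR UPDATE ON documents
        FOR EACH ROW
        EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', body);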

mvp