SQL Server fulltext can index varbinary and image columns.
You can see the list of all file types currently supported by SQL Server:
SELECT * FROM sys.fulltext_document_types
For example:
| document_type | class_id | path | version | manufacturer |
|---------------|--------------------------------------|----------------------------------------------------------------------------------|-------------------|-----------------------|
| .doc | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll | 2008.0.9200.16384 | Microsoft Corporation |
| .txt | C7310720-AC80-11D1-8DF3-00C04FB6EF4F | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\msfte.dll | 12.0.6828.0 | Microsoft Corporation |
| .xls | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll | 2008.0.9200.16384 | Microsoft Corporation |
| .xml | 41B9BE05-B3AF-460C-BF0B-2CDD44A093B1 | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\xmlfilt.dll | 12.0.9735.0 | Microsoft Corporation |
When creating the varbinary (or image) column to contain your binary file, you must have another string column that gives the file type through its extension (e.g. ".doc")
CREATE TABLE Documents (
DocumentID int IDENTITY,
Filename nvarchar(32000),
Data varbinary(max),
DataType varchar(50) --contains the file extension (e.g. ".docx", ".pdf")
)
When adding the binary column to the full-text index SQL Server needs you to tell it which column contains the data type string:
ALTER FULLTEXT INDEX ON [dbo].[Documents]
ADD ([Data] TYPE COLUMN [DataType])
You can test by importing a binary file from the filesystem on the server:
INSERT INTO Documents(filename, DataType, data)
SELECT
'Managing Storage Spaces with PowerShell.doc' AS Filename,
'.doc', *
FROM OPENROWSET(BULK N'C:\Managing Storage Spaces with PowerShell.doc', SINGLE_BLOB) AS Data
You can view the catalog status using:
DECLARE @CatalogName varchar(50);
SET @CatalogName = 'Scratch';
SELECT
CASE FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus')
WHEN 0 THEN 'Idle'
WHEN 1 THEN 'Full population in progress'
WHEN 2 THEN 'Paused'
WHEN 3 THEN 'Throttled'
WHEN 4 THEN 'Recovering'
WHEN 5 THEN 'Shutdown'
WHEN 6 THEN 'Incremental population in progress'
WHEN 7 THEN 'Building index'
WHEN 8 THEN 'Disk is full. Paused.'
WHEN 9 THEN 'Change tracking'
ELSE 'Unknown'
END+' ('+CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus') AS varchar(50))+')' AS PopulateStatus,
FULLTEXTCATALOGPROPERTY(@CatalogName, 'ItemCount') AS ItemCount,
CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'IndexSize') AS varchar(50))+ ' MiB' AS IndexSize,
CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'UniqueKeyCount') AS varchar(50))+' words' AS UniqueKeyCount,
FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAge') AS PopulateCompletionAge,
DATEADD(second, FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAGe'), 0) AS PopulateCompletionDate
And you can query the catalog:
SELECT * FROM Documents
WHERE FREETEXT(Data, 'Bruce')
Additional IFilters
SQL Server has a limited set of built-in filters. It can also use IFilter
implementations registered on the system (e.g. Microsoft Office 2010 Filter Pack that provides docx, msg, one, pub, vsx, xlsx and zip support).
You must enable the use of the OS-level filters by enabling the option:
sp_fulltext_service 'load_os_resources', 1
and restart the SQL Server service.
load_os_resources int
Indicates whether operating system word breakers, stemmers, and filters are registered and used with this instance of SQL Server. One of:
0: Use only filters and word breakers specific to this instance of SQL Server.
1: Load operating system filters and word breakers.
By default, this property is disabled to prevent inadvertent behavior changes by updates made to the operating system. Enabling use of operating system resources provides access to resources for languages and document types registered with Microsoft Indexing Service that do not have an instance-specific resource installed. If you enable the loading of operating system resources, ensure that the operating system resources are trusted signed binaries; otherwise, they cannot be loaded when verify_signature is set to 1.
If using SQL Server before SQL Server 2008, you must also restart the Full-Text indexing service after enabling this option:
net stop msftesql
net start msftesql
Microsoft provides filter packs contain IFilter
for the Office 2007 file types:
And Adobe provides an IFilter
for indexing PDFs (Foxit provides one, but theirs is not free):
Bonus Reading