Here is a sample header as defined under RFC 822, RFC 2822 and MIME. I want to build full-text search over such messages using Lucene. If I use the standard analyzer, it will create too many useless tokens, which will degrade performance. Is there a way to produce better tokens by writing a custom analyzer and tokenizer?

From webmaster@email.marketingmag.ca

Microsoft Mail Internet Headers Version 2.0

Received: from sdlasd02.medicis.com ([172.23.163.35]) by mpc-exchange.medicis.com with Microsoft SMTPSVC(6.0.3790.3959); Mon, 1 Jun 2009 04:30:59 -0700

Received: from mail pickup service by sdlasd02.medicis.com with Microsoft SMTPSVC; Mon, 1 Jun 2009 04:30:59 -0700

Received: from SDLMAIL01.medicis.com ([98.175.1.32]) by sdlasd02.medicis.com with Microsoft SMTPSVC(6.0.3790.1830); Mon, 1 Jun 2009 04:30:59 -0700

Return-Path: bo-buhbpmfbpgh9f6axbzpa2ae1achzvh@b.email.marketingmag.ca

X-CTCH-ID: CFBA793F-FB3C-4DEB-A504-C6165B493680

X-CTCH-RefID: str=0001.0A090202.4A23BBF3.009A,ss=1,fgs=0

X-CTCH-Action: Ignore

Princesh

1 Answer

You typically add one field per header you are interested in keeping (such as Date, Message-ID, From, etc.) and ignore the rest. Each field will be of the relevant type and analyzed accordingly.
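As a rough illustration of the "one field per header" idea, here is a minimal Python sketch that uses only the standard library `email` package to pull out a whitelist of headers and normalize each one by type before it is handed to the indexer. The `KEEP` whitelist, the `extract_fields` helper and the `Date` line in the sample are assumptions made for the example (the headers in the question do not include a `Date` line); the indexing step itself is not shown.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical whitelist: only these headers become index fields.
KEEP = ("From", "To", "Subject", "Date", "Message-ID", "Return-Path")
ADDRESS_HEADERS = ("From", "To", "Return-Path")


def extract_fields(raw_headers):
    """Return {header name: normalized value} for whitelisted headers only."""
    msg = message_from_string(raw_headers)
    fields = {}
    for name in KEEP:
        value = msg.get(name)
        if value is None:
            continue
        if name == "Date":
            # Store dates in a sortable ISO 8601 form rather than as free text.
            try:
                fields[name] = parsedate_to_datetime(value).isoformat()
            except (TypeError, ValueError):
                fields[name] = value.strip()  # keep the raw value if unparsable
        elif name in ADDRESS_HEADERS:
            # Keep only the bare address, lower-cased, as a single exact-match term.
            fields[name] = parseaddr(value)[1].lower()
        else:
            fields[name] = value.strip()
    return fields


# Headers taken from the question, plus an assumed Date line for illustration.
sample = (
    "Received: from mail pickup service by sdlasd02.medicis.com "
    "with Microsoft SMTPSVC; Mon, 1 Jun 2009 04:30:59 -0700\n"
    "Return-Path: bo-buhbpmfbpgh9f6axbzpa2ae1achzvh@b.email.marketingmag.ca\n"
    "X-CTCH-Action: Ignore\n"
    "Date: Mon, 1 Jun 2009 04:30:59 -0700\n"
    "\n"
)

print(extract_fields(sample))
```

Headers such as `Received` and the `X-CTCH-*` lines never reach the index, so they produce no tokens at all. Each remaining value could then go into its own Lucene field with a suitable analyzer, for example an untokenized field for addresses and a date field for `Date`.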

Persimmonium
  • Thank you for the quick response, but even if you store one field per header, its value can still contain junk. Also, RFC 2822 just states that a value may contain any ASCII characters. – Princesh Oct 17 '12 at 23:40