I am working on a contract module for my project. The contract templates are stored as RTF files containing many placeholders with the syntax @placeholder_name@. Each campaign entry is associated with one particular contract template at any given time. When the contract for a campaign is requested/downloaded:

  1. the RTF template is read as a variable.
  2. placeholders within the file variable are replaced with the values from the query objects for the campaign.
  3. the variable is then sent to the browser for download using cfcontent.
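
For illustration, the three steps above might look roughly like the sketch below. The template path, the `campaign` struct (standing in for the query row) and the download file name are all hypothetical, and values are inserted as-is without any RTF escaping.

```cfml
<cfscript>
  // Hypothetical inputs: a template path and a struct standing in for the campaign query row.
  templatePath = expandPath( './templates/contract.rtf' );
  campaign     = { campaign_name = 'Example campaign', additional_contract_info = '' };

  // 1. Read the RTF template into a variable.
  rtf = fileRead( templatePath );

  // 2. Replace each @placeholder_name@ with its value.
  //    replaceNoCase is used because struct keys may be upper-cased by the engine.
  for ( key in campaign ) {
    rtf = replaceNoCase( rtf, '@' & key & '@', campaign[ key ], 'all' );
  }
</cfscript>

<!--- 3. Send the result to the browser as a download (cfcontent's variable attribute expects binary data). --->
<cfheader name="Content-Disposition" value="attachment; filename=contract.rtf">
<cfcontent type="application/rtf" variable="#charsetDecode( rtf, 'utf-8' )#" reset="true">
```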

Problem

I need to remove a whole section from the RTF file if the value for a particular placeholder is empty, for example the Additional Information section.

I was able to find the following RTF block inside the file, which makes up the whole Additional Information section, including the RTF table styling.

\par \ltrrow}\trowd \irow0\irowband0\lastrow \ltrrow\ts78\trgaph108\trleft-
810\trbrdrt\brdrdot\brdrw10 \trbrdrl\brdrdot\brdrw10 \trbrdrb\brdrdot\brdrw10 \trbrdrr\brdrdot\brdrw10 \trbrdrh\brdrdot\brdrw10 \trbrdrv\brdrdot\brdrw10     \trftsWidth3\trwWidth11520\trftsWidthB3\trftsWidthA3\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddft3\trpaddfb3\trpaddfr3\tblrsid12942116\tbllkhdrrows\tbllkhdrcols\tbllknocolband\tblind-702\tblindtype3 \clvertalt\clbrdrt\brdrdot\brdrw10 \clbrdrl \brdrdot\brdrw10 \clbrdrb\brdrdot\brdrw10 \clbrdrr\brdrdot\brdrw10 \cltxlrtb\clftsWidth3\clwWidth3510\clshdrawnil \cellx2700\clvertalt\clbrdrt\brdrdot\brdrw10 \clbrdrl\brdrdot\brdrw10 \clbrdrb\brdrdot\brdrw10 \clbrdrr\brdrdot\brdrw10 
\cltxlrtb\clftsWidth3\clwWidth8010\clshdrawnil \cellx10710\pard \ltrpar\ql \li0\ri0\sa200\widctlpar\intbl\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\pararsid4544034 {\rtlch\fcs1 \af1 \ltrch\fcs0 \insrsid3568873 Additional Information}{
\rtlch\fcs1 \af1 \ltrch\fcs0 \insrsid4544034 \cell }\pard \ltrpar\ql \li0\ri0\sa200\widctlpar\intbl\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0 {\rtlch\fcs1 \af1 \ltrch\fcs0 \insrsid3568873\charrsid4544034 @additional_contract_info@}{
\rtlch\fcs1 \af1 \ltrch\fcs0 \insrsid4544034 \cell }\pard \ltrpar\ql \li0\ri0\widctlpar\intbl\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0 {\rtlch\fcs1 \af1 \ltrch\fcs0 \insrsid4544034 \trowd \irow0\irowband0\lastrow \ltrrow
\ts78\trgaph108\trleft-810\trbrdrt\brdrdot\brdrw10 \trbrdrl\brdrdot\brdrw10 \trbrdrb\brdrdot\brdrw10 \trbrdrr\brdrdot\brdrw10 \trbrdrh\brdrdot\brdrw10 \trbrdrv\brdrdot\brdrw10 
\trftsWidth3\trwWidth11520\trftsWidthB3\trftsWidthA3\trautofit1\trpaddl108\trpaddr108\trpaddfl3\trpaddft3\trpaddfb3\trpaddfr3\tblrsid12942116\tbllkhdrrows\tbllkhdrcols\tbllknocolband\tblind-702\tblindtype3 \clvertalt\clbrdrt\brdrdot\brdrw10 \clbrdrl
\brdrdot\brdrw10 \clbrdrb\brdrdot\brdrw10 \clbrdrr\brdrdot\brdrw10 \cltxlrtb\clftsWidth3\clwWidth3510\clshdrawnil \cellx2700\clvertalt\clbrdrt\brdrdot\brdrw10 \clbrdrl\brdrdot\brdrw10 \clbrdrb\brdrdot\brdrw10 \clbrdrr\brdrdot\brdrw10 
\cltxlrtb\clftsWidth3\clwWidth8010\clshdrawnil \cellx10710\row }\pard \ltrpar\ql \li0\ri0\sa200\widctlpar\wrapdefault\aspalpha\aspnum\faauto\adjustright\rin0\lin0\itap0 {\rtlch\fcs1 \af1 \ltrch\fcs0 \insrsid4544034 
\par }

I have been trying to find a solution for days now. What I need is a regular expression in ColdFusion that finds the \par block wrapped around the placeholder @additional_contract_info@, i.e. only the parent paragraph of the placeholder:

the portion: "\par ...@additional_contract_info@ ...." until the ending \par

Assume the paragraphs are not nested.

I am not very proficient with regular expressions. I tried Googling and searching SO for all kinds of related questions but couldn't solve it. I need help with it!

Anurag
  • Do you want to find the content between the `@additional_contract_info@` placeholders (used as start and end markers) in the section you were able to find? – Pankaj Nov 10 '15 at 11:36
  • I initially used RTF files too. If the templates are modified, the newly generated code may not work if invisible mark-up is added between your placeholders. I've switched to generating MSHTML (HTML w/some minor Microsoft-only HTML) and saving the file with a DOC extension. (NOTE: Using this method enables online editing of the mergeable templates using CKEditor. JSOUP could also be integrated to add/remove entire sections.) – James Moberg Nov 10 '15 at 14:42
  • @JamesMoberg I really appreciate the suggestion. I will study MSHTML templating. I would like to stick to RTF, as it is a business decision for the client to make. – Anurag Nov 10 '15 at 19:45
  • @Anurag Once I showed the quality & ease of updating, my client was eager to switch. RTF files have no security, so just like an HTML file, the contents can be modified by anyone. Using HTML means that the pseudo-DOC file can be easily converted to PDF on-the-fly using CFDocument or displayed in the browser. (Both of these features aren't easily performed by ColdFusion if the file is RTF.) To generate the initial MSHTML file, save any file as HTML and then rename the extension to DOC. I did this so I could identify basic orientation, margin & header/footer rules. – James Moberg Nov 11 '15 at 00:13
  • I will definitely suggest the many benefits of MSHTML templating, thank you for the insight @James Moberg :). I hope the client decides to make a move. – Anurag Nov 11 '15 at 05:30

1 Answer

Try:

\\par\b((?!\\par\b).)*@additional_contract_info@.*?\\par\b
  • \b matches a word boundary so you don't match \pard.
  • (?!\\par\b). will first do a negative look-ahead to ensure that there are no other instances of \par between the start of the match and the @ then will consume a single character. Repeating this will match the entire string between the most recent \par and the @.
  • After the final @ you can just use a non-greedy wildcard match .*? (so it will only match the minimum number of characters) to find the ending paragraph code.
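
A small sketch of the word-boundary point, using a made-up sample string: without \b the pattern also matches the first four characters of \pard, while with \b only the standalone \par control word matches. REFind returns the 1-based position of the first match.

```cfml
<cfscript>
  // Hypothetical sample: one \pard followed by one \par.
  sample = '\pard text \par }';

  // Without \b the pattern matches the start of \pard.
  WriteOutput( REFind( '\\par', sample ) );    // 1

  // With \b only the \par that is followed by a non-word character matches.
  WriteOutput( REFind( '\\par\b', sample ) );  // 12
</cfscript>
```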

Example:

<cfscript>
  str = '\par \par \pard text \par \pard text @additional_contract_info@ text \pard \par text \pard \par } \par }';
  output = REReplace( str, '\\par\b((?!\\par\b).)*@additional_contract_info@.*?\\par\b', '' );
  WriteOutput( output );
</cfscript>

Should output:

\par \par \pard text  text \pard \par } \par }

Update:

You can also try doing it without regular expressions:

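<cfscript>
  str      = '\par \par \pard text \par \pard text @additional_contract_info@ text \pard \par text \pard \par } \par }';
  // 1-based position of the placeholder.
  pos      = find( '@additional_contract_info@', str );
  // 1-based position of the space that follows the closing \par.
  endPos   = find( '\par ', str, pos ) + 4;
  // lastIndexOf is a Java String method, so this is the 0-based index of the opening \par.
  startPos = left( str, pos ).lastIndexOf( '\par ' );
  // Keep everything before the opening \par and everything from endPos onward.
  output   = left( str, startPos ) & right( str, len( str ) - endPos + 1 );
  WriteOutput( output );
</cfscript>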

(Note: this assumes that every \par is followed by a space, whereas the regular expression looks for a word boundary. If that is not the case in your files, you may need to find the boundaries of the text to remove in some other way.)
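
If it helps, the same string-search idea can be wrapped in a function with a few guards, so that the input is returned unchanged when the placeholder or an enclosing \par pair is missing. This is only a sketch: `removeParagraphAround` and `additionalInfo` are hypothetical names, and it still assumes every \par is followed by a space.

```cfml
<cfscript>
  // Hypothetical helper: removes the \par ... \par block around a placeholder,
  // or returns the input unchanged when the block cannot be located.
  function removeParagraphAround( required string rtf, required string placeholder ) {
    var marker = '\par ';
    var pos    = find( arguments.placeholder, arguments.rtf );
    if ( pos == 0 ) {
      return arguments.rtf;   // placeholder not present
    }
    // 1-based position of the closing \par after the placeholder (0 if none).
    var endStart = find( marker, arguments.rtf, pos );
    // 0-based index of the opening \par before the placeholder (-1 if none).
    var startIdx = left( arguments.rtf, pos ).lastIndexOf( marker );
    if ( endStart == 0 || startIdx < 0 ) {
      return arguments.rtf;   // no enclosing pair of \par markers
    }
    // left() rejects a count of 0, so handle a block that starts the string.
    var head = startIdx > 0 ? left( arguments.rtf, startIdx ) : '';
    // Keep everything from the space that follows the closing \par.
    return head & mid( arguments.rtf, endStart + 4, len( arguments.rtf ) );
  }

  rtf            = '\par \par \pard text \par \pard text @additional_contract_info@ text \pard \par text \pard \par } \par }';
  additionalInfo = '';   // hypothetical value taken from the campaign query

  if ( len( trim( additionalInfo ) ) == 0 ) {
    // Empty value: drop the whole paragraph that holds the placeholder.
    rtf = removeParagraphAround( rtf, '@additional_contract_info@' );
  } else {
    // Otherwise fill the placeholder in as usual.
    rtf = replace( rtf, '@additional_contract_info@', additionalInfo, 'all' );
  }
  WriteOutput( rtf );
</cfscript>
```

With the sample string and an empty value, this should produce the same output as the regular-expression example above.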

MT0
  • I just implemented the solution on the test server and got a `java.lang.StackOverflowError`; exception.log gives the source as `org.apache.oro.text.regex.Perl5Matcher.__match(Unknown Source)`. The exception was only thrown for big files (1.5 MB+). I tried increasing the stack and memory settings for the server, but that did not help. Can you share your thoughts on this? – Anurag Nov 13 '15 at 07:49
  • @Anurag Updated with a non-regexp solution that will (hopefully) not have those issues. – MT0 Nov 13 '15 at 10:03