
I am trying to strip MS Word formatting from text pasted into my text area, but I can't figure out how. I need to copy and paste some content from MS Word into a textbox editor. The text copies fine, but all the formatting comes along with it, so my 300-character sentence expands into a 20,000-character formatted one. Can anyone suggest what to do?

OK, after some R&D I have made some progress.

Here's the text that I copied from the Word document:

Once the user clicks on the Cancel icon for a transaction on the Status of Business, and the transaction is eligible for cancellation, a new screen titled “Cancel Transaction” will appear, with the following fields: 

Here's what I get from $("#textAreaId").val():

"

  Normal
  0




  false
  false
  false

  EN-US
  X-NONE
  X-NONE




























Once the user clicks on the Cancel icon for a
transaction on the Status of Business, and the transaction is eligible for
cancellation, a new screen titled “Cancel Transaction” will appear, with the
following fields: 



 /* Style Definitions */
 table.MsoNormalTable
    {mso-style-name:"Table Normal";
    mso-style-parent:"";
    line-height:115%;
    font-size:11.0pt; font-family:"Calibri","sans-serif";
    mso-bidi-font-family:"Times New Roman";}

"
Gautam
  • Can you add the text that should be displayed, please? – skyfoot May 07 '13 at 11:09
  • The text may be anything. What I put in the sample above is actually just the formatting, and it was quite large, so I only included a chunk. The real text that I need to display was way down the page. – Gautam May 07 '13 at 11:10
  • I want to help you, but I don't want to decipher the example you have given to see what should be displayed. I want to see which characters need to be removed. – skyfoot May 07 '13 at 11:13
  • Can I have your email address, please? That would be much easier. – Gautam May 07 '13 at 11:14
  • This may be useful: http://stackoverflow.com/questions/2875027/clean-microsoft-word-pasted-text-using-javascript – András Szepesházi May 07 '13 at 12:14
  • Updated the question with the text that I copied and the value that I get in jQuery. – Gautam May 08 '13 at 06:05

1 Answer


I finally found the solution. Here it is:

// removes MS Office generated guff
function cleanHTML(input) {
  // 1. remove line breaks / Mso classes
  var stringStripper = /(\n|\r| class=(")?Mso[a-zA-Z]+(")?)/g; 
  var output = input.replace(stringStripper, ' ');
  // 2. strip Word generated HTML comments
  var commentStripper = new RegExp('<!--(.*?)-->','g');
  output = output.replace(commentStripper, '');
  // 3. remove tags leave content if any
  var tagStripper = new RegExp('<(/)*(meta|link|span|\\?xml:|st1:|o:|font)(.*?)>','gi');
  output = output.replace(tagStripper, '');
  // 4. Remove everything in between and including tags '<style(.)style(.)>'
  var badTags = ['style', 'script','applet','embed','noframes','noscript'];

  for (var i=0; i< badTags.length; i++) {
    tagStripper = new RegExp('<'+badTags[i]+'.*?'+badTags[i]+'(.*?)>', 'gi');
    output = output.replace(tagStripper, '');
  }
  // 5. remove attributes ' style="..."'
  var badAttributes = ['style', 'start'];
  for (var i=0; i< badAttributes.length; i++) {
    var attributeStripper = new RegExp(' ' + badAttributes[i] + '="(.*?)"','gi');
    output = output.replace(attributeStripper, '');
  }
  return output;
}
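
A quick way to see what the five steps do is to run them on a small Word-like fragment. The sketch below is a condensed restatement of the function above so that it runs on its own; the name `cleanWordHtml` and the `pasted` sample string are made up for illustration and are not real Word output.

```javascript
// Condensed restatement of the five steps above (hypothetical name).
function cleanWordHtml(input) {
  var output = input
    // 1. line breaks and Mso classes become a single space
    .replace(/(\n|\r| class=(")?Mso[a-zA-Z]+(")?)/g, ' ')
    // 2. HTML comments, including Word's conditional comments
    .replace(/<!--(.*?)-->/g, '')
    // 3. wrapper tags are dropped, their content is kept
    .replace(/<(\/)*(meta|link|span|\?xml:|st1:|o:|font)(.*?)>/gi, '');

  // 4. these elements are removed together with their content
  ['style', 'script', 'applet', 'embed', 'noframes', 'noscript'].forEach(function (tag) {
    output = output.replace(new RegExp('<' + tag + '.*?' + tag + '(.*?)>', 'gi'), '');
  });

  // 5. unwanted attributes are removed
  ['style', 'start'].forEach(function (attr) {
    output = output.replace(new RegExp(' ' + attr + '="(.*?)"', 'gi'), '');
  });

  return output;
}

// Made-up sample that imitates the kind of markup Word produces.
var pasted =
  '<p class="MsoNormal" style="margin:0"><span lang="EN-US">Cancel Transaction</span></p>' +
  '<!--[if gte mso 9]><xml>Normal 0</xml><![endif]-->' +
  '<style>table.MsoNormalTable {line-height:115%;}</style>';

console.log(cleanWordHtml(pasted)); // prints: <p >Cancel Transaction</p>
```

The paragraph text survives while the comment, the style block, the span wrapper, and the attributes are gone. Step 1 swaps the class for a space rather than deleting it, which is why a stray space stays behind in `<p >`.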
Gautam
  • Great answer, but it does leave the occasional extraneous space in opening tags (a stray space before the closing `>`). It could be improved by adding the following before the final return statement: `while (output.indexOf(' >') >= 0) { output = output.replace(' >', '>'); }` – Michael Jan 04 '22 at 20:52
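
The cleanup suggested in the last comment can also be sketched as a single global regex replace instead of a loop. This is only one possible variant: `tidyTagSpaces` is a hypothetical helper name, and unlike the loop it also removes tabs and line breaks before `>`, not just plain spaces.

```javascript
// Hypothetical helper: removes whitespace that sits directly before a '>'.
function tidyTagSpaces(html) {
  // \s+ matches one or more whitespace characters; the g flag handles every tag.
  return html.replace(/\s+>/g, '>');
}

console.log(tidyTagSpaces('<p >Cancel Transaction</p>'));  // prints: <p>Cancel Transaction</p>
console.log(tidyTagSpaces('<li   >item</li >'));           // prints: <li>item</li>
```

Like the loop in the comment, this also touches a space before a literal `>` in the text content (for example `a > b` becomes `a> b`), so it is best applied only to markup where that does not matter.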