31

I am using a contenteditable <div> and have pasting enabled.

It is amazing how much markup gets pasted in when the clipboard content comes from Microsoft Word. I am battling this and have gotten about halfway there using Prototype's stripTags() function (which unfortunately does not seem to let me keep some tags).

However, even after that, I wind up with a mind-blowing amount of unneeded markup code.

So my question is, is there some function (using JavaScript), or approach I can use that will clean up the majority of this unneeded markup?

Todd Main
OneNerd
  • best of luck with this... the content generated from Word (both in pasting, and save as HTML leaves much to be desired) ;-) – scunliffe May 20 '10 at 15:10
  • I asked more or less the same question back then in http://stackoverflow.com/questions/391291/how-do-i-remove-word-markup-crap-when-inserting-to-a-form , but your title is better. Although, why limit yourself to javascript and not consider doing this on the server? – Adriano Varoli Piazza May 20 '10 at 18:46

10 Answers

24

Here is the function I wound up writing that does the job fairly well (as far as I can tell anyway).

I am certainly open to suggestions for improvement if anyone has any. Thanks.

function cleanWordPaste( in_word_text ) {
 // parse the pasted markup in a detached element and read back only its text
 var tmp = document.createElement("DIV");
 tmp.innerHTML = in_word_text;
 var newString = tmp.textContent||tmp.innerText;
 // convert blank lines into break tags, then strip everything on a line
 // up to and including the end of a literal comment left in the text
 newString  = newString.replace(/\n\n/g, "<br />").replace(/.*<!--.*-->/g,"");
 // remove any break tags (up to 10) at the beginning
 for ( var i=0; i<10; i++ ) {
  if ( newString.substr(0,6)=="<br />" ) { 
   newString = newString.replace("<br />", ""); 
  }
 }
 return newString;
}

Hope this is helpful to some of you.

benomatis
OneNerd
3

You can either use the full CKEditor which cleans on paste, or look at the source.

Todd Main
3

I am using this:

$(body_doc).find('body').bind('paste',function(e){
                var rte = $(this);
                _activeRTEData = $(rte).html();
                var beginLen = $.trim($(rte).html()).length; 

                setTimeout(function(){
                    var text = $(rte).html();
                    var newLen = $.trim(text).length;

                    //identify the first char that changed to determine caret location
                    var caret = 0;

                    for(var i=0;i < newLen; i++){
                        if(_activeRTEData[i] != text[i]){
                            caret = i-1;
                            break;  
                        }
                    }

                    var origText = text.slice(0,caret);
                    var newText = text.slice(caret, newLen - beginLen + caret + 4);
                    var tailText = text.slice(newLen - beginLen + caret + 4, newLen);

                    newText = newText.replace(/(.*(?:endif-->))|([ ]?<[^>]*>[ ]?)|(&nbsp;)|([^}]*})/g,'');

                    newText = newText.replace(/[·]/g,'');

                    $(rte).html(origText + newText + tailText);
                    $(rte).contents().last().focus();
                },100);
            });

body_doc is the editable iframe; if you are using an editable div you can drop the .find('body') part. Basically, it detects a paste event, works out where the paste landed, cleans the new text, and then puts the cleaned text back where it was pasted. (It sounds confusing, but it's not really as bad as it sounds.)

The setTimeout is needed because you can't grab the text until it has actually been pasted into the element; paste events fire as soon as the paste begins.
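
Where the browser exposes event.clipboardData, another option may be to read the clipboard inside the paste handler and cancel the default paste, which avoids the timer and the caret arithmetic. This is only a minimal sketch under that assumption; makePasteHandler, clean and insert are hypothetical names, not part of the code above.

```javascript
// Hypothetical helper, not part of the answer above.
// `clean` is any string-to-string cleaner; `insert` puts the result into the editor.
function makePasteHandler(clean, insert) {
  return function onPaste(event) {
    var clipboard = event.clipboardData;
    // No clipboard access: do nothing and let the browser paste as usual.
    if (!clipboard) return;

    var html = clipboard.getData('text/html');
    // No HTML flavour on the clipboard: leave plain-text pastes to the browser.
    if (!html) return;

    // Cancel the default paste so only the cleaned version is inserted.
    event.preventDefault();
    insert(clean(html));
  };
}

// Assumed wiring in a browser (`editor` and `cleanWordHtml` are placeholders):
// editor.addEventListener('paste', makePasteHandler(cleanWordHtml, function (html) {
//   document.execCommand('insertHTML', false, html);
// }));
```

Because the handler only touches the event object and the two callbacks, it can be tried out with a fake event before it is attached to a real editor.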

Daniel Sellers
2

How about having a "paste as plain text" button which displays a <textarea>, allowing the user to paste the text in there? That way, all tags will be stripped for you. That's what I do with my CMS; I gave up trying to clean up Word's mess.
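
If the plain text from such a <textarea> is later shown as HTML, it probably needs escaping and its line breaks converting. A minimal sketch of that step, assuming <br /> is the wanted line separator; escapeHtml and plainTextToHtml are hypothetical names, not part of any CMS mentioned here.

```javascript
// Hypothetical helpers for turning textarea text into displayable HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')   // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function plainTextToHtml(text) {
  return escapeHtml(text)
    .replace(/\r\n?/g, '\n')  // normalise Windows and old Mac line endings
    .replace(/\n/g, '<br />');
}
```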

Josh
  • This would be my worst-case scenario I suppose (and the way its looking, may be the only scenario - very depressing). – OneNerd May 20 '10 at 15:38
  • @OneNerd: I marked your question as a favorite because if anyone else has a better solution I think I'll use it too! – Josh May 20 '10 at 18:12
  • i came up with something I *think* may be usable -- see my answer (and improve it too plz) if you would like. Thanks - – OneNerd May 20 '10 at 18:17
  • Wouldn't this be like sticking a puppy dog's nose in the mess he made on the carpet? – cmc Dec 06 '12 at 16:38
1

You can do it with regex

  1. Remove head tag

  2. Remove script tags

  3. Remove style tags

    let clipboardData = event.clipboardData || window.clipboardData;
    let pastedText = clipboardData.getData('text/html');
    // non-greedy match up to and including each closing tag, case-insensitive
    pastedText = pastedText.replace(/<head\b[^>]*>[\s\S]*?<\/head\s*>/gi, '');
    pastedText = pastedText.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '');
    pastedText = pastedText.replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, '');
    // pastedText = pastedText.replace(/<(?!(\/\s*)?(b|i|u)[>,\s])([^>])*>/g, '');
    

Here is the sample: https://stackblitz.com/edit/angular-u9vprc
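
The commented-out last line of the snippet hints at keeping a few tags (b, i, u) and dropping the rest. One way that idea could be sketched is with a replacer callback and an explicit allow-list. keepOnlyTags is a hypothetical name, and this is a rough regex pass, not an HTML parser or a security sanitizer: kept tags keep their attributes, and a ">" inside an attribute value will confuse it.

```javascript
// Hypothetical helper: drop every tag whose name is not in `allowed`,
// leaving the text between tags untouched.
function keepOnlyTags(html, allowed) {
  // Matches opening, closing and self-closing tags, including
  // namespaced ones such as Word's <o:p>.
  return html.replace(/<\/?([a-zA-Z][a-zA-Z0-9:]*)\b[^>]*>/g, function (tag, name) {
    return allowed.indexOf(name.toLowerCase()) !== -1 ? tag : '';
  });
}

// Example call:
// keepOnlyTags('<p class="MsoNormal"><b>Hi</b> <o:p></o:p><i>there</i></p>', ['b', 'i', 'u']);
```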

0

This works great to remove any comments from HTML text, including those from Word:

function CleanWordPastedHTML(sTextHTML) {
  var sStartComment = "<!--", sEndComment = "-->";
  while (true) {
    var iStart = sTextHTML.indexOf(sStartComment);
    if (iStart == -1) break;
    var iEnd = sTextHTML.indexOf(sEndComment, iStart);
    if (iEnd == -1) break;
    sTextHTML = sTextHTML.substring(0, iStart) + sTextHTML.substring(iEnd + sEndComment.length);
  }
  return sTextHTML;
}
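
For comparison, roughly the same comment stripping can be sketched as a single regex replace. stripHtmlComments is a hypothetical name; like the loop above, it leaves an unterminated comment in place.

```javascript
// Hypothetical one-line alternative to the loop above.
function stripHtmlComments(html) {
  // [\s\S] matches any character including newlines;
  // the lazy *? stops at the first "-->" after each "<!--".
  return html.replace(/<!--[\s\S]*?-->/g, '');
}
```

Word's conditional blocks of the form <!--[if gte mso 9]> ... <![endif]--> start and end like ordinary comments, so the same pattern covers them.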
user759463
0

I did something like that long ago, where I totally cleaned up the content of a rich text editor and converted font tags to styles, brs to p's, etc., to keep it consistent between browsers and prevent certain ugly things from getting in via paste. I took my recursive function and ripped out most of it except for the core logic. This might be a good starting point, if that is what you need ("result" is an object that accumulates the result, which probably takes a second pass to convert to a string):

var cleanDom = function(result, n) {
var nn = n.nodeName;
if(nn=="#text") {
    var text = n.nodeValue;

    }
else {
    if(nn=="A" && n.href)
        ...;
    else if(nn=="IMG" && n.src) {
        ....
        }
    else if(nn=="DIV") {
        if(n.className=="indent")
            ...
        }
    else if(nn=="FONT") {
        }       
    else if(nn=="BR") {
        }

    if(!UNSUPPORTED_ELEMENTS[nn]) {
        if(n.childNodes.length > 0)
            for(var i=0; i<n.childNodes.length; i++) 
                cleanDom(result, n.childNodes[i]);
        }
    }
}
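
To make the shape of that recursion concrete, here is a minimal, self-contained sketch that only collects text and skips some element types. It is not the author's original function: collectText and SKIPPED_ELEMENTS are hypothetical, the skip list is an assumption, and the elided branches above are not reconstructed.

```javascript
// Hypothetical skip list; the original UNSUPPORTED_ELEMENTS table is not shown.
var SKIPPED_ELEMENTS = { SCRIPT: true, STYLE: true, META: true, LINK: true };

// Walk a DOM-like tree depth-first, pushing text fragments onto `result`.
function collectText(result, node) {
  if (node.nodeName === '#text') {
    result.push(node.nodeValue);
    return;
  }
  // Drop comments and skipped elements together with everything inside them.
  if (node.nodeName === '#comment' || SKIPPED_ELEMENTS[node.nodeName]) return;
  if (node.nodeName === 'BR') result.push('\n');

  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    collectText(result, children[i]);
  }
}

// Assumed browser usage (`editor` is a placeholder for the editable element):
// var parts = []; collectText(parts, editor); var text = parts.join('');
```

The function only reads nodeName, nodeValue and childNodes, so it works on real DOM nodes and on plain objects with the same shape.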
rob
0

Had a similar issue with line-breaks being counted as characters and I had to remove them.

$(document).ready(function(){

  $(".section-overview textarea").bind({
    paste : function(){
    setTimeout(function(){
      //textarea
      var text = $(".section-overview textarea").val();
      // look for any "\n" occurrences and replace them
      var newString = text.replace(/\n/g, '');
      // print new string
      $(".section-overview textarea").val(newString);
    },100);
    }
  });
  
});
ericmotil
-1

Could you paste to a hidden textarea, copy from same textarea, and paste to your target?

souLTower
  • hmm - well, do you know a way to send the pasted content to a textarea so it is indeed plain text instead of the markup code -- since the keypress is on the DIV, I can read the contents and pass it to the textarea, but it wouldn't be plaintext. – OneNerd May 20 '10 at 15:40
  • I think that leaving the stuff as only text is not the best solution. The format is important. I work in an application that my customers doesn't want the styles from word to be removed. – Raul Luna Jul 25 '14 at 12:47
-4

Hate to say it, but I eventually gave up making TinyMCE handle Word crap the way I want. Now I just have an email sent to me every time a user's input contains certain HTML (look for <span lang="en-US"> for example) and I correct it manually.

Amy B