1

I have an HTML file that contains tags holding binary data, like:

<HTML>
  <BODY STYLE="font: 10pt Times New Roman, Times, Serif">
    <TEXT>
      begin 644 image_002.jpg
        M_]C_X  02D9)1@ ! 0   0 !  #_VP!#  @&!@<&!0@'!P<)"0@*#!0-# L+
        M#!D2$P\4'1H?'AT:'!P@)"XG("(L(QP<*#<I+# Q-#0T'R<Y/3@R/"XS-#+_
        MVP!# 0D)"0P+#!@-#1@R(1PA,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R,C(R
       ,Z4]1]: %HHHIB/_9
    end
   </TEXT>
   <TEXT>losses occurring in the third quarter and from weather  </TEXT>
  </BODY>
</HTML>

So I am trying to remove all "TEXT" tags that contain binary data using Java regex. I tried the Jsoup library, but it only removes the HTML tags. I saw the same question here, but it does not use Java regex.

Is there a standard way to remove this binary data from an HTML file?

Avinash Anand
Sky

2 Answers

1

It is well known that you shouldn't use a regex to parse XHTML.

I would use jsoup to remove the whole element and then add it back empty.

But if you want to use a regex, then you can use a regex like this:

"your html here".replaceAll("(?s)<TEXT>.*?<\\/TEXT>", "<TEXT></TEXT>")

Working demo
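
As a minimal sketch of what this replacement does, the snippet below runs the same pattern on a shortened sample. The class name `EmptyAllText`, the method `emptyAll`, and the sample string are made up for illustration. Note that the pattern matches every TEXT element, so the plain-text one is emptied along with the encoded one:

```java
public class EmptyAllText {
    // Empties every <TEXT>...</TEXT> element.
    // (?s) makes '.' match line breaks; .*? stops at the nearest closing tag.
    static String emptyAll(String html) {
        return html.replaceAll("(?s)<TEXT>.*?<\\/TEXT>", "<TEXT></TEXT>");
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the sample from the question.
        String html = String.join("\n",
            "<TEXT>",
            "begin 644 image_002.jpg",
            "M_]C_X  02D9)1@",
            "end",
            "</TEXT>",
            "<TEXT>losses occurring in the third quarter</TEXT>");
        System.out.println(emptyAll(html));
    }
}
```

If only the encoded blocks should go, the pattern has to require the `begin <mode> <filename>` header inside the element.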

Federico Piazza
  • Thanks for your help. I am trying to remove only the "<TEXT>" tags that have binary data, not all tags. It is not working. – Sky May 03 '18 at 05:56
  • @Sky OK, update your question with samples of valid and invalid tags so I can update the answer to help you – Federico Piazza May 03 '18 at 13:31
1
   val regex =  """<TEXT>\s*begin \d+ (?>[^e]+|e(?!nd\s*<\/TEXT>))*end\s*<\/TEXT>"""

Full example available here
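
The pattern above is written as a Scala triple-quoted string. Since the question asks for Java, here is a rough Java sketch of the same idea, assuming the binary blocks always start with `begin <digits> <name>` and finish with `end` right before `</TEXT>`. The class name `UuencodedTextRemover`, the method `removeUuencoded`, and the sample input are hypothetical. The `\/` escape is dropped because `/` needs no escaping in a Java regex.

```java
import java.util.regex.Pattern;

public class UuencodedTextRemover {
    // Port of the Scala pattern; backslashes are doubled for a Java string literal.
    // The atomic group consumes runs of non-'e' characters, or a single 'e'
    // that does not start the closing "end</TEXT>" sequence.
    private static final Pattern UUENCODED_TEXT = Pattern.compile(
        "<TEXT>\\s*begin \\d+ (?>[^e]+|e(?!nd\\s*</TEXT>))*end\\s*</TEXT>");

    // Removes only the TEXT elements that hold a begin...end block.
    static String removeUuencoded(String html) {
        return UUENCODED_TEXT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the sample from the question.
        String html = String.join("\n",
            "<BODY>",
            "<TEXT>",
            "begin 644 image_002.jpg",
            "M_]C_X  02D9)1@",
            "end",
            "</TEXT>",
            "<TEXT>losses occurring in the third quarter</TEXT>",
            "</BODY>");
        System.out.println(removeUuencoded(html));
    }
}
```

A TEXT element that does not begin with the `begin` header is left alone, because the pattern requires the header directly after the opening tag.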

Sky