How can I validate large numbers of files with search and replace?

Question

I am currently validating a client's HTML Source and I am getting a lot of validation errors for images and input files which do not have the Omittag. I would do it manually but this client literally has thousands of files, with a lot of instances where the is not .

This client has validated some img tags (for whatever reason).

Just wondering if there is a unix command I could run to check to see if the does not have a Omittag to add it.

I have done simple search and replaces with the following command:

find . \! -path '*.svn*' -type f -exec sed -i -n '1h;1!H;${;g;s/<b>/<strong>/g;p}' {} \;

But never something this large. Any help would be appreciated.

Questions ... Are you using GNU sed or some other sed, like BSD? (You can do "man sed" to see which one is on your system.) Are you trying to validate as HTML, i.e. , or XHTML, i.e. ? — joelhardi, Oct 28 '08 at 05:42

Anirvan · Answer 1 · 2008-10-28T06:43:16.970

Try this. It'll go through your files, make a .orig backup of each file (perl's -i operator), and replace <img> and <input> tags with <img /> and <input >.

find . \! -path '*.svn*' -type f -exec perl -pi.orig -e 's{ ( <(?:img|input)\b ([^>]*?) ) \ ?/?> }{$1\ />}sgxi' {} \;

Given input:

<img>  <img/>  <img src="..">  <img src="" >
<input>  <input/>  <input id="..">  <input id="" >

It changes the file to:

<img />  <img />  <img src=".." />  <img src="" />
<input />  <input />  <input id=".." />  <input id="" />

Here's what the regexp is doing:

s{(<(?:img|input)\b ([^>]*?)) # capture "<img" or "<input" followed by non-">" chars
  \ ?/?>}                     # optional space, optional slash, followed by ">"
{$1\ />}sgxi                  # replace with: captured text, plus " />"

score 0 · Accepted Answer · answered Oct 28 '08 at 06:16

See questions I asked in comment at top.

Assuming you're using GNU sed, and that you're trying to add the trailing / to your tags to make XML-compliant <img /> and <input />, then replace the sed expression in your command with this one, and it should do the trick: '1h;1!H;${;g;s/$img\|input$$ [^>]*[^/]$>/\1\2\/>/g;p;}'

Here it is on a simple test file (SO's colorizer doing wacky things):

$ cat test.html
This is an <img tag> without closing slash.
Here is an <img tag /> with closing slash.
This is an <input tag > without closing slash.
And here one <input attrib="1" 
    > that spans multiple lines.
Finally one <input
  attrib="1" /> with closing slash.

$ sed -n '1h;1!H;${;g;s/\(img\|input\)\( [^>]*[^/]\)>/\1\2\/>/g;p;}' test.html
This is an <img tag/> without closing slash.
Here is an <img tag /> with closing slash.
This is an <input tag /> without closing slash.
And here one <input attrib="1" 
    /> that spans multiple lines.
Finally one <input
  attrib="1" /> with closing slash.

Here's GNU sed regex syntax and how the buffering works to do multiline search/replace.

Alternately you could use something like Tidy that's designed for sanitizing bad HTML -- that's what I'd do if I were doing anything more complicated than a couple of simple search/replaces. Tidy's options get complicated fast, so it's usually better to write a script in your scripting language of choice (Python, Perl) that calls libtidy and sets whatever options you need.

How can I validate large numbers of files with search and replace?

2 Answers2

Linked