How to parse lots of HTML files with a Linux command or bash script

Question

I am mirroring a website with wget. I wrote a script for this, and cron runs it every day to take a fresh copy of the site. The mirror is stored under /var/www so that it can be viewed in a browser (localhost). I want to remove user input areas such as login or search forms from the HTML files. I can edit the files by hand, but I would like to do it with a script. Can you help me?

Please describe in more detail how you want to alter the HTML files. Also, please show us what you have tried so far. — , Mar 06 '14 at 13:26
Are there any specific _tags_ you want to remove? Or the other way around, is there a specific _tag_ you want to keep? e.g.: keeping only the text between
and — jimm-cl, Mar 06 '14 at 13:27
I want to remove all tags that contain "login" or "search". I did this with cat, grep, rm and mv for one HTML file, but I want to do it for all HTML files with a script. – user3388268, Mar 06 '14 at 13:47
possible duplicate of [Linux command line global search and replace](http://stackoverflow.com/questions/471183/linux-command-line-global-search-and-replace) — tripleee, Mar 12 '14 at 07:20

score 0 · Accepted Answer · answered Mar 12 '14 at 06:42

Maybe you are looking for something like this:

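# In GNU sed, \< and \> are word boundaries, not literal angle brackets; [^>]* keeps each match inside one tag.
sed -e 's/<input[^>]*type="text"[^>]*>//g' -e 's/<input[^>]*type="password"[^>]*>//g' your-html > new.html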
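
To run the same substitution over every HTML file in a directory, a loop along these lines may help. It is only a sketch: the function name strip_inputs is made up for this example, the patterns are a variant of the ones above that uses [^>]* so a match cannot run past the end of a tag, and -i (edit in place) is a GNU sed extension. A regex like this will also miss tags that are split across several lines.

```shell
# Hypothetical helper: strip text and password <input> tags from every
# *.html file directly inside the given directory (default: current one).
strip_inputs() {
    for file in "${1:-.}"/*.html; do
        # Skip the unexpanded pattern when no .html file matches.
        [ -f "$file" ] || continue
        # -i edits in place (GNU sed); [^>]* keeps each match inside one tag.
        sed -i \
            -e 's/<input[^>]*type="text"[^>]*>//g' \
            -e 's/<input[^>]*type="password"[^>]*>//g' \
            "$file"
    done
}
```

It could then be called as strip_inputs /var/www, ideally on a copy of the mirror first, since the files are overwritten.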

iamsrijon

Thanks for the tip, it helped me a lot. I want to do this for all HTML files in the directory. I tried a "find . -name "*.html" ; while" command before that, but it didn't work. Is there any way to run this command for all HTML files in the directory? Thanks in advance. – user3388268 Mar 14 '14 at 07:45

score 0 · Answer 2 · answered Mar 12 '14 at 07:16

Because you are not telling us what to fix, we can't help you with the specifics, but to remove foo and </bar> anywhere in the tree of HTML files, try something like:

find /var/www/mirror.example.com -type f -name '*.html' \
    -exec sed -i 's/foo//;s%</bar>%%' {} \;

If your find supports + instead of \; as the -exec terminator, this can be made somewhat more efficient, because sed is then started once for many files instead of once per file.
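
The batched form could look like the sketch below. The function name strip_tree is invented for illustration, and the g flags are an addition here so that every occurrence on a line is removed, not just the first; -i is again a GNU sed extension.

```shell
# Hypothetical helper: delete every "foo" and "</bar>" in all *.html files
# below the given directory (default: current one).
strip_tree() {
    # "{} +" passes many file names to each sed invocation, so far fewer
    # processes are started than with "{} \;".
    find "${1:-.}" -type f -name '*.html' \
        -exec sed -i 's/foo//g;s%</bar>%%g' {} +
}
```

For the mirror in the answer above, this would be called as strip_tree /var/www/mirror.example.com.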

tripleee

This is one of the most frequently asked questions here; if you need more help, go look at the 10,000 near-duplicates. – tripleee Mar 12 '14 at 07:17

score 0 · Answer 3 · edited Apr 13 '17 at 12:51

You can use the Ex editor to edit the HTML page in place, for example:

ex -V1 "$PAGE" <<-EOF
  " Correcting missing protocol, see: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2359 "
  %s,'//,'http://,ge
  %s,"//,"http://,ge
  " Correcting relative paths, see: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2359 "
  %s,[^,]\zs'/\ze[^>],'http://www.example.com/,ge
  %s,[^,]\zs"/\ze[^>],"http://www.example.com/,ge
  " Remove the margin on the left of the main block. "
  %s/id="doc_container"/id="doc_container" style="min-width:0px;margin-left : 0px;"/g
  %s/<div class="outer_page/<div style="margin: 0px;" class="outer_page/g
  " Remove useless html elements. "
  /<div.*id="global_header"/norm nvatd
  /<div class="header_spacer"/norm nvatd
  /<div.*id="doc_info"/norm nvatd
  /<div.*class="toolbar_spacer"/norm nvatd
  /<div.*between_page_ads_1/norm nvatd
  /id="leaderboard_ad_main">/norm nvatd
  /class="page_missing_explanation/norm nvatd
  /<div id="between_page_ads/norm nvatd
  /<div class="b_..">/norm nvatd
  /<div class="shadow_overlay">/norm nvatd
  /grab_blur_promo_here/norm nvatd
  /missing_page_buy_button/norm nvatd
  wq " Update changes and quit.
EOF

For multiple files, use bufdo and save all files at once via xa.
