-2

I am mirroring a website with the wget command, using a script that crontab runs every day to take a fresh replica of the site. The mirror is stored under /var/www so that it can be accessed from a browser (localhost). I want to remove user input areas such as login or search forms from the HTML files. I can edit the files by hand, but I would like to do it with a script. Can you help me?

user3388268
  • Please describe in more detail how you want to alter the HTML files. Also, please show us what you have tried so far. –  Mar 06 '14 at 13:26
  • Are there any specific _tags_ you want to remove? Or, the other way around, is there a specific _tag_ you want to keep, e.g. keeping only the text between a particular pair of tags? – jimm-cl Mar 06 '14 at 13:27
  • I want to remove all tags that contain "login" or "search". I did this for one HTML file using cat, grep, rm and mv, but I want to do it for all HTML files with a script. – user3388268 Mar 06 '14 at 13:47
  • possible duplicate of [Linux command line global search and replace](http://stackoverflow.com/questions/471183/linux-command-line-global-search-and-replace) – tripleee Mar 12 '14 at 07:20

3 Answers

0

Maybe you are looking for something like this:

sed -e 's/<input[^>]*type="text"[^>]*>//g' -e 's/<input[^>]*type="password"[^>]*>//g' your-html > new.html
iamsrijon
  • Thanks for the tip, it helped me a lot. I want to do this for all HTML files in the directory. I tried a "find . -name "*.html" ; while" command before that, but it didn't work. Is there any way to run this command for all HTML files in the directory? Thanks in advance. – user3388268 Mar 14 '14 at 07:45
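One possible way to run that kind of sed substitution over every HTML file with a find and while loop is sketched below. The function name strip_inputs and the example path are assumptions made for illustration, and the patterns only catch input tags that sit on a single line with the type attribute written exactly as shown.

```shell
#!/bin/sh
# Hypothetical helper: remove text and password <input> tags from every
# *.html file below the directory passed as the first argument.
strip_inputs() {
    # Reading line by line copes with spaces in file names,
    # but not with newlines in file names.
    find "$1" -type f -name '*.html' |
    while IFS= read -r file; do
        tmp=$(mktemp) || break
        # [^>]* keeps each match inside a single tag.
        # Writing back with cat preserves the file's owner and permissions.
        sed -e 's/<input[^>]*type="text"[^>]*>//g' \
            -e 's/<input[^>]*type="password"[^>]*>//g' \
            "$file" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
    done
}

# Example call (the path is a placeholder):
# strip_inputs /var/www/mirror.example.com
```

Going through a temporary file avoids relying on sed -i, which behaves differently between GNU and BSD sed.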
0

Because you are not telling us what to fix, we can't help you with the specifics, but to remove foo and </bar> anywhere in the tree of HTML files, try something like:

find /var/www/mirror.example.com -type f -name '*.html' \
    -exec sed -i 's/foo//;s%</bar>%%' {} \;

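If your find supports \+ instead of \; as the -exec terminator, this can be made somewhat more efficient, because sed is then started once for a whole batch of files instead of once per file.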
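As a rough sketch, the batched form could look like this. The function name clean_tree is hypothetical, the substitutions are the same foo and </bar> placeholders, and sed -i without a suffix argument assumes GNU sed.

```shell
#!/bin/sh
# Hypothetical wrapper: apply the placeholder substitutions to every
# *.html file below the directory passed as the first argument.
clean_tree() {
    # "{} +" makes find pass many file names to each sed invocation.
    # -i edits the files in place (GNU sed syntax, assumed here).
    find "$1" -type f -name '*.html' \
        -exec sed -i 's/foo//;s%</bar>%%' {} +
}

# Example call (the path is a placeholder):
# clean_tree /var/www/mirror.example.com
```

With GNU sed, -i treats each file separately, so batching the file names does not change the result, only the number of sed processes.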

tripleee
  • This is one of the most frequently asked questions here; if you need more help, go look at the 10,000 near-duplicates. – tripleee Mar 12 '14 at 07:17
0

You can use the Ex editor to edit the HTML page in place, for example:

ex -V1 "$PAGE" <<-EOF
  " Correcting missing protocol, see: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2359 "
  %s,'//,'http://,ge
  %s,"//,"http://,ge
  " Correcting relative paths, see: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2359 "
  %s,[^,]\zs'/\ze[^>],'http://www.example.com/,ge
  %s,[^,]\zs"/\ze[^>],"http://www.example.com/,ge
  " Remove the margin on the left of the main block. "
  %s/id="doc_container"/id="doc_container" style="min-width:0px;margin-left : 0px;"/g
  %s/<div class="outer_page/<div style="margin: 0px;" class="outer_page/g
  " Remove useless html elements. "
  /<div.*id="global_header"/norm nvatd
  /<div class="header_spacer"/norm nvatd
  /<div.*id="doc_info"/norm nvatd
  /<div.*class="toolbar_spacer"/norm nvatd
  /<div.*between_page_ads_1/norm nvatd
  /id="leaderboard_ad_main">/norm nvatd
  /class="page_missing_explanation/norm nvatd
  /<div id="between_page_ads/norm nvatd
  /<div class="b_..">/norm nvatd
  /<div class="shadow_overlay">/norm nvatd
  /grab_blur_promo_here/norm nvatd
  /missing_page_buy_button/norm nvatd
  wq " Update changes and quit.
EOF

For multiple files, use bufdo and save all files at once via xa.
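A minimal sketch of the bufdo and xa combination follows. It assumes that ex is the one shipped with Vim, the function name fix_protocol is hypothetical, and the single substitution is just the missing-protocol fix taken from the script above.

```shell
#!/bin/sh
# Hypothetical wrapper: run one substitution in every file given as an
# argument, then write all changed buffers and quit.
fix_protocol() {
    # -s runs ex silently (batch mode).
    # bufdo! visits every buffer even when the current one has unsaved
    # changes; the "e" flag suppresses the error when nothing matches.
    # xa then saves all modified buffers and exits.
    ex -s -c 'bufdo! %s,"//,"http://,ge' -c 'xa' "$@"
}

# Example call (the glob is a placeholder):
# fix_protocol /var/www/mirror.example.com/*.html
```

All files are loaded into one editor session, so this suits a moderate number of pages; for a very large tree, a loop or find may be a better fit.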

kenorb