I am mirroring a website with the wget command, using a script that runs from crontab every day. The mirror is stored under /var/www so that it can be viewed in a browser on localhost. I want to remove user input areas such as login or search fields from the HTML files. I can edit the files manually, but I would like to do the parsing with a script. Can you help me?
tripleee

user3388268
3 Answers
0
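Maybe you are looking for something like this (the angle brackets are left unescaped because GNU sed treats \< and \> as word boundaries, and [^>]* keeps each match inside a single tag):
sed -e 's/<input[^>]*type="text"[^>]*>//g' -e 's/<input[^>]*type="password"[^>]*>//g' your-html > new.html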
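As a quick check of this kind of substitution, here is a minimal sketch run on a single sample line. The sample markup and the variable names are made up for illustration, and the two expressions are folded into one with -E, which both GNU and BSD sed accept.

```shell
#!/bin/sh
# Hypothetical sample line; a real page would come from the mirror directory.
sample='<form><input type="text" name="q"><input type="password" name="p"><input type="submit" value="Go"></form>'

# [^>]* cannot cross a closing ">", so each match stays inside a single tag.
# -E enables extended regular expressions for the (text|password) alternation.
result=$(printf '%s\n' "$sample" |
  sed -E 's/<input[^>]*type="(text|password)"[^>]*>//g')

printf '%s\n' "$result"
```

Only the text and password inputs are dropped; the submit button and the surrounding form tags are left alone. Attributes written with single quotes or in upper case would need extra patterns.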

iamsrijon
Thanks for the tip, it helped me a lot. I want to do this for all HTML files in the directory. I tried a "find . -name "*.html" ; while" command first, but it didn't work. Is there any way to run this command for all HTML files in the directory? Thanks in advance. – user3388268 Mar 14 '14 at 07:45
0
Because you are not telling us what to fix, we can't help you with the specifics, but to remove foo and </bar> anywhere in the tree of HTML files, something like
find /var/www/mirror.example.com -type f -name '*.html' \
-exec sed -i 's/foo//;s%</bar>%%' {} \;
If your find supports + instead of \; this can be made somewhat more efficient.
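A minimal sketch of the + form, run against a scratch directory so that it can be tried safely. The temporary tree and its two files are hypothetical stand-ins for the real mirror, and sed -i without a backup suffix assumes GNU sed.

```shell
#!/bin/sh
# Hypothetical scratch tree standing in for /var/www/mirror.example.com
demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf '%s\n' 'foo<bar>text</bar>' > "$demo/index.html"
printf '%s\n' '<bar>foo more</bar>' > "$demo/sub/page.html"

# "{} +" hands many file names to one sed process instead of one process per file.
# sed -i with no suffix assumes GNU sed.
find "$demo" -type f -name '*.html' \
  -exec sed -i 's/foo//;s%</bar>%%' {} +

first=$(cat "$demo/index.html")
second=$(cat "$demo/sub/page.html")
printf '%s\n%s\n' "$first" "$second"
```

With thousands of mirrored pages, the saving comes from starting far fewer sed processes; the edits themselves are identical to the \; form.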

tripleee
This is one of the most frequently asked questions here; if you need more help, go look at the 10,000 near-duplicates. – tripleee Mar 12 '14 at 07:17
0
You can use the Ex editor to edit the HTML page in place, for example:
ex -V1 "$PAGE" <<-EOF
" Correcting missing protocol, see: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2359 "
%s,'//,'http://,ge
%s,"//,"http://,ge
" Correcting relative paths, see: https://github.com/wkhtmltopdf/wkhtmltopdf/issues/2359 "
%s,[^,]\zs'/\ze[^>],'http://www.example.com/,ge
%s,[^,]\zs"/\ze[^>],"http://www.example.com/,ge
" Remove the margin on the left of the main block. "
%s/id="doc_container"/id="doc_container" style="min-width:0px;margin-left : 0px;"/g
%s/<div class="outer_page/<div style="margin: 0px;" class="outer_page/g
" Remove useless html elements. "
/<div.*id="global_header"/norm nvatd
/<div class="header_spacer"/norm nvatd
/<div.*id="doc_info"/norm nvatd
/<div.*class="toolbar_spacer"/norm nvatd
/<div.*between_page_ads_1/norm nvatd
/id="leaderboard_ad_main">/norm nvatd
/class="page_missing_explanation/norm nvatd
/<div id="between_page_ads/norm nvatd
/<div class="b_..">/norm nvatd
/<div class="shadow_overlay">/norm nvatd
/grab_blur_promo_here/norm nvatd
/missing_page_buy_button/norm nvatd
wq " Update changes and quit.
EOF
For multiple files, use bufdo and save all files at once via xa.
– jimm-cl Mar 06 '14 at 13:27