29

I know that spellcheckers are not perfect, but they become more useful as the amount of text to check grows. How can I spell check a site that has thousands of pages?

Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also, the job cannot be outsourced to a third party.

Edit: I have a list of all of the URLs on the site that I need to check.

Liam

11 Answers

7

Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded Javascript and CSS).

lynx -dump http://www.example.com

It also lists all URLs (converted to their absolute form) in the page, which can be filtered out using grep:

lynx -dump http://www.example.com | grep -v "http"

The URLs could also be local (file://) if I have used wget to mirror the site.

I will write a script that processes a set of URLs using this method and outputs each page to a separate text file. I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones).
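
A minimal sketch of that script, assuming the URL list lives in a file called urls.txt, one URL per line (the file name and the pages/ directory are my own placeholders):

#!/bin/sh
# Dump each page in the URL list to its own text file
mkdir -p pages
n=0
while read -r url
do
        n=$((n + 1))
        # Keep the rendered text, drop the list of absolute URLs
        lynx -dump "$url" | grep -v "http" >"pages/page$n.txt"
done <urls.txt

# List the misspelled words across all pages, once each
cat pages/*.txt | aspell list | sort -u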

This will ignore text in title and meta elements. These can be spellchecked separately.

Liam
  • You can use wget -r to grab all your web pages recursively. Then, run lynx on the local files, and spellcheck from there. – strager Mar 10 '09 at 01:50
3

Just a few days ago I discovered Spello, a web site spell checker. It uses my NHunspell (OpenOffice spell checker for .NET) library. You can give it a try.

Thomas Maierhofer
2

I highly recommend Inspyder InSite. It is commercial software, but they have a trial available, and it is well worth the money. I have used it for years to check the spelling of client websites. It supports automation/scheduling, can integrate with CMS custom word lists, and is also a good way to check links and generate reports.

Luke P M
2

If you can access the site's content as files, you can write a small Unix shell script that does the job. The following script will print the name of a file, line number, and misspelled words. The output's quality depends on that of your system's dictionary.

#!/bin/sh

# Sort the system dictionary once, in a known collation order,
# so comm(1) sees both of its inputs sorted the same way
dict=/tmp/spell.$$.dict
tr '[:upper:]' '[:lower:]' </usr/share/dict/words | LC_ALL=C sort -u >"$dict"

# Find HTML files
find "$1" -name '*.html' -type f |
while read -r f
do
        # Split file into words
        sed '
# Remove CSS
/<style/,/<\/style/d
# Remove JavaScript
/<script/,/<\/script/d
# Remove HTML tags
s/<[^>]*>//g
# Remove non-word characters
s/[^a-zA-Z]/ /g
# Split words into lines
s/[[:space:]][[:space:]]*/\
/g' "$f" |
        # Fold to lower case so capitalized words match the dictionary
        tr '[:upper:]' '[:lower:]' |
        # Remove blank lines
        sed '/^$/d' |
        # Sort the words in the same order as the dictionary
        LC_ALL=C sort -u |
        # Print words not in the dictionary
        comm -23 - "$dict" >/tmp/spell.$$.out
        # See if errors were found
        if [ -s /tmp/spell.$$.out ]
        then
                # Print file name, line number, and matching words
                grep -FHino -f /tmp/spell.$$.out "$f"
        fi
done
# Remove temporary files
rm -f "$dict" /tmp/spell.$$.out
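
Assuming the script is saved as spellcheck.sh (the name is my own) and the mirrored site lives under htdocs/, you would invoke it with the document root as its argument:

sh spellcheck.sh htdocs
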
Diomidis Spinellis
  • +1 :: Even if you cannot get the site source files, you can use wget -m (mirror mode) to spider the site. – garrow Feb 25 '09 at 12:11
  • This does not filter out JavaScript and CSS embedded in the HTML. – Liam Feb 25 '09 at 12:35
  • Also, some words like 'at' and 'me' are output as misspelled words even though they are in the dictionary. – Liam Feb 25 '09 at 12:36
  • I modified the code to remove JavaScript and CSS. Note: the code is an example, you should modify it to make it fit your setup. – Diomidis Spinellis Feb 26 '09 at 06:01
1

You could do this with a shell script combining wget with aspell. Did you have a programming environment in mind?

I'd personally use Python with Beautiful Soup to extract the text from the tags, and pipe the text through aspell.
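
For the wget-plus-aspell route, something along these lines would do; the URL is a placeholder, and aspell's HTML filter strips the markup before checking:

wget -q -O - http://www.example.com | aspell --mode=html list | sort -u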

Anthony Roy
1

If it's a one-off, then given the number of pages to check it might be worth considering something like spellr.us, which would be a quick solution. You can enter your website URL on the homepage to get a feel for how it would report spelling mistakes.

http://spellr.us/

But I'm sure there are some free alternatives.

kevchadders
0

I made an English-only spell checker with Ruby here: https://github.com/Vinietskyzilla/fuzzy-wookie

Try it out.

Its main deficiency is the absence of a thorough dictionary that includes all forms of each word (plurals, not just singulars; 'has', not just 'have'). Substituting your own dictionary, if you can find or make a better one, would make it really awesome.

That aside, I think the simplest way to spell check a single web page is to press Ctrl+A (or Cmd+A) to select all text, then copy and paste it into a multiline text box on a web page (for example <html><head></head><body><textarea></textarea></body></html>). Your browser should underline any misspelled words.

David Winiecki
0

@Anthony Roy I've done exactly what you describe: piped the page through Aspell via PyEnchant. I have English dictionaries (GB, CA, US) in use at my site https://www.validator.pro/. Contact me and I will set up a one-time job for you to check 1,000 pages or more.

Scott Grodberg
0

Use templates (well) with your web app (if you're programming the site rather than just writing HTML), and an HTML editor that includes spell checking. Eclipse does, for one.

If that's not possible for some reason... yeah, use wget to download the finished pages, and something like this:

http://netsw.org/dict/tools/ispell-html-mode.patch
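
A rough sketch of that workflow (the mirror/ directory is my own placeholder, and I'm assuming an ispell with its -H HTML/SGML mode available):

wget -m -P mirror/ http://www.example.com/
find mirror/ -name '*.html' -exec ispell -H {} \;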

Lee B
0

We use the Telerik RAD Spell control in our ASP.NET applications.

Telerik RAD Spell

Michael Kniskern
0

You may want to check out a library like jspell.

Jas Panesar