29

I know that spellcheckers are not perfect, but they become more useful as the amount of text to check grows. How can I spell check a site that has thousands of pages?

Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also, the job cannot be outsourced to a third party.

Edit: I have a list of all of the URLs on the site that I need to check.

Liam

11 Answers

7

Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded Javascript and CSS).

lynx -dump http://www.example.com

It also lists all URLs (converted to their absolute form) in the page, which can be filtered out using grep:

lynx -dump http://www.example.com | grep -v "http"

The URLs could also be local (file://) if I have used wget to mirror the site.

I will write a script that processes a set of URLs using this method and outputs each page to a separate text file. I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones).
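
A minimal sketch of that script, assuming the URL list lives in a file called urls.txt, one URL per line (the file name and the pages/ directory are my own placeholders):

#!/bin/sh
# Dump each page in the URL list to its own text file
mkdir -p pages
n=0
while read -r url
do
        n=$((n + 1))
        # Keep the rendered text, drop the list of absolute URLs
        lynx -dump "$url" | grep -v "http" >"pages/page$n.txt"
done <urls.txt

# List the misspelled words across all pages, once each
cat pages/*.txt | aspell list | sort -u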

This will ignore text in title and meta elements. These can be spellchecked separately.

Liam
  • You can use wget -r to grab all your web pages recursively. Then, run lynx on the local files, and spellcheck from there. – strager Mar 10 '09 at 01:50
3

Just a few days ago I discovered Spello, a web site spell checker. It uses my NHunspell (OpenOffice spell checker for .NET) library. You can give it a try.

Thomas Maierhofer
2

I highly recommend Inspyder InSite. It is commercial software, but they have a trial available, and it is well worth the money. I have used it for years to check the spelling of client websites. It supports automation/scheduling, can integrate with CMS custom word lists, and is also a good way to check links and generate reports.

Luke P M
2

If you can access the site's content as files, you can write a small Unix shell script that does the job. The following script will print the name of a file, line number, and misspelled words. The output's quality depends on that of your system's dictionary.

#!/bin/sh

# Sort the system dictionary once, in a known collation order,
# so comm(1) sees both of its inputs sorted the same way
dict=/tmp/spell.$$.dict
tr '[:upper:]' '[:lower:]' </usr/share/dict/words | LC_ALL=C sort -u >"$dict"

# Find HTML files
find "$1" -name '*.html' -type f |
while read -r f
do
        # Split file into words
        sed '
# Remove CSS
/<style/,/<\/style/d
# Remove JavaScript
/<script/,/<\/script/d
# Remove HTML tags
s/<[^>]*>//g
# Remove non-word characters
s/[^a-zA-Z]/ /g
# Split words into lines
s/[[:space:]][[:space:]]*/\
/g' "$f" |
        # Fold to lower case so capitalized words match the dictionary
        tr '[:upper:]' '[:lower:]' |
        # Remove blank lines
        sed '/^$/d' |
        # Sort the words in the same order as the dictionary
        LC_ALL=C sort -u |
        # Print words not in the dictionary
        comm -23 - "$dict" >/tmp/spell.$$.out
        # See if errors were found
        if [ -s /tmp/spell.$$.out ]
        then
                # Print file name, line number, and matching words
                grep -FHino -f /tmp/spell.$$.out "$f"
        fi
done
# Remove temporary files
rm -f "$dict" /tmp/spell.$$.out
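
Assuming the script is saved as spellcheck.sh (the name is my own) and the mirrored site lives under htdocs/, you would invoke it with the document root as its argument:

sh spellcheck.sh htdocs
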
Diomidis Spinellis
  • +1 :: Even if you cannot get the site source files, you can use wget -m (mirror mode) to spider the site. – garrow Feb 25 '09 at 12:11
  • This does not filter out JavaScript and CSS embedded in the HTML. – Liam Feb 25 '09 at 12:35
  • Also, some words like 'at' and 'me' are output as misspelled words even though they are in the dictionary. – Liam Feb 25 '09 at 12:36
  • I modified the code to remove JavaScript and CSS. Note: the code is an example, you should modify it to make it fit your setup. – Diomidis Spinellis Feb 26 '09 at 06:01
1

You could do this with a shell script combining wget with aspell. Did you have a programming environment in mind?

I'd personally use Python with Beautiful Soup to extract the text from the tags, and pipe the text through aspell.
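
For the wget-plus-aspell route, something along these lines would do; the URL is a placeholder, and aspell's HTML filter strips the markup before checking:

wget -q -O - http://www.example.com | aspell --mode=html list | sort -u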

Anthony Roy
1

If it's a one-off, then given the number of pages to check it might be worth considering something like spellr.us, which would be a quick solution. You can enter your website URL on the homepage to get a feel for how it would report spelling mistakes.

http://spellr.us/

But I'm sure there are some free alternatives.

kevchadders
0

I made an English-only spell checker with Ruby here: https://github.com/Vinietskyzilla/fuzzy-wookie

Try it out.

Its main deficiency is the absence of a thorough dictionary that includes all forms of each word (plurals, not just singulars; 'has', not just 'have'). Substituting your own dictionary, if you can find or make a better one, would make it really awesome.

That aside, I think the simplest way to spell check a single web page is to press Ctrl+A (or Cmd+A) to select all text, then copy and paste it into a multiline text box on a web page (for example <html><head></head><body><textarea></textarea></body></html>). Your browser should underline any misspelled words.

David Winiecki
0

@Anthony Roy I've done exactly what you describe: piped the page through Aspell via PyEnchant. I have English dictionaries (GB, CA, US) in use at my site https://www.validator.pro/. Contact me and I will set up a one-time job for you to check 1,000 pages or more.

Scott Grodberg
0

Use templates (well) with your web app (if you're programming the site rather than just writing HTML), and an HTML editor that includes spell checking. Eclipse does, for one.

If that's not possible for some reason... yeah, use wget to download the finished pages, and something like this:

http://netsw.org/dict/tools/ispell-html-mode.patch
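
A rough sketch of that workflow (the mirror/ directory is my own placeholder, and I'm assuming an ispell with its -H HTML/SGML mode available):

wget -m -P mirror/ http://www.example.com/
find mirror/ -name '*.html' -exec ispell -H {} \;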

Lee B
0

We use the Telerik RAD Spell control in our ASP.NET applications.

Telerik RAD Spell

Michael Kniskern
0

You may want to check out a library like jspell.

Jas Panesar