
I am writing a markup document in Finnish.

I'm using the lang="fi-fi" attribute. Am I supposed to use markup entities (&auml; for ä, etc.) in conjunction with the language attribute, or is the language attribute alone sufficient? How do entities and the language attribute affect each other?

The "problem" comes from the fact that the markup is written without entities, and I have a script that is supposed to replace the Scandinavian letters with entities using regular expressions. After I added the lang attribute, the script no longer appears to work (it supposedly did before the attribute was added).

My main concern is that the markup renders correctly regardless of the browser, although a "modern" browser can be assumed.

2 Answers


The lang attribute and entities do completely different jobs.

The lang attribute tells the parser what human language the document is written in. This allows, for example, search engines to tell whether it is a good document to present to Finnish speakers, and screen reader software to select the correct pronunciation library.

Entities just let you represent characters that you couldn't otherwise represent. e.g.

  • Because you can't type the character on your keyboard
  • Because the character encoding the document is saved in (e.g. ASCII) doesn't include the character. This century you should be using UTF-8 just about everywhere and shouldn't need to worry about that.
  • Because the character would otherwise have special meaning in HTML (e.g. <).

As a rule of thumb:

  • Always use a lang attribute if you know what language the text of the document will be written in
  • Always use entities for characters with special meaning in HTML
  • Use literal characters if you can be reasonably certain the character encoding won't be mangled (which you can be most of the time) as they use fewer bytes and are easier to read in source code.
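
As a rough illustration of the last two points, here is a minimal Python sketch (Python is an assumption; the question does not say what the script is written in). It escapes only the characters that have special meaning in HTML and leaves the Finnish letters as literal characters, which is fine as long as the output is saved and served as UTF-8. The sample string is made up.

```python
import html

text = "Hyvää päivää & <tervetuloa>"

# html.escape replaces &, < and > (and quotes, by default) with entities.
# Letters such as ä are left as literal characters.
escaped = html.escape(text)
print(escaped)
```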
Quentin

The root of my problem was actually character encoding. Although all of the documents were defined as UTF-8, the script somehow didn't recognize it. Once I told the script that the input files (the ones that were supposed to be fixed with entities) are UTF-8 encoded, it functioned correctly again.
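
The effect can be reproduced with a short sketch. This is not the original script; `replace_a_umlaut` is a hypothetical stand-in, and Python and Latin-1 as the wrongly assumed encoding are both assumptions made for illustration. The point is only that a regular expression looking for ä cannot match text that was decoded with the wrong encoding.

```python
import re

# Bytes as they would sit on disk in a UTF-8 encoded file.
raw = "Hyvää päivää".encode("utf-8")

def replace_a_umlaut(text: str) -> str:
    # Hypothetical stand-in for the original replacement script.
    return re.sub("ä", "&auml;", text)

# Decoded with the wrong encoding, each ä turns into two unrelated
# characters, so the pattern never matches.
wrong = replace_a_umlaut(raw.decode("latin-1"))

# Decoded as UTF-8, the pattern matches as intended.
right = replace_a_umlaut(raw.decode("utf-8"))

print(wrong)
print(right)
```

When reading from a file in Python, the equivalent fix is to pass the encoding explicitly, for example `open(path, encoding="utf-8")`, instead of relying on the platform default.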

As an answer to the question in the heading: to be absolutely sure that the documents are compatible with the server, yes, I am supposed to use entity encoding (though I understand that assuming the server allows UTF-8 is a pretty safe assumption in general, as implied by Quentin). For other reasons (related to automatic content generation), I'm also supposed to use the lang attribute.
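
For anyone who needs the same kind of conversion, one possible approach is sketched below in Python (again an assumption, not the script I actually used). The helper name `to_entities` is made up. It uses the standard library's `html.entities.codepoint2name` table to turn every non-ASCII character into a named entity, falling back to a numeric reference when no name exists, so the result is pure ASCII. Existing markup characters such as < and & are left untouched.

```python
from html.entities import codepoint2name

def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML entity."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # named entity
        else:
            out.append(f"&#{code};")  # numeric fallback
    return "".join(out)

converted = to_entities("<p>Hyvää päivää, Åke</p>")
print(converted)
```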