As of Jan 1, 2013, this page is no longer being maintained.
Non-Latin script problems?
Multi-lingual web pages and Unicode
Are you getting ????, Yíäýñíèé or other mojibake instead of the correct text for some languages? It's probably because your computer system can't display all of our Unicode text correctly. The good news is that most Unicode display problems can be fixed.
How do I fix Unicode display problems on my computer?
To display text in many different alphabets on one web page (e.g. Languages A-Z), we use Unicode, even though Unicode can create display problems for some computer systems. This web page offers solutions for those problems.
It may be that to "see" everything correctly on our Unicode pages, you only need to upgrade your browser and install, at most, one font, Code2000. Basically, you need:
- a Unicode compatible operating system (see Assistance: Introduction);
- a Unicode enabled browser (Assistance: Step 1); and
- Unicode-compatible font(s) (Assistance: Step 2);
and then (depending on which languages you want to display) you may need to:
- configure your browser (Assistance: Step 3 and Assistance: Step 4).
See also Display Problems? on the Unicode site and Help: Multilingual support on Wikipedia.
- What is Unicode? It is one of several systems (called encodings) that have been developed to manage the display of characters on-screen, but it is the first system that can assign a unique number (code) to every character in each of the world's major languages. (Other systems don't allow for enough characters and they also conflict with one another. That is, two encodings might use the same number for two different characters, or use different numbers for the same character.) Not all computer systems in current use are fully Unicode compatible.
Windows 7 comes with full support for Unicode (whether you're using Firefox, Opera, Chrome, Safari or Internet Explorer), and Mac OS X 10.7 (Lion) is not far behind.
- Encoding: a system of assigning numbers to characters (i.e. letters, punctuation, and mathematical notations) so a computer knows which character to display. Hundreds of different systems (encodings) have been developed and used. Unicode is one of them. Here are examples of how encodings are specified in the head of an HTML page:
- charset=iso-8859-1 (for Western No.1),
- charset=BIG5 (for Traditional Chinese), and
- charset=utf-8 (for Unicode).
- Code: the number assigned to the character. Problems happen when different encodings use the same code for two different characters, or use different codes for the same character. Synonyms for "code" that are also in use: code position, code number, code value, code element, code set value.
- Language Script: the group of characters used to express a language in writing. Also called the "character set" or "character repertoire" or "alphabet" or "writing system" of a language.
- Font: the font determines the way a character will actually look on the screen (or on a printed page). For instance, this "A" in a sans-serif font looks different than this "A" in a serif font, but it is still the same character. (The "A" and the "A" are known as different "glyphs" of the same character. A font is basically a collection of glyphs. Also note that "A" and "a" are two different characters.)
Most fonts don't come close to containing all possible characters in the world. Instead they contain ranges (also called "blocks") of characters (e.g. in Unicode, the codes (i.e. numbers) for Arabic characters are found in the range 0600 to 06FF). Unicode currently defines over 100 ranges, and for example, the newest, Unicode-compatible versions of:
- Arial (with 2792 characters and 3381 glyphs) and
- Times New Roman (with 2790 characters and 3380 glyphs),
contain only 39 ranges, while the:
- Akaash font (409 characters; 642 glyphs), specifically for Bengali,
is also Unicode-compatible, yet contains only 4 ranges: Basic Latin; Latin-1 Supplement; Latin Extended-A; and Bengali.
- NOTE: "language script" and "range" are sometimes synonymous, but some languages require characters from more than one range and even non-contiguous ranges (e.g. Vietnamese, and especially CJK (Chinese-Japanese-Korean)). CJK ideographs now encompass at least three ranges in two separate "planes" of Unicode.
- For more information, see also:
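To make the definitions of "encoding" and "code" above a little more concrete, here is a minimal sketch in Python (my choice of language; nothing on this page depends on it). The function names are made up for illustration. It shows the two kinds of conflict described above: one code standing for different characters in different encodings, and one character getting different codes. It also shows how text written in one encoding turns into mojibake when it is read as another.

```python
def decode_byte(value, encodings):
    """Show which character each encoding assigns to one byte value (code)."""
    return {enc: bytes([value]).decode(enc) for enc in encodings}


def encode_char(char, encodings):
    """Show which byte values (as hex) each encoding uses for one character."""
    return {enc: char.encode(enc).hex() for enc in encodings}


def misread(text, written_as, read_as):
    """Encode text one way and decode it another, which is how mojibake arises."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    # Same code, different characters (Western, Cyrillic and Greek encodings).
    print(decode_byte(0xE9, ["iso-8859-1", "cp1251", "iso-8859-7"]))
    # Same character, different codes.
    print(encode_char("й", ["cp1251", "koi8-r", "utf-8"]))
    # Cyrillic text saved as Windows-1251 but read as ISO-8859-1.
    print(misread("Привет", "cp1251", "iso-8859-1"))
```

With the byte E9, the Western encoding gives "é", the Cyrillic one gives "й" and the Greek one gives "ι". That is exactly the kind of clash Unicode was designed to avoid by giving every character its own number.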
Assistance: Introduction
Because Unicode is a relatively recent development that wasn't consistently utilized within the wide range of operating systems that surfers have used over the years, i.e.:
- Windows 95/98/ME/NT/2000/XP/Vista/Windows 7,
- Mac OS 8/OS 9/OS X,
- Linux (various versions), etc.,
or within the wide range of browsers (and browser versions):
- Firefox,
- Internet Explorer,
- Opera,
- Mozilla,
- Netscape,
- Chrome,
- Safari, etc.,
not all computer systems are currently fully Unicode compatible.
- Windows 7 comes with full support for Unicode, including fonts, whether you're using Firefox, Opera, Chrome, Safari or Internet Explorer, and Mac OS X 10.7 (Lion) is not far behind.
- Some Unicode support has been included in Mac OS since Mac OS 8.5, but prior to Mac OS X (10) only limited use was made of it by applications.
- Windows NT/2000/XP/Vista are based on Unicode, and some Unicode support has been included in Microsoft Windows since Windows 95.
- I've never used Linux (well, except for that one time).
If you have display problems with some of the links and/or text on our pages, you can try the steps set out below. My intention is to bring together, in one place, useful information I found when I was trying to figure out how to fix my own display problems, and to make that information as easy to understand as possible. Do keep in mind, though, that you don't have to understand everything here in order to get the hoped-for results from carrying out the steps. Again, it may turn out that to "see" everything on our pages correctly, you only need to upgrade your browser and install, at most, one font.
The suggestions I offer come from my experience using the following browsers and operating systems:
- with a Windows 7 operating system, I've used:
- Firefox 11 & 12
- Opera 11.6
- Chrome 18
- Safari 5.1, and
- Internet Explorer (IE) 9.
- with a Windows XP operating system, I've used:
- Firefox 2 to 11
- Netscape 7 & 8
- Mozilla 1.5 & 1.7
- Opera 7 to 11.6
- Chrome 5 & 18
- Safari 3, 4 & 5, and
- Internet Explorer (IE) 6 to 9.
- with a Windows 98 operating system, I've used:
- Netscape 4.79 & 7
- Mozilla 1.2.1 and 1.3b, and
- Internet Explorer (IE) 5.5.
(I think some of my suggestions could be useful for those with Windows 2000, NT 4 and Vista, and maybe even Windows 95.)
Because I only do Windows, the best I can offer those with other operating systems is to send you off-site to:
although some of what I say below may be applicable.
Assistance: Step-by-step
Step 0: You need a Unicode compatible operating system (see Introduction above for information)
Step 1: Selecting a browser
Step 2: Obtaining Unicode compatible fonts
Step 3: Configuring your browser by selecting fonts
Step 4: Configuring your browser by selecting encodings
NOTE: Most encodings are still used somewhere on the web, and these steps can be applied to all encodings, not just Unicode. However, if you are interested in viewing pages in a different encoding, such as Big5 (for Traditional Chinese) for example, in Step 2 you would need to make sure you had Big5-compatible fonts, rather than Unicode-compatible fonts.
Step 1: Selecting a browser
After I went through everything in Steps 1 to 4 above, and then browsed the Unicoded HotPeachPages and EarthWords pages:
- Actually, with Windows 7, I didn't have to go through any of the steps above, because Windows 7 comes with full support for Unicode, including fonts!! (except maybe Myanmar languages). All I had to do was turn on my new Windows 7 computer, use the built-in Internet Explorer to download the other browsers I'm reporting on, and then with:
- Firefox 11
- Opera 11.6
- Chrome 18
- Safari 5.1.4 and
- Internet Explorer 9,
I did not have any character display problems at all (and you shouldn't either, except maybe Myanmar languages).
- On Windows XP, Firefox 2 to 11, Opera 8, 9 & 10 and Netscape 7 & 8 displayed everything pretty much correctly. (Conjuncts & re-ordering for Khmer didn't work properly until I installed KhmerUnicode2 (for Windows XP) on April 22/10.)
- IE 5.5 (Win 98), and 6, 7 & 8 (Win XP) displayed everything pretty much correctly. (Again, conjuncts & re-ordering for Khmer didn't work properly until I installed KhmerUnicode2 (for Windows XP) on April 22/10.)
Caveat: On Win XP, for the HTML title attribute (the text that pops up when you hover over a link), IE 8 displays empty rectangles for Amharic, Sinhala and Tigrigna (even though the text for the link itself displays fine), whereas Moz-based browsers and Opera display the title text correctly. To see if you have the same issue, go to Domestic violence is more than just physical abuse using IE, and hover over the Amharic and Tigrigna language links at the top of the page to make the title boxes pop up. Let me know by email if you know how to fix it, or if you don't even have the issue in IE 8.
- On Windows XP, Chrome 5 to 18 didn't display Sinhala and had the same issue described above for IE in the Caveat. Everything else was pretty much fine.
- Netscape 7 and Mozilla 1.2 & 1.3 (Win 98), and Mozilla 1.5 & 1.7 and Opera 7.2 & 7.5 (Win XP) all displayed Arabic and Hebrew correctly right-to-left, but didn't produce conjuncts or re-ordering for Indic scripts.
- Netscape 4.79 (Win 98) and Opera 7.1 (Win XP) displayed Arabic and Hebrew left-to-right, i.e. incorrectly, and didn't produce conjuncts or re-ordering for Indic scripts.
- Safari:
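The right-to-left and conjunct/re-ordering behaviour mentioned in the results above depends on properties that Unicode records for each character, which the browser then has to act on. As a rough illustration only (the function names are made up, and real browsers do far more than this), Python's standard unicodedata module lets you look at those properties:

```python
import unicodedata


def direction(char):
    """Classify one character by its Unicode bidirectional class."""
    bidi = unicodedata.bidirectional(char)
    if bidi in ("R", "AL"):  # R = right-to-left, AL = Arabic letter
        return "right-to-left"
    if bidi == "L":
        return "left-to-right"
    return "other (" + bidi + ")"


def is_combining_mark(char):
    """True for marks that attach to a neighbouring character when displayed."""
    return unicodedata.category(char) in ("Mn", "Mc")


if __name__ == "__main__":
    print(direction("\u05d0"))          # Hebrew letter alef
    print(direction("\u0628"))          # Arabic letter beh
    print(direction("A"))               # Latin capital A
    print(is_combining_mark("\u094d"))  # Devanagari sign virama, used in conjuncts
```

A browser that ignores the direction property shows Arabic and Hebrew backwards, and one that doesn't handle the combining marks can't build the conjuncts that Indic scripts need.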
For more information about these and other browsers, go to:
- Alan Wood's:
and
- Wikipedia's:
Step 2: Obtaining Unicode compatible fonts
Make sure you have a Unicode-compatible font for either all the Unicode ranges, or for each of the language scripts you want to be able to display.
NOTE: To see what fonts you already have on your system, look in your Control Panel under Fonts. This will also show you the location of your Fonts folder for when you want to install a new font.
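If you're not sure which Unicode ranges a piece of text actually uses (and therefore which ranges your fonts need to cover), a small script can tell you. The sketch below is only an illustration: the table lists just a handful of ranges that I've filled in by hand, the real Unicode list is far longer, and the function names are made up.

```python
# A deliberately short table of Unicode ranges (blocks): first code, last code, name.
RANGES = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0980, 0x09FF, "Bengali"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def range_of(char):
    """Return the name of the range that contains this character's code."""
    code = ord(char)
    for first, last, name in RANGES:
        if first <= code <= last:
            return name
    return "(not in this short table)"


def ranges_used(text):
    """List each range the text draws on, in order of first appearance."""
    seen = []
    for char in text:
        if char.isspace():
            continue
        name = range_of(char)
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    print(ranges_used("caf\u00e9"))
    print(ranges_used("\u09ac\u09be\u0982\u09b2\u09be"))  # "Bangla" in Bengali script
```

A font like Akaash, which covers Basic Latin, Latin-1 Supplement, Latin Extended-A and Bengali, would be enough for both sample strings here.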
- Easiest: If you have either of the two currently available universal fonts:
- Arial Unicode MS (with almost 39,000 characters and over 50,000 glyphs in 65 ranges) was originally supplied through Microsoft Office 2000 and later, FrontPage 2000 and later, and Publisher 2002 and later, and was bundled with Mac OS X v10.5 and later. Now it is supplied with Windows 7. If you don't have any of these products, Arial Unicode MS can be purchased from Ascender Corporation, which licenses it from Microsoft,
OR
- Code2000* (over 50,000 characters and 60,000 glyphs in 105 ranges) is a free download, $5 honour-system registration,
you should be OK for most languages on our pages. In other words, to "see" everything on our pages, as I've said, you only need to upgrade your browser and install, at most, Code2000. Easy. (And the reason it's so easy, and inexpensive, is because James Kass worked on Code2000 for years as a labour of love and then basically gifted it to the world. James, you rock!)
*Note: Code2000 is OK in a pinch but not recommended for Chinese Simplified or Traditional, or for Japanese, and Arial Unicode MS is not OK for Lao (as of Office XP), but anyone who can read them probably already has appropriate fonts on their computer.
- Extra work: Because fonts designed for just one particular language script often present that script better than fonts that display several scripts, you may want to download further specific Unicode-compatible fonts for certain languages. On our EarthWords pages, for instance, we code a preference for the following fonts:
and we leave the rest up to the user's choices in Step 3, for which you need at least:
- Arial Unicode MS (again, originally supplied through Microsoft Office 2000 and later, FrontPage 2000 and later, and Publisher 2002 and later, and bundled with Mac OS X v10.5 and later; it is now supplied with Windows 7. If you don't have any of these products, Arial Unicode MS can be purchased from Ascender Corporation, which licenses it from Microsoft), OR
- Code2000 (again, free download, $5 honour-system registration).
In other words, to "see" everything on our pages almost exactly the way we intended, you only need to upgrade your browser and install, at most, five or six fonts. No big deal.
- Maximum effort: Because sites other than ours will prompt for fonts other than those mentioned above, you may want to download a whole whack of fonts. I suggest starting at Alan Wood's Unicode Resources*.
*NOTE: even though this page of Alan's is entitled "Unicode Fonts for Windows computers", it also has links for Mac and Unix.
*ALSO: Raghindi (listed on Alan's page under Devanagari Fonts) has been known to cause a conflict with other fonts on Windows 9x, including Code2000. It seems that many fonts produced for Windows 2000 and up lack the ASCII characters required for backwards compatibility on earlier versions of Windows. Installing such fonts on Win 9x is not recommended, as they have a tendency to "take over" the system. Raghindi is the only one I know about, but apparently there are others.
Step 3: Configuring your browser by selecting fonts
This is where you can choose a font for each language (aka writing system aka language script). Most languages display fine with the default font your browser has chosen, so you really only need to go in there if you don't like the default font for a particular language, or if a particular language is not displaying correctly with the default font. Here's where you go to select fonts in various browsers:
- IE: Tools > Internet Options > Fonts > Language script
- Opera: Tools/Settings > Preferences > Advanced > Fonts > International Fonts > Writing system
- Firefox: Tools > Options > Content > Advanced (Fonts & Colors) > Fonts for
- Netscape 8: Tools > Options > Browser Options, General > Fonts & Colors > Fonts for
- Safari: Edit > Preferences > ?
This step reveals a significant difference between Mozilla-based browsers (Firefox, Netscape and Mozilla) on the one hand, and IE (& Opera) on the other:
- for any particular language, IE and Opera have you choose only from fonts that will work with that language (usually no more than 10 will be on the list on my system).
- for every language, Mozilla browsers give you every font on your system to choose from (hundreds on mine), and if you have no idea what you are looking for, you'll be lost.
So what I do is, I use IE to see which fonts work with a particular language, and then I know what to look for in Firefox. If there's no font listed for a particular language, and it isn't displaying correctly, you have to go back to Step 2: Obtaining Unicode compatible fonts.
Alan Wood offers directions for configuring various browsers (not the latest versions, but probably still helpful) at Unicode and Multilingual Web Browsers. To help with the decisions about which fonts to choose for what, the following chart sets out font options that should work for Netscape encodings and for IE language scripts (it's very outdated, but I just can't bring myself to delete it).
Chart adapted from Yale University Library Workstation Support Group

Netscape (4.x and up) Font Options
Encoding | Variable width font | Fixed width font |
Western (ISO-8859-1) | (any number of options) | (any number of options) |
Central European (ISO-8859-2, Windows-1250) | Bitstream Cyberbit, Times New Roman | Courier New |
Japanese (Auto-Detect, Shift-JIS, EUC-JP) | Arial Unicode MS, MS Gothic | Arial Unicode MS, MS Gothic |
Traditional Chinese (Big5, EUC-TW) | Arial Unicode MS, MingLiU | Arial Unicode MS, MingLiU |
Simplified Chinese (GB2312) | Arial Unicode MS, MS Song | Arial Unicode MS, MS Hei |
Korean (Auto-Detect) | Arial Unicode MS, Code2000, GulimChe | Arial Unicode MS, Code2000, GulimChe |
Cyrillic (KOI8-R, ISO-8859-5, Windows-1251, CP866) | Arial Unicode MS, Code2000, Times New Roman | Courier New |
Baltic (ISO-8859-4, Windows-1257) | Arial Unicode MS, Code2000, Times New Roman | Courier New |
Greek (ISO-8859-7, Windows-1253) | Arial Unicode MS, Code2000, Times New Roman | Courier New |
Turkish (ISO-8859-9) | Arial Unicode MS, Bitstream Cyberbit, Code2000, Times New Roman | Courier New |
Unicode (UTF-8, UTF-7) | Arial Unicode MS, Code2000 | Arial Unicode MS, Code2000 |
UserDefined | Arial Unicode MS, Code2000 | Courier New, Courier New Baltic |

IE (5.5/6) Font Options
Language script | Web page font | Plain text font |
Arabic | Arabic Transparent, Arial Unicode MS, Bitstream Cyberbit, Tahoma, Traditional Arabic & ... | |
Armenian | Arial Unicode MS, Code2000 | |
Bengali | Akaash, Arial Unicode MS, Code2000 | |
Braille | Code2000 | |
Burmese | | |
CanSyllabic | Aboriginal Serif, Aboriginal Serif Unicode, Ballymun RO, Code2000 | |
Cherokee | Aboriginal Serif, Code2000 | |
Chinese Simplified | Arial Unicode MS, Bitstream Cyberbit, MS Hei, MS Song, SimSun-18030 | MS Hei, MS Song |
Chinese Traditional | Arial Unicode MS, Bitstream Cyberbit, MingLiU | MingLiU |
Cyrillic | Times New Roman & ... | Courier New, Andale Mono, Lucida Console |
Devanagari | Alpha-demo, Arial Unicode MS, Code2000, shiDeva | |
Ethiopic | Code2000, Ethiopia Jiret, GF Zemen Unicode, TITUS Cyberbit Basic | Ethiopia Jiret |
Georgian | Arial Unicode MS, Code2000, TITUS Cyberbit Basic | |
Greek | Times New Roman & ... | Courier New, Andale Mono, Lucida Console |
Gujarati | Arial Unicode MS, Code2000, Shruti | |
Gurmukhi | Arial Unicode MS, Code2000, Raavi | |
Hebrew | David, Miriam & ... | Miriam Fixed, Fixed Miriam Transparent, Rod |
Japanese | Arial Unicode MS, Bitstream Cyberbit, MS Gothic, MS Mincho | MS Gothic, MS Mincho |
Kannada | Arial Unicode MS, Code2000, Tunga | |
Khmer | Code2000, Khmer OS | |
Korean | Arial Unicode MS, Batang, Bitstream Cyberbit, Code2000, GulimChe | GulimChe |
Lao | Saysettha Unicode, Saysettha OT, VangVieng Unicode, XiengThong Unicode, Alice5 Unicode, Alice3 Unicode, Alice4 Unicode, Alice0 Unicode, Alice1 Unicode, Alice2 Unicode | |
Latin based | (any number of options) | Courier New .... |
Malayalam | Arial Unicode MS, Code2000, Kartika | |
Mongolian | Code2000 (?) | |
Ogham | Code2000, TITUS Cyberbit Basic | |
Oriya | Arial Unicode MS, Code2000 | |
Runic | Aboriginal Serif Unicode, Code2000, TITUS Cyberbit Basic | |
Sinhala | Dinamina, Potha | |
Syriac | Code2000, Estrangelo Edessa, TITUS Cyberbit Basic | |
Tamil | Arial Unicode MS, Code2000, Latha, TabAvarangal2 | |
Telugu | Arial Unicode MS, Code2000, Gautami | |
Thaana | Code2000, MV Boli, TITUS Cyberbit Basic | |
Thai | Cordia New, Angsana New, Arial Unicode MS, Bitstream Cyberbit, Code2000, IrisUPC, Microsoft Sans Serif, Saysettha OT, Tahoma | Courier Mono Thai |
Tibetan | Arial Unicode MS, NSimSun-18030, SimSun-18030 | |
UserDefined | Arial Unicode MS & all | Courier New ALA ... |
Yi | Code2000, NSimSun-18030, SimSun-18030 | |
Step 4: Configuring your browser by selecting the right encoding
You really only need to do this if the text on a page is gibberish. When that happens, the first thing you want to check is what encoding your browser is using. It may need changing. It's quite easy to check and even change encoding. Just click on 'View' on the top menu bar of any browser and then click:
- 'Character Encoding' for the Mozilla browsers (Firefox, Netscape and Mozilla);
- 'Encoding' for IE and Opera; and
- 'Text Encoding' for Safari
The encoding with the dot or check mark is the one being used. You can take an educated guess as to what you should change to, depending on what language the gibberish is supposed to be. For example, choose one of the Japanese encodings if the gibberish is supposed to be Japanese. Then just keep choosing till it works.
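The "keep choosing till it works" approach can also be tried outside the browser if you have the raw bytes of a page saved in a file. Here is a minimal sketch, with made-up function names, that first looks for a charset declaration like the ones shown earlier on this page and then tries a list of candidate encodings. It is a rough illustration under those assumptions, not how any browser actually does its detection.

```python
import re


def declared_charset(raw_html):
    """Return the charset named in the page's markup, lower-cased, or None."""
    match = re.search(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
                      raw_html, re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else None


def try_encodings(raw, candidates):
    """Decode raw bytes with each candidate; None marks a candidate that fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    head = b'<meta http-equiv="Content-Type" content="text/html; charset=BIG5">'
    print(declared_charset(head))

    # Japanese text saved as Shift-JIS, then tried with two candidates.
    raw = "\u65e5\u672c\u8a9e".encode("shift_jis")
    for name, text in try_encodings(raw, ["utf-8", "shift_jis"]).items():
        print(name, "->", text if text is not None else "(cannot decode)")
```

Keep in mind that a candidate that decodes without an error isn't necessarily the right one. Some encodings (ISO-8859-1, for example) accept any bytes at all, so you still have to look at the result and see whether it reads as real text.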
More detailed directions on how to select encodings for various versions of different browsers (again, not the latest versions, but probably still helpful) can be found on the same pages where the directions for Step 3 are located (i.e. go to Alan Wood's Unicode and Multilingual Web Browsers, click on a browser, then scroll down to the end of the instructions for selecting Fonts till you see the instructions for Encodings).
Acknowledgements
Thank you especially to James Kass (last archive copy of James Kass' website on the Wayback Machine), Jukka "Yucca" Korpela and Alan Wood. Were it not for their work and excellent material freely available on the web (and James Kass' generous help and suggestions), I would understand very little about encoding systems, or about Unicode and how to use it, and the above would not exist.