|
As is well known, XML can use non-Latin characters. According to the XML Specification, XML uses ISO 10646, the international standard 31-bit character repertoire, which covers most human (and even some non-human) languages. It is currently congruent with Unicode and is intended to be a superset of Unicode.
|
|
To take an example (which may have some practical value in itself to people using non-Latin languages), assume that you have some text data which is expected to be,
say, in Swedish, German or Finnish, and which appears to be such text with some characters replaced by oddities in a somewhat regular way. Locate some words which
probably should contain the letter "ä" but have something strange in its place. Suppose further that the program you are using interprets
text data as ISO 8859-1 (Latin-1) by default and that the actual data is not accompanied by a suitable indication of the encoding, or such an indication is obviously
in error. Looking at what appears in your XML output instead of "ä", you may see: Ã¤
Let us suppose that we have an Excel file named "Swans.xls". This file has only one filled cell, which contains the word "Schwäne". Using RustemSoft XML Converter, we can transform this Excel file to XML format. Take a look at what we might get:
|
|
|
|
In this case the data is evidently in UTF-8 encoding. Notice that the characters Ã and ¤ stand here for octets 195 and 164, which might be displayed differently depending on the browser and character set used.
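The effect can be reproduced in a few lines of Python. This is only a sketch of the general mechanism (UTF-8 bytes read back as ISO 8859-1); it assumes nothing about how any particular converter or viewer works internally, and the variable names are chosen here for illustration.

```python
# The letter "ä" is code point U+00E4 (decimal 228).
text = "Schwäne"

# Encoding to UTF-8 turns "ä" into the two octets 195 and 164 (0xC3 0xA4).
utf8_bytes = text.encode("utf-8")
a_umlaut_octets = list("ä".encode("utf-8"))
print(a_umlaut_octets)  # [195, 164]

# Reading the same bytes as ISO 8859-1 maps each octet to one character,
# which produces the garbled form.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # SchwÃ¤ne

# Undoing the wrong decoding step recovers the original text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired)  # Schwäne
```

If the garbled form shows up in your output, the data itself is usually fine; it is the declared or assumed encoding that needs correcting.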
You may put "&#228;" in XML instead of "Ã¤" for "ä". You can still refer to specific individual characters from elsewhere in the encoded repertoire by using numeric character references, as in HTML.
Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its code point, no matter which character set you personally use. You can use the &#xXXXX;
(hexadecimal character code; the "x" must be lowercase, the digits may be in either case) or &#DDDDD; (decimal character code) numeric character references, as in HTML, in your XML output. But you really do not need to do that. All XML processors must accept the UTF-8 and UTF-16 encodings
of ISO 10646. UTF-8 is an encoding of Unicode into 8-bit units: the first 128 code points are encoded exactly as in ASCII, and the rest of Unicode is encoded as sequences of 2 to 4 bytes (the original definition allowed up to 6).
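As a rough illustration of the two reference forms and of UTF-8 sequence lengths, here is a short Python sketch. The helper `to_hex_refs` and the element name `w` are made up for this example; only standard-library calls are used.

```python
import xml.etree.ElementTree as ET

text = "Schwäne"

# Decimal references: Python's "xmlcharrefreplace" error handler writes
# every character outside ASCII as &#DDDDD;.
decimal_form = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(decimal_form)  # Schw&#228;ne

def to_hex_refs(s):
    """Hypothetical helper: write non-ASCII characters as &#xXXXX;."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in s)

hex_form = to_hex_refs(text)
print(hex_form)  # Schw&#xE4;ne

# An XML parser resolves both forms to the same character.
from_decimal = ET.fromstring(f"<w>{decimal_form}</w>").text
from_hex = ET.fromstring(f"<w>{hex_form}</w>").text
print(from_decimal == from_hex == text)  # True

# Number of UTF-8 bytes for characters from different ranges.
utf8_lengths = {
    f"U+{ord(c):04X}": len(c.encode("utf-8"))
    for c in ("A", "ä", "\u20ac", "\U0001D11E")
}
print(utf8_lengths)  # {'U+0041': 1, 'U+00E4': 2, 'U+20AC': 3, 'U+1D11E': 4}
```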
RustemSoft XML Converter supports non-Latin characters in XML through the UTF-8 and UTF-16 encodings, so people around the world can easily use it. On the "Options" menu bar you can choose
between two ways of writing them. In the "Unicode UTF-8 encoding" section you can adjust how non-Latin characters appear in your final XML: choose either "UTF8-encoded characters as literals" or "UTF8-encoded characters as HTML codes".
|

|
|
If you choose the "UTF8-encoded characters as HTML codes" option, your "Swans.xls" file may look as follows after conversion to XML format:
|
|
|
|
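The difference between writing characters as literals and writing them as numeric codes can be imitated with Python's standard `xml.etree.ElementTree` module. This is a sketch of the general idea only, not the converter's actual output: the element names `row` and `cell` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the element names are invented for this sketch.
row = ET.Element("row")
cell = ET.SubElement(row, "cell")
cell.text = "Schwäne"

# Literal characters: the output bytes contain the UTF-8 octets for "ä".
as_literals = ET.tostring(row, encoding="utf-8")

# Numeric codes: with an ASCII-only output encoding the serializer has to
# write "ä" as a numeric character reference.
as_references = ET.tostring(row, encoding="us-ascii")

print(as_literals)
print(as_references)

# Both documents carry the same text once they are parsed again.
text_from_literals = ET.fromstring(as_literals).find("cell").text
text_from_references = ET.fromstring(as_references).find("cell").text
print(text_from_literals == text_from_references)
```

Either form is valid XML and yields the same data for a conforming parser; the numeric form is mainly useful when a downstream tool cannot be trusted to handle UTF-8 correctly.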
For a better understanding of Unicode UTF-8 encoding you can visit Oscar van Vlijmen's page: http://www1.tip.nl/~t876506/utf8tbl.html
This page presents a table demonstrating the UTF-8 encoding and conversion algorithms for Unicode UTF-8.
|
Copyright © 2001-2008 RustemSoft
|