Data Access Worldwide Knowledge Base

Article ID 2024
Article Title INFO: Character Set Systems
Article URL http://www.dataaccess.com/kbasepublic/kbprint.asp?ArticleID=2024
KBase Category VDF (GENERAL)
Date Created 07/16/2004
Last Edit Date 10/24/2007


Article Text
QUESTION:
I would like to know how to transfer 'extended' characters via XML. By 'extended' I mean characters outside the 128 character range.

My program is sending data from DataFlex data files over the Internet in XML documents but the extended characters get 'translated' to unreadable ones...


SOLUTION:
Here is some background information that might clear up the mysteries. If you bear with me, I'll explain how to solve your problem at the end of this message.

Traditionally, there are basically two types of character set systems: OEM and ANSI. Each is a single-byte character set, that is, each character is represented by a single byte. This means that the OEM character set, as well as the ANSI character set, can only represent 256 different characters. This is usually not enough to represent all the different kinds of characters, such as extended Swedish characters as well as extended German characters, for example.

The solution to that problem was to introduce different character mappings. Basically, a Swedish mapping will include all extended Swedish characters, and a German mapping will include all extended German characters. It's important to understand that all possible extended characters in the world cannot fit into one character mapping. So the Swedish one, while including extended Swedish characters, does not include extended German characters, and vice versa.
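As a rough illustration of what a character mapping is, the Python sketch below decodes one and the same byte value under three different mappings. The code pages used (cp1252, cp1251 and cp1253) and the helper names are my own choices for the demonstration, not something taken from VDF.

```python
# One byte value, decoded under three different single-byte mappings.
# cp1252 (Western European), cp1251 (Cyrillic) and cp1253 (Greek) are
# example code pages chosen for this sketch only.
RAW = bytes([0xE5])


def decode_with(code_page: str) -> str:
    """Interpret the single byte RAW according to the given code page."""
    return RAW.decode(code_page)


def fits_in(text: str, code_page: str) -> bool:
    """Return True if every character of text exists in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


western = decode_with("cp1252")   # Latin small letter a with ring above
cyrillic = decode_with("cp1251")  # Cyrillic small letter ie
greek = decode_with("cp1253")     # Greek small letter epsilon

print(western, cyrillic, greek)
print(fits_in("å", "cp1252"), fits_in("å", "cp1251"))
```

The byte itself never changes; only the mapping used to interpret it does. That is why a mapping that holds the Swedish characters has no room left for, say, the Cyrillic ones.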

These two character set systems, OEM and ANSI, each have their own mappings, which include different characters. For example, there might be an OEM character mapping called 1234, which will among other things include all extended Swedish characters. And then there might be an ANSI character mapping called 5678, which will also include all extended Swedish characters. Since these two mappings belong to two different character set systems, they are not identical. They might both include all extended Swedish characters, but they might map them differently, and consequently the extended characters will have different byte values depending on the character set system.
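To make that concrete, here is a small Python sketch that encodes the same three Swedish characters with one OEM code page and one ANSI code page. I am assuming cp850 and cp1252 as the pair; the mappings active on your machine may be different ones.

```python
# The same text encoded with an OEM and an ANSI code page.
# cp850 (OEM) and cp1252 (ANSI) are assumed here as a typical
# Western European pair; they are not named in the article.
text = "åäö"

oem_bytes = text.encode("cp850")
ansi_bytes = text.encode("cp1252")

print("OEM :", oem_bytes.hex(" "))
print("ANSI:", ansi_bytes.hex(" "))

# Both mappings contain the characters, but at different byte values.
same_values = oem_bytes == ansi_bytes
```

Both code pages contain all three characters, yet not a single byte value matches between the two encodings.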

To make this more complicated, Visual DataFlex, as DataFlex traditionally has, always uses the OEM character set system, whereas Windows traditionally uses the ANSI character set system.

So, VDF converts between the active OEM character mapping (or codepage) and the active ANSI character mapping (or codepage) all the time. This process is what we generally refer to as OEM/ANSI conversion.
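The idea behind OEM/ANSI conversion can be sketched in a few lines of Python. This is not how VDF implements it internally; the function names are hypothetical, and cp850 and cp1252 are assumed as the active OEM and ANSI code pages.

```python
# A sketch of OEM/ANSI conversion: decode with one mapping, then
# encode with the other. Function names and default code pages are
# assumptions made for this illustration.
def oem_to_ansi(data: bytes, oem: str = "cp850", ansi: str = "cp1252") -> bytes:
    """Re-encode OEM bytes as ANSI bytes."""
    return data.decode(oem).encode(ansi)


def ansi_to_oem(data: bytes, oem: str = "cp850", ansi: str = "cp1252") -> bytes:
    """Re-encode ANSI bytes as OEM bytes."""
    return data.decode(ansi).encode(oem)


oem_value = b"\x86\x84\x94"            # the letters a-ring, a-umlaut, o-umlaut in cp850
ansi_value = oem_to_ansi(oem_value)    # the same letters in cp1252
round_trip = ansi_to_oem(ansi_value)   # and back again

print(ansi_value.hex(" "), round_trip.hex(" "))
```

Note that the conversion is only lossless for characters that exist in both mappings. An OEM box-drawing character, for instance, has no counterpart in cp1252, and in this sketch the conversion raises a UnicodeEncodeError for it.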

So far it's all pretty clear. Unicode, on the other hand, is different from both OEM and ANSI. Unicode is a universal character set in which a character can be represented by more than one byte. This means that Unicode is not limited to 256 characters: its code space has room for more than a million different characters. This leaves room for all extended characters in the world in one huge character set. Since all extended characters fit into the same character set, there's no reason to resort to special character mappings (or codepages).
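A quick way to see the "one huge character set" idea is to look at the numbers Unicode assigns to characters from different languages. The sample characters in this Python sketch are my own picks.

```python
import sys

# Characters from different languages, each with its own unique number
# (code point) in the single Unicode character set. The samples are
# chosen for illustration only.
samples = {
    "Swedish a-ring": "\u00e5",
    "German sharp s": "\u00df",
    "Cyrillic ie": "\u0435",
    "Euro sign": "\u20ac",
}

code_points = {name: f"U+{ord(ch):04X}" for name, ch in samples.items()}

for name, code_point in code_points.items():
    print(f"{name:15} {code_point}")

# The highest code point Python (and Unicode) allows.
print(hex(sys.maxunicode))
```

Every character has exactly one number, no matter which language it comes from, so no code page has to be selected to tell them apart.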

However, Unicode can be encoded in different encoding styles, for example UTF-8 and UTF-16. All this means is that each character is by default encoded in units of one byte (UTF-8) or two bytes (UTF-16).

If a character needs more than one such unit to be represented, the first unit will have a special value which tells how many units the character occupies. This is very different from code pages; it's merely a means to "compress" (for lack of a better word) the data.
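The way the first byte announces the size of a character can be shown for UTF-8 with a short Python sketch. The helper name utf8_sequence_length is hypothetical; the bit patterns it checks are the standard UTF-8 lead-byte patterns.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judging only
    by the value of its first byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: plain 7-bit character
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


# Sample characters: letter a, a-ring, euro sign, and an emoji.
samples = ["a", "\u00e5", "\u20ac", "\U0001F600"]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")
    print(f"U+{ord(ch):04X}  utf-8: {len(utf8)} byte(s)  utf-16: {len(utf16)} byte(s)")
```

Plain 7-bit characters stay a single byte in UTF-8, which is why a document containing only such characters looks identical in OEM, ANSI and UTF-8, and why the problem only shows up with extended characters.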


Now, why are you having problems representing extended characters in XML?
Well, the problem lies in the different character set systems you're using. You say that you specify UTF-8 in the XML document, but the problem is that the data that you're writing to the document is not in UTF-8. As you can imagine, this will throw off anyone trying to read the document and interpret the text as Unicode in the UTF-8 encoding style.

Why is your data not in UTF-8?
Because as you may remember, VDF uses the OEM character set system. So when you write data to the XML document, you're writing it as OEM, but you're telling whoever is reading it that it's UTF-8. And that will, of course, cause problems.
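The mismatch can be reproduced outside of VDF. The Python sketch below takes OEM bytes (cp850 is assumed as the OEM code page) and reads them the way a receiver would if it trusted a UTF-8 declaration. The helper names are made up for the example.

```python
# OEM bytes being read by someone who was told they are UTF-8.
# cp850 is assumed as the OEM code page for this sketch.
oem_bytes = "Malmö".encode("cp850")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def read_as_utf8(data: bytes) -> str:
    """Decode as UTF-8, substituting the replacement character
    (U+FFFD) for every byte that cannot be interpreted."""
    return data.decode("utf-8", errors="replace")


print(oem_bytes)
print(is_valid_utf8(oem_bytes))
print(read_as_utf8(oem_bytes))
```

A strict reader rejects the data outright, and a lenient one shows a replacement character where the extended character should be. Either way the extended character is lost, which matches the 'unreadable' characters described in the question.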

How do you fix this?
You need to make sure that the data that you write to the document is in the same format as you say it is. You also need to make sure that the data is in a format that the reader understands. The best way to do this is to use Unicode and the UTF-8 encoding style. This means that before you write the data to the file, you need to convert it from your format (which is OEM) to UTF-8.

Fortunately, if you use the cXMLHttpTransfer class, all you need to do is to set a property, and it will do all the conversion for you. Now that's simple, isn't it? If on the other hand you're not using the cXmlHttpTransfer class, then you will need to do the conversion manually.
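For readers who want to see what the manual conversion amounts to, here is a minimal sketch in Python rather than VDF. It assumes cp850 as the OEM code page, and the element names (customer, name) and function names are hypothetical; in a VDF program you would use the conversion facilities that VDF and Windows provide instead.

```python
import xml.etree.ElementTree as ET


def oem_to_utf8(data: bytes, oem_code_page: str = "cp850") -> bytes:
    """Convert OEM bytes to UTF-8 bytes (cp850 is an assumed default)."""
    return data.decode(oem_code_page).encode("utf-8")


def build_document(oem_field: bytes, oem_code_page: str = "cp850") -> bytes:
    """Build a small XML document whose content really is UTF-8,
    matching the encoding named in its XML declaration."""
    root = ET.Element("customer")           # hypothetical element names
    name = ET.SubElement(root, "name")
    name.text = oem_field.decode(oem_code_page)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


oem_field = "Malmö".encode("cp850")         # data as stored in OEM form
document = build_document(oem_field)

print(document.decode("utf-8"))

# A reader that trusts the declaration now gets the right text back.
parsed_name = ET.fromstring(document).find("name").text
print(parsed_name)
```

The important point is the order of the steps: the OEM bytes are first interpreted with the OEM mapping, and only then written out as UTF-8, so the declaration and the actual bytes agree.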



Contributed By:
Sonny Falk
Company: Data Access Worldwide
Web Site: http://www.dataaccess.com

Web Links Related to this Article
DAW Knowledge Base article 1343: HOWTO: Add '< ?xml version="1.0" encoding="ISO8859-1"?>' to XML files
URL=http://www.dataaccess.com/KBasePublic/KBPrint.asp?ArticleID=1343

Microsoft - About Unicode and Character Sets
URL=http://msdn2.microsoft.com/en-us/library/ms776408.aspx

Microsoft - Character Sets
URL=http://www.microsoft.com/typography/unicode/cs.htm

Microsoft - Code Page Identifiers
URL=http://msdn2.microsoft.com/en-us/library/ms776446.aspx

SQL Server - Copying Data Between Different Code Pages
URL=http://doc.ddart.net/mssql/sql70/impt_bcp_35.htm

Unicode
URL=http://www.unicode.org/


Copyright ©2010 Data Access Corporation. All rights reserved.

The information provided in the Data Access Technical Knowledge Base is provided "as is" without warranty of any kind. Data Access Corporation disclaims all warranties, either express or implied, including the warranties of merchantability and fitness for a particular purpose. In no event shall Data Access Corporation or its suppliers be liable for any damages whatsoever including direct, indirect, incidental, consequential, loss of business profits or special damages, even if Data Access Corporation or its suppliers have been advised of the possibility of such damages. Some states do not allow the exclusion or limitation of liability for consequential or incidental damages so the foregoing limitation may not apply.