Choosing a Character Encoding

Definitions

A codepage, or character set, is a collection of characters. Historically, characters from different languages were divided into different character sets, because computers were able to "address" only a limited number of characters at a time. Codepages were therefore defined to support specific languages or groups of languages with similar writing systems. For instance, codepage 1251 contains the characters used in both the Bulgarian and Russian alphabets.

Legacy character encodings use either Single-Byte Character Sets (SBCS) or Multi-Byte Character Sets (MBCS), the latter often referred to as Double-Byte Character Sets (DBCS). An SBCS contains up to 256 character codes, while a DBCS mixes single-byte and double-byte characters and can represent up to 65,536 characters.
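The difference between the two families can be seen by counting the bytes each character occupies. The following is a minimal sketch in Python, not part of DeltaWalker; the helper name byte_lengths is hypothetical, and Shift JIS is chosen here only as a familiar example of a double-byte set.

```python
def byte_lengths(text, encoding):
    """Return the number of bytes each character of `text` occupies in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]

# Codepage 1251 is a single-byte set: every character takes exactly one byte.
sbcs_lengths = byte_lengths("Жар", "cp1251")

# Shift JIS is a double-byte set: ASCII letters take one byte, kanji take two.
dbcs_lengths = byte_lengths("A日本", "shift_jis")

print(sbcs_lengths)
print(dbcs_lengths)
```

Because character widths vary within a double-byte set, code that works with such text cannot assume that the n-th character starts at the n-th byte.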

Modern character encodings are based on Unicode. Its UTF-16 form uses 16-bit code units, and a single code unit is enough to represent most of the characters used throughout the world. Each Unicode code point refers unambiguously to a given character. Compared to SBCS, Unicode allows for addressing a considerably larger range of characters. Compared to MBCS, Unicode offers a simplified model for working with text.
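To illustrate what "unambiguously" means here, the sketch below (plain Python, independent of DeltaWalker) decodes the same byte with two different codepages and then looks at the Unicode code point of one of the results. The variable names are illustrative only.

```python
raw = bytes([0xE6])

# The same byte stands for different characters in different codepages.
as_cyrillic = raw.decode("cp1251")   # Cyrillic small letter zhe
as_western = raw.decode("cp1252")    # Latin small letter ae

# A Unicode code point identifies exactly one character, whatever the codepage.
code_point = ord(as_cyrillic)

# In UTF-16 this character fits in a single 16-bit code unit (two bytes).
utf16_size = len(as_cyrillic.encode("utf-16-le"))

print(as_cyrillic, as_western, hex(code_point), utf16_size)
```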

Selecting a Character Encoding

Inherently Unicode-based, DeltaWalker has built-in functionality for detecting the character encoding of a given text file. Unicode encodings are typically easy to detect thanks to a leading identifier, the byte order mark (two bytes long in UTF-16), while SBCS encodings do not lend themselves well to auto-detection. If a character encoding is detected incorrectly, many or all characters will appear garbled or unreadable.
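As a rough illustration of why a leading identifier makes detection easy, here is a minimal Python sketch that checks for a byte order mark. It is not DeltaWalker's actual detection logic, which is not described in this topic; the function name detect_bom is hypothetical, and the BOM constants come from Python's standard codecs module.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 little-endian mark
# begins with the same two bytes as the UTF-16 little-endian mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def detect_bom(data):
    """Return a Python codec name if `data` starts with a byte order mark, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

# Python's "utf-16" codec writes a byte order mark, so detection succeeds.
unicode_bytes = "Текст".encode("utf-16")
detected = detect_bom(unicode_bytes)

# Single-byte text carries no such mark, so nothing can be concluded from it.
legacy_bytes = "Текст".encode("cp1251")
undetected = detect_bom(legacy_bytes)

# Decoding with the wrong codepage raises no error; it yields garbled text.
garbled = legacy_bytes.decode("cp1252")
print(detected, undetected, garbled)
```

The last step shows the failure mode described above: the bytes decode without error under the wrong codepage, so the only symptom is unreadable text.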

In case DeltaWalker is unable to correctly detect the character encoding of a file, you can easily select it from one of several places:

Partial screenshot of Select File dialog

As these screenshots illustrate, DeltaWalker allows you to select an encoding either by the language corresponding to a charset or by the charset name itself. Several languages can share a single encoding, and some encodings have no corresponding language. Therefore, when you switch from Languages to Charsets, DeltaWalker always maps the current language to its corresponding charset, but not the other way around.
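The one-way nature of this mapping can be sketched with a small lookup table. The Python below is purely illustrative: the table contents, the charset labels, and the function names are assumptions made for this example and do not reflect DeltaWalker's internal data.

```python
# Hypothetical table: every language maps to exactly one charset.
LANGUAGE_TO_CHARSET = {
    "Bulgarian": "windows-1251",
    "Russian": "windows-1251",
    "Japanese": "Shift_JIS",
}

def charset_for(language):
    """Language to charset: always a single, well-defined answer."""
    return LANGUAGE_TO_CHARSET[language]

def languages_for(charset):
    """Charset to languages: may yield several languages, or none at all."""
    return sorted(lang for lang, cs in LANGUAGE_TO_CHARSET.items() if cs == charset)

print(charset_for("Russian"))
print(languages_for("windows-1251"))
print(languages_for("UTF-16"))
```

Because the reverse lookup can return two languages or an empty list, there is no single language to switch back to, which is consistent with the mapping working in one direction only.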

The Editing preference page allows you to select the default encoding (by language or by character set) for new text files created in DeltaWalker. Unless you override the default encoding, for example in the Set Encoding dialog, a new file is saved to disk with the default encoding.
