9. How does my computer store things in memory?

You probably know that everything on a computer is stored as strings of bits (binary digits; you can think of them as lots of little on-off switches). Here we'll explain how those bits are used to represent the letters and numbers that your computer is crunching.

Before we can go into this, you need to understand about the word size of your computer. The word size is the computer's preferred size for moving units of information around; technically it's the width of your processor's registers, which are the holding areas your processor uses to do arithmetic and logical calculations. When people write about computers having bit sizes (calling them, say, "32-bit" or "64-bit" computers), this is what they mean.

Most computers now have a word size of 64 bits. In the recent past (early 2000s) many PCs had 32-bit words. The old 286 machines back in the 1980s had a word size of 16 bits. Old-style mainframes often had 36-bit words.
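If you want to check the word size of the machine you're sitting at, here's a tiny C sketch that does it (this assumes a Unix-like system where a pointer and a long are each one machine word, which is true of typical Linux boxes but not of every platform; compile it with something like 'cc wordsize.c -o wordsize'):

    #include <stdio.h>

    int main(void)
    {
        /* On most Unix systems a pointer and a long each occupy one machine word. */
        printf("pointer size: %zu bits\n", sizeof(void *) * 8);
        printf("long size:    %zu bits\n", sizeof(long) * 8);
        return 0;
    }

On a 64-bit Linux machine both lines will say 64; on an old 32-bit PC they would have said 32.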

The computer views your memory as a sequence of words numbered from zero up to some large value dependent on your memory size. That value is limited by your word size, which is why programs on older machines like 286s had to go through painful contortions to address large amounts of memory. I won't describe them here; they still give older programmers nightmares.
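You can get a rough feel for this numbering scheme with another small C sketch that prints the addresses of four consecutive machine words (the actual numbers you see will differ from run to run and from machine to machine; what matters is that they go up in word-sized steps, 8 bytes apart on a 64-bit machine):

    #include <stdio.h>

    int main(void)
    {
        long words[4];   /* four consecutive machine words */

        for (int i = 0; i < 4; i++)
            printf("words[%d] lives at address %p\n", i, (void *)&words[i]);
        return 0;
    }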

9.1. Numbers

Integer numbers are represented as either words or pairs of words, depending on your processor's word size. One 64-bit machine word is the most common integer representation.

Integer arithmetic is close to but not actually mathematical base-two. The low-order bit is 1, next 2, then 4 and so forth as in pure binary. But signed numbers are represented in twos-complement notation. The highest-order bit is a sign bit which makes the quantity negative, and every negative number can be obtained from the corresponding positive value by inverting all the bits and adding one. This is why integers on a 64-bit machine have the range -2^63 to 2^63 - 1. That 64th bit is being used for sign; 0 means a positive number or zero, 1 a negative number.

Some computer languages give you access to unsigned arithmetic which is straight base 2 with zero and positive numbers only.
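Here's a small C sketch illustrating both points: negating a signed number really is the same as flipping all its bits and adding one, and unsigned arithmetic is straight base 2 that wraps around instead of going negative. (The %016lx format dumps the raw 64 bits as hexadecimal; this assumes a 64-bit long, as on current Linux machines.)

    #include <stdio.h>

    int main(void)
    {
        long x = 42;

        /* Twos-complement: -x has the same bit pattern as ~x + 1,
           and the high-order (sign) bit is set. */
        printf(" 42     = %016lx\n", (unsigned long)x);
        printf("-42     = %016lx\n", (unsigned long)-x);
        printf("~42 + 1 = %016lx\n", (unsigned long)(~x + 1));

        /* Unsigned arithmetic is straight base 2; subtracting past zero wraps
           around to the largest representable value instead of going negative. */
        unsigned long u = 0;
        printf("0 - 1 as unsigned = %lu\n", u - 1);
        return 0;
    }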

Most processors and some languages can do operations in floating-point numbers (this capability is built into all recent processor chips). Floating-point numbers give you a much wider range of values than integers and let you express fractions. The ways in which this is done vary and are rather too complicated to discuss in detail here, but the general idea is much like so-called ‘scientific notation’, where one might write (say) 1.234 * 10^23; the encoding of the number is split into a mantissa (1.234) and the exponent part (23) for the power-of-ten multiplier (which means the number multiplied out would have 20 zeros on it, 23 minus the three decimal places).
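If you're curious, the C library will show you this split for hardware floating point, which uses a power-of-two multiplier rather than a power of ten but works on the same principle. The frexp() function breaks a double into a mantissa and an exponent:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double x = 1.234e23;
        int exponent;

        /* frexp() splits x into mantissa * 2^exponent, with the mantissa
           between 0.5 and 1. */
        double mantissa = frexp(x, &exponent);
        printf("%g = %f * 2^%d\n", x, mantissa, exponent);
        return 0;
    }

On a typical 64-bit machine this prints the number split into a mantissa of about 0.8166 and a power-of-two exponent of 77.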

9.2. Characters

Characters are normally represented as strings of seven bits each in an encoding called ASCII (American Standard Code for Information Interchange). On modern machines, each of the 128 ASCII characters is the low seven bits of an octet or 8-bit byte; octets are packed into memory words so that (for example) a six-character string only takes up one 64-bit memory word. For an ASCII code chart, type ‘man 7 ascii’ at your Unix prompt.
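Here's a C sketch that shows both the ASCII codes of the individual characters in a string and the packing of their octets into a single 64-bit word. (The hexadecimal dump assumes a little-endian processor such as the x86, where the first character lands in the low-order byte of the word, so the dump reads right-to-left.)

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
        const char *s = "hello!";

        /* Each character is one octet holding a 7-bit ASCII code. */
        for (int i = 0; s[i] != '\0'; i++)
            printf("'%c' = %d\n", s[i], s[i]);

        /* Six octets fit easily into one 64-bit word (eight would fit). */
        uint64_t word = 0;
        memcpy(&word, s, strlen(s));
        printf("packed into one word: %016llx\n", (unsigned long long)word);
        return 0;
    }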

The preceding paragraph was misleading in two ways. The minor one is that the term ‘octet’ is formally correct but seldom actually used; most people refer to an octet as a byte and expect bytes to be eight bits long. Strictly speaking, the term ‘byte’ is more general; there used to be, for example, 36-bit machines with 9-bit bytes (though there probably never will be again).

The major one is that not all the world uses ASCII. In fact, much of the world can't — ASCII, while fine for American English, lacks many accented and other special characters needed by users of other languages. Even British English has trouble with the lack of a pound-currency sign.

There have been several attempts to fix this problem. All of them use the extra high bit that ASCII doesn't use, making ASCII the low half of a 256-character set. The most widely used of these is the so-called ‘Latin-1’ character set (more formally called ISO 8859-1). This is the default character set for Linux, older versions of HTML, and X. Microsoft Windows uses a mutant version of Latin-1 that adds a bunch of characters such as right and left double quotes in places proper Latin-1 leaves unassigned for historical reasons (for a scathing account of the trouble this causes, see the demoroniser page).

Latin-1 handles western European languages, including English, French, German, Spanish, Italian, Dutch, Norwegian, Swedish, Danish, and Icelandic. However, this isn't good enough either, and as a result there is a whole series of Latin-2 through -9 character sets to handle things like Greek, Arabic, Hebrew, Esperanto, and Serbo-Croatian. For details, see the ISO alphabet soup page.

The ultimate solution is a huge standard called Unicode (and its identical twin ISO/IEC 10646-1:1993). Unicode is identical to Latin-1 in its lowest 256 slots. Above these in 16-bit space it includes Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan, Japanese Kana, the complete set of modern Korean Hangul, and a unified set of Chinese/Japanese/Korean (CJK) ideographs. For details, see the Unicode Home Page. XML and XHTML use this character set.

Recent versions of Linux use an encoding of Unicode called UTF-8. In UTF-8, characters 0-127 are each encoded as a single byte identical to their ASCII code. Byte values 128-255 appear only in sequences of 2 through 4 bytes that encode non-ASCII characters.
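You can watch UTF-8 at work with one last C sketch. The string below mixes a plain ASCII 'a', the pound-currency sign (U+00A3, which UTF-8 encodes in two bytes), and a Chinese ideograph (U+4E2D, three bytes); the byte values are written out explicitly here so the example doesn't depend on your editor's encoding:

    #include <stdio.h>

    int main(void)
    {
        /* "a", then the pound sign U+00A3 (two bytes in UTF-8),
           then the CJK ideograph U+4E2D (three bytes in UTF-8). */
        const unsigned char s[] = "a\xC2\xA3\xE4\xB8\xAD";

        for (int i = 0; s[i] != '\0'; i++)
            printf("byte %d: 0x%02X %s\n", i, s[i],
                   s[i] < 128 ? "(plain ASCII)" : "(part of a multi-byte character)");
        return 0;
    }

Only the first byte is below 128; everything above that range belongs to a multi-byte sequence, which is exactly what keeps UTF-8 backward-compatible with plain ASCII.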