Miscellaneous - Data Units, Binary, and Character Encoding
🧠 Data Units, Binary, and Character Encoding
A structured guide to CPU data units, binary representation, and character encoding,
focused on how computers actually store and process data.
1️⃣ Word - CPU Processing Unit
What is a Word?
A word is the natural data size a CPU can process in one operation.
It is tied to:
- Register width
- ALU operation size
- CPU architecture (16-bit, 32-bit, 64-bit)
Common Data Units
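| Unit | Meaning | Size with a 32-bit word |
|---|---|---|
| Half Word | Half of a word | 16 bits |
| Word | CPU-native size | 32 bits |
| Double Word | Twice a word | 64 bits |
| Quad Word | Twice a double word | 128 bits (SIMD registers) |
These names are relative to the architecture. The sizes above follow the 32-bit-word convention used by ARM, among others. x86 and the Windows API keep the historical 16-bit word, so there WORD = 16 bits, DWORD = 32 bits, and QWORD = 64 bits.
👉 On modern desktop and server CPUs (x86-64, AArch64), the native register width is 64 bits.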
Why Word Size Matters
- Determines integer & pointer size
- Affects memory alignment
- Impacts CPU performance
Misaligned memory access can slow execution, and some architectures raise a fault instead of completing the access.
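To see word size and alignment on a real machine, here is a minimal sketch using Python's standard `struct` module. The variable names are illustrative, and the exact native size depends on the platform's C ABI.

```python
import struct

# "P" is struct's format code for a C pointer (void *), so this reports
# the pointer width of the running interpreter.
pointer_bits = struct.calcsize("P") * 8

# Layout: a 1-byte char followed by a 4-byte int.
# "=" uses standard sizes with no padding.
# "@" uses native sizes and native alignment, so padding may be inserted.
packed_size = struct.calcsize("=ci")  # 1 + 4 = 5 bytes
native_size = struct.calcsize("@ci")  # usually 8: padding aligns the int

print(f"pointer width: {pointer_bits} bits")
print(f"packed: {packed_size} bytes, native-aligned: {native_size} bytes")
```

On most 64-bit systems the native layout is larger than the packed one, because the compiler-style padding keeps the `int` on a 4-byte boundary.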
2️⃣ Binary vs Decimal - Number Representation
Binary (Base-2)
- Uses 0 and 1
- Native format for hardware logic
Example:
Binary: 10000₂
Decimal: 16₁₀
Decimal (Base-10)
- Human-friendly system
- Converted to binary internally
👉 CPUs always operate on binary.
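The conversion between the two bases can be checked in a few lines of Python. This is only an illustration of the example value above (10000₂ = 16₁₀), using built-in functions.

```python
# Parse a string of base-2 digits into an integer.
value = int("10000", 2)

# Convert an integer back to binary text.
binary_text = bin(16)        # includes the "0b" prefix
padded = format(16, "08b")   # fixed width of 8 bits, no prefix

print(value)        # 16
print(binary_text)  # 0b10000
print(padded)       # 00010000
```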
3️⃣ Signed vs Unsigned Integers
| Type | Range (8-bit Example) |
|---|---|
| Unsigned | 0 to 255 |
| Signed (Two's Complement) | -128 to 127 |
Two's complement allows efficient negative number arithmetic.
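A small sketch of how the same 8 bits are read as signed or unsigned. The helper names `to_twos_complement` and `from_twos_complement` are hypothetical, written for this illustration only.

```python
def to_twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's-complement bit pattern of value, `bits` digits wide."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} does not fit in {bits} signed bits")
    # Masking maps negative values onto their unsigned bit pattern.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Interpret a string of bits as a signed two's-complement integer."""
    bits = len(pattern)
    raw = int(pattern, 2)
    # If the top bit is set, the value is negative: subtract 2**bits.
    return raw - (1 << bits) if raw & (1 << (bits - 1)) else raw


print(to_twos_complement(-1))            # 11111111
print(to_twos_complement(-128))          # 10000000
print(to_twos_complement(127))           # 01111111
print(from_twos_complement("11111111"))  # -1

# The same byte, read two ways.
print(int.from_bytes(b"\xff", "big", signed=False))  # 255
print(int.from_bytes(b"\xff", "big", signed=True))   # -1
```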
4️⃣ Character Sets - Mapping Characters to Numbers
A character set maps text characters to numeric code points.
Example:
'A' → 65
It defines:
- Supported characters
- Numeric values assigned to them
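In Python the mapping can be queried directly with the built-ins `ord` and `chr`. The `mapping` dictionary is just an illustrative name.

```python
# Character to number, and back again.
code_point = ord("A")
character = chr(65)

# A tiny slice of the character set as a lookup table.
mapping = {ch: ord(ch) for ch in "ABC"}

print(code_point)  # 65
print(character)   # A
print(mapping)     # {'A': 65, 'B': 66, 'C': 67}
```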
5️⃣ Encoding vs Decoding
Encoding
Characters → Bytes
Decoding
Bytes → Characters
Text → Encode → Bytes → Decode → Text
Decoding bytes with a different encoding than the one used to encode them produces mojibake (garbled text).
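The round trip, and what happens when the decoder guesses wrong, can be sketched as follows. The sample word and the choice of Latin-1 as the wrong encoding are assumptions made for the illustration.

```python
text = "café"

# Encode: characters to bytes.
data = text.encode("utf-8")

# Decode with the matching encoding: the text survives.
round_trip = data.decode("utf-8")

# Decode with the wrong encoding: each UTF-8 byte is read as its own character.
garbled = data.decode("latin-1")

print(data)        # b'caf\xc3\xa9'
print(round_trip)  # café
print(garbled)     # cafÃ©
```

The two bytes `C3 A9` that encode "é" in UTF-8 are valid Latin-1 characters on their own, which is why the wrong decoder fails silently instead of raising an error.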
6️⃣ ASCII - The First Standard
- 7-bit code points (0 to 127)
- Supports English letters, digits, symbols
Structure
- 7 bits = character
- 1 bit = optional parity bit
Example:
'A' = 65 = 1000001₂
Limitations: ❌ No multilingual support
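The 7-bit layout and the optional parity bit can be illustrated like this. The helper `with_even_parity` is hypothetical, and placing the parity bit in the most significant position with even parity is an assumption; real links used various parity conventions.

```python
def with_even_parity(ch: str) -> int:
    """Pack a 7-bit ASCII code into a byte, using bit 7 as an even-parity bit."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    # Parity bit is 1 when the 7 data bits contain an odd number of ones.
    parity = bin(code).count("1") % 2
    return (parity << 7) | code


ascii_bits = format(ord("A"), "07b")
print(ascii_bits)                            # 1000001
print(format(with_even_parity("A"), "08b"))  # 01000001 (two ones, parity 0)
print(format(with_even_parity("C"), "08b"))  # 11000011 (three ones, parity 1)

# Characters outside the 0 to 127 range cannot be encoded as ASCII at all.
try:
    "한".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)  # True
```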
7️⃣ Unicode - Universal Character Set
Unicode aims to assign a unique code point to every character in the world's writing systems.
Examples:
'A' → U+0041
'한' → U+D55C
'😀' → U+1F600
Unicode defines characters and their code points, not how they are stored as bytes.
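The `U+XXXX` notation is simply the code point written in hexadecimal. A minimal sketch, with `code_point_label` as a hypothetical helper name:

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    # At least four hex digits, upper case, as in the Unicode charts.
    return f"U+{ord(ch):04X}"


for ch in ["A", "한", "😀"]:
    print(ch, code_point_label(ch))
# A U+0041
# 한 U+D55C
# 😀 U+1F600
```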
8️⃣ Unicode Encodings - UTF-8, UTF-16, UTF-32
Unicode must be encoded into bytes.
UTF-8
- 1 to 4 bytes per code point
- ASCII compatible
- Web & Linux standard
UTF-16
- 2 or 4 bytes per code point (4 bytes when a surrogate pair is needed)
- Efficient for East Asian text
- Used in Windows, Java
UTF-32
- 4 bytes fixed
- Simple indexing
- Memory heavy
Encoding Comparison
| Encoding | Bytes per Code Point | Pros | Cons |
|---|---|---|---|
| UTF-8 | 1 to 4 | Efficient, standard | Variable length |
| UTF-16 | 2 or 4 | Good for CJK | Surrogate pairs |
| UTF-32 | 4 | Simple | High memory use |
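The size differences can be measured directly. This sketch encodes three sample characters and counts the bytes; the `-le` codec variants are used so that no byte order mark is added to the output, and the names `samples`, `encodings`, and `sizes` are illustrative.

```python
samples = ["A", "한", "😀"]
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# Byte length of each character under each encoding.
sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, row in sizes.items():
    print(ch, row)
# A {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# 한 {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# 😀 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

The Hangul syllable shows why UTF-16 can be more compact for CJK text (2 bytes against 3), while the emoji shows the surrogate-pair case where UTF-16 needs 4 bytes.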
9️⃣ How CPUs See Text
CPUs do not understand characters, only binary numbers.
Processing flow:
Characters → Code Points → Encoded Bytes → Binary → CPU
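Each stage of that flow can be made visible for a short string. The sample text "Hi" and the choice of UTF-8 are assumptions made for the illustration.

```python
text = "Hi"

# Stage 1: characters to code points.
code_points = [ord(ch) for ch in text]

# Stage 2: code points to encoded bytes.
encoded = text.encode("utf-8")

# Stage 3: bytes as the bit patterns the hardware actually stores.
bits = " ".join(format(byte, "08b") for byte in encoded)

print(code_points)    # [72, 105]
print(list(encoded))  # [72, 105]
print(bits)           # 01001000 01101001
```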
🔟 Endianness - Byte Order
| Type | Order |
|---|---|
| Little Endian | Least significant byte first |
| Big Endian | Most significant byte first |
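Important for networking and file formats: Internet protocols such as TCP/IP use big-endian ("network byte order"), while x86 and most ARM systems are little-endian.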
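Byte order becomes concrete when the same 32-bit integer is serialized both ways. A minimal sketch using the standard `struct` module, with an arbitrary sample value:

```python
import struct
import sys

value = 0x12345678

little = struct.pack("<I", value)   # least significant byte first
big = struct.pack(">I", value)      # most significant byte first
network = struct.pack("!I", value)  # network byte order, same as big-endian

print(little.hex())   # 78563412
print(big.hex())      # 12345678
print(network == big) # True

# Reading the bytes back requires knowing which order was used.
print(hex(int.from_bytes(little, "little")))  # 0x12345678
print(hex(int.from_bytes(little, "big")))     # 0x78563412

# Byte order of the machine running this script: "little" or "big".
host_order = sys.byteorder
print(host_order)
```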
🎯 Developer Takeaways
✔ Word size impacts memory & speed
✔ Binary is the CPU's native format
✔ Unicode is required for global text
✔ UTF-8 is the modern default
✔ Encoding mismatch breaks text
✔ Endianness matters in low-level programming