Miscellaneous - Data Units, Binary, and Character Encoding

🧠 Data Units, Binary, and Character Encoding

A structured guide to CPU data units, binary representation, and character encoding,
focused on how computers actually store and process data.


1️⃣ Word - CPU Processing Unit

What is a Word?

A word is the natural data size a CPU can process in one operation.

It is tied to:

  • Register width
  • ALU operation size
  • CPU architecture (16-bit, 32-bit, 64-bit)

Common Data Units

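| Unit | Meaning | Typical Size |
|---|---|---|
| Half Word | Half of a word | 16 bits |
| Word | CPU-native size | 32 bits (64 bits on some architectures) |
| Double Word | Twice a word | 64 bits |
| Quad Word | Twice a double word | 128 bits (SIMD) |

These sizes assume a 32-bit word, as in ARM and MIPS documentation. x86 and the Windows API keep the historical 16-bit word, so there WORD = 16 bits, DWORD = 32 bits, and QWORD = 64 bits.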

📌 On modern 64-bit systems, the native register and pointer width is 64 bits.


Why Word Size Matters

  • Determines pointer size and the widest integer handled in one instruction
  • Affects memory alignment
  • Impacts CPU performance

Misaligned memory access can slow execution, and on some architectures it raises a fault.
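As a rough illustration of word size and alignment, the sketch below uses Python's standard `struct` module to report the pointer width of the running interpreter and to compare a packed layout with a natively aligned one. The variable names are illustrative, and the aligned size depends on the platform's C ABI.

```python
import struct

# Pointer width of the running interpreter, in bits ("P" is a C void *).
pointer_bits = struct.calcsize("P") * 8

# Layout: a 1-byte char followed by a 4-byte int.
# "=" packs the fields back to back with no padding.
packed_size = struct.calcsize("=bi")
# "@" uses native alignment, so padding may be inserted before the int.
aligned_size = struct.calcsize("@bi")

print(f"pointer width: {pointer_bits} bits")
print(f"packed: {packed_size} bytes, aligned: {aligned_size} bytes")
```

On typical desktop platforms the aligned layout is larger than the packed one, because the compiler places the int on a 4-byte boundary.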


2️⃣ Binary vs Decimal - Number Representation

Binary (Base-2)

  • Uses 0 and 1
  • Native format for hardware logic

Example:

Binary: 10000₂
Decimal: 16₁₀

Decimal (Base-10)

  • Human-friendly system
  • Converted to binary internally

📌 CPUs always operate on binary.
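The conversion between the two bases can be checked with Python built-ins. This is a small sketch; the variable names are illustrative.

```python
# Parse a base-2 string into an integer.
value = int("10000", 2)

# Render an integer as a base-2 string.
bits = format(16, "b")

# The same conversion by hand: each digit times its power of two.
expanded = sum(int(d) * 2**i for i, d in enumerate(reversed("10000")))

print(value, bits, expanded)
```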


3️⃣ Signed vs Unsigned Integers

| Type | Range (8-bit Example) |
|---|---|
| Unsigned | 0 to 255 |
| Signed (Two’s Complement) | -128 to 127 |

Two’s complement lets the same adder circuit handle signed and unsigned values, so negative-number arithmetic needs no special hardware.
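A minimal sketch of the 8-bit ranges above: the two helper functions below (names are hypothetical, not from any library) convert between a signed integer and its two's complement bit pattern.

```python
def to_twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's complement bit pattern of value as a string."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {bits} signed bits")
    # Masking maps negative values onto the upper half of the unsigned range.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Interpret a bit string as a signed two's complement integer."""
    bits = len(pattern)
    raw = int(pattern, 2)
    # A set top bit means the value is negative.
    return raw - (1 << bits) if raw >= (1 << (bits - 1)) else raw


print(to_twos_complement(-1), from_twos_complement("10000000"))
```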


4️⃣ Character Sets - Mapping Characters to Numbers

A character set maps text characters to numeric code points.

Example:

'A' → 65

It defines:

  • Supported characters
  • Numeric values assigned to them
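In Python, the built-ins `ord` and `chr` expose this mapping directly. A short sketch, with illustrative variable names:

```python
# Character to number.
code_point = ord("A")

# Number back to character.
character = chr(65)

# A tiny slice of the mapping as a lookup table.
table = {ch: ord(ch) for ch in "ABC"}

print(code_point, character, table)
```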

5️⃣ Encoding vs Decoding

Encoding

Characters → Bytes

Decoding

Bytes → Characters

Text → Encode → Bytes → Decode → Text

Decoding bytes with a different encoding than the one used to write them causes mojibake (garbled text).
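The round trip and the mismatch case can be reproduced in a few lines. This sketch assumes the text was written as UTF-8 and then wrongly read as Latin-1, which is one common way mojibake arises.

```python
text = "é"

# Encode: characters to bytes.
data = text.encode("utf-8")

# Decode with the matching encoding: the text survives.
round_trip = data.decode("utf-8")

# Decode with the wrong encoding: each byte becomes its own character.
garbled = data.decode("latin-1")

# Reversing the wrong step recovers the original, as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")

print(data, round_trip, garbled, repaired)
```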


6️⃣ ASCII - The First Standard

  • 7-bit code points (0–127)
  • Supports English letters, digits, symbols

Structure

  • 7 bits = character
  • 1 bit = optional parity bit

Example:

'A' = 65 = 1000001₂

Limitations: ❌ No multilingual support
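The 7-bit layout, the optional parity bit, and the multilingual limitation can be sketched as follows. The helpers `even_parity_bit` and `fits_in_ascii` are hypothetical names, and even parity is assumed here; odd parity was also used in practice.

```python
def even_parity_bit(ch: str) -> int:
    """Parity bit that makes the total number of 1 bits even (assumed scheme)."""
    return bin(ord(ch)).count("1") % 2


def fits_in_ascii(text: str) -> bool:
    """Report whether text can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# 'A' as a 7-bit pattern.
bits = format(ord("A"), "07b")

print(bits, even_parity_bit("A"), fits_in_ascii("A"), fits_in_ascii("é"))
```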


7️⃣ Unicode - Universal Character Set

Unicode aims to assign a unique code point to every character in every writing system, within the range U+0000 to U+10FFFF.

Examples:

'A' → U+0041
'한' → U+D55C
'😀' → U+1F600

Unicode defines characters, not storage format.
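The code points listed above can be printed in the usual U+XXXX notation with a small helper. `code_point_label` is a hypothetical name used only for this sketch.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point as U+XXXX (at least 4 hex digits)."""
    return f"U+{ord(ch):04X}"


# Iterating a Python str yields one code point at a time.
labels = [code_point_label(ch) for ch in "A한😀"]

print(labels)
```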


8️⃣ Unicode Encodings - UTF-8, UTF-16, UTF-32

Unicode must be encoded into bytes.

UTF-8

  • 1–4 bytes
  • ASCII compatible
  • Web & Linux standard

UTF-16

  • 2 or 4 bytes
  • Efficient for East Asian text (2 bytes for most CJK characters, versus 3 in UTF-8)
  • Used in Windows, Java

UTF-32

  • 4 bytes fixed
  • Simple indexing
  • Memory heavy

Encoding Comparison

| Encoding | Bytes per Character | Pros | Cons |
|---|---|---|---|
| UTF-8 | 1–4 | Efficient, standard | Variable length |
| UTF-16 | 2 or 4 | Good for CJK | Surrogate pairs |
| UTF-32 | 4 | Simple | High memory use |
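The byte counts in the comparison can be measured directly. In this sketch, `encoded_sizes` is a hypothetical helper, and the `-le` codec variants are used so that Python does not prepend a byte order mark to the output.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the encoded length in bytes of ch under each UTF encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


for ch in "A한😀":
    print(ch, encoded_sizes(ch))
```

The emoji takes 4 bytes in UTF-16 because it lies outside the Basic Multilingual Plane and needs a surrogate pair.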

9️⃣ How CPUs See Text

CPUs do not understand characters, only binary numbers.

Processing flow:

Characters → Code Points → Encoded Bytes → Binary → CPU
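The flow above can be traced step by step for a short string. This is a sketch assuming UTF-8 as the encoding; the variable names are illustrative.

```python
text = "Hi"

# Characters to code points.
code_points = [ord(ch) for ch in text]

# Code points to encoded bytes (UTF-8 assumed).
encoded = text.encode("utf-8")

# Bytes to the bit patterns the hardware actually stores.
bits = [format(byte, "08b") for byte in encoded]

print(code_points, encoded, bits)
```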

🔟 Endianness - Byte Order

| Type | Order |
|---|---|
| Little Endian | Least significant byte first |
| Big Endian | Most significant byte first |

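Byte order matters for networking and file formats: most network protocols use big-endian ("network byte order"), while x86 and most ARM systems run little-endian.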
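The two byte orders can be compared on a single 32-bit value. A small sketch using only the standard library; the sample value 0x12345678 is arbitrary.

```python
import struct
import sys

value = 0x12345678

# Least significant byte first.
little = value.to_bytes(4, "little")

# Most significant byte first.
big = value.to_bytes(4, "big")

# "!" in struct means network byte order, which is big-endian.
network = struct.pack("!I", value)

print(little.hex(), big.hex(), network.hex())
print("this machine is", sys.byteorder, "endian")
```

Reading the bytes back requires the same order that wrote them: `int.from_bytes(little, "little")` recovers the original value, while `int.from_bytes(little, "big")` does not.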


🎯 Developer Takeaways

✔ Word size impacts memory & speed
✔ Binary is the CPU’s native format
✔ Unicode is required for global text
✔ UTF-8 is the modern default
✔ Encoding mismatch breaks text
✔ Endianness matters in low-level programming

This post is licensed under CC BY 4.0 by the author.