07. DevOps about Operate


Prerequisites


1. What is Operate in DevOps?

The Operate phase is where a deployed system is continuously run, managed, and maintained in production to ensure stability, performance, and reliability.

Deployment is not the end. Operate covers everything that happens after the system goes live.

2. Why Operate Matters

A system can:

  • build successfully
  • pass all tests
  • deploy correctly

and still fail in production.

❌ Without proper operation

  • system crashes remain unnoticed
  • performance gradually degrades
  • memory leaks accumulate
  • queues overflow silently
  • users experience failures

✔ With proper operation

  • issues are detected early
  • performance remains stable
  • failures are handled quickly
  • system runs continuously (24/7)

3. Goals of the Operate Phase

The Operate phase ensures:

  1. the system keeps running without interruption
  2. performance remains within target limits
  3. failures are detected and handled quickly
  4. resources are used efficiently
  5. the system can recover when issues occur

4. What Happens During Operation?

4-1. Process Management

The system must:

  • start correctly
  • restart automatically if it crashes
  • run continuously

Example:

Process starts → runs loop → crash → auto-restart

Tools (examples):

  • systemd (Linux)
  • Docker restart policies
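
The start → crash → auto-restart cycle can be sketched in a few lines. This is only an illustration of the idea, not a replacement for systemd or Docker: the names `supervise` and `run_command`, the restart budget, and the backoff value are all assumptions made for the example.

```python
import subprocess
import time
from typing import Callable, List


def run_command(command: List[str]) -> int:
    """Run the command once, wait for it to finish, and return its exit code."""
    return subprocess.run(command).returncode


def supervise(run_once: Callable[[], int], max_restarts: int = 3,
              backoff_seconds: float = 0.0) -> int:
    """Call run_once until it exits cleanly or the restart budget is spent.

    Returns the number of restarts that were performed.
    """
    restarts = 0
    while True:
        exit_code = run_once()
        if exit_code == 0:
            return restarts          # clean exit: nothing left to restart
        if restarts >= max_restarts:
            return restarts          # budget spent; a real supervisor would alert here
        restarts += 1
        time.sleep(backoff_seconds)  # short pause so a crash loop does not spin the CPU
```

A call such as `supervise(lambda: run_command(["python", "worker.py"]), backoff_seconds=1.0)` would keep a hypothetical `worker.py` alive across crashes. In production the same behavior normally comes from the service manager itself, for example `Restart=on-failure` in a systemd unit or a Docker restart policy such as `on-failure`.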

4-2. Runtime Monitoring (Basic Level)

Even before a full monitoring system is in place, the application should:

  • log important events
  • report errors
  • expose basic metrics

Example logs:

[INFO] Frame processed in 5ms
[WARN] Queue size increasing
[ERROR] Frame dropped
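
Log lines in this shape can be produced with Python's standard `logging` module. The sketch below is a minimal setup under assumptions: the logger name `pipeline`, the helper names, and the queue warning threshold are invented for illustration. Python spells the warning level `WARNING` rather than `WARN`.

```python
import logging


def make_logger(stream) -> logging.Logger:
    """Build a logger that writes '[LEVEL] message' lines to the given stream."""
    logger = logging.getLogger("pipeline")   # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()                  # avoid duplicate handlers if called twice
    logger.propagate = False                 # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))
    logger.addHandler(handler)
    return logger


def report_frame(logger: logging.Logger, elapsed_ms: float, queue_size: int,
                 queue_warn_at: int = 100) -> None:
    """Log one frame's timing, plus a warning when the queue looks too long."""
    logger.info("Frame processed in %.0fms", elapsed_ms)
    if queue_size >= queue_warn_at:
        logger.warning("Queue size increasing (%d queued)", queue_size)
```

Passing `sys.stderr` as the stream gives console output. Passing an open file gives a log file that a collector can pick up later.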

4-3. Resource Management

The system must not exhaust resources.

Monitor:

  • CPU usage
  • memory usage
  • thread count
  • queue size

❌ Example problem

Queue size keeps increasing → memory grows → crash
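
A steadily growing queue can be caught long before memory runs out by watching the trend rather than a single value. The detector below is a rough sketch: the class name and the "strictly rising over the last N samples" rule are assumptions chosen for simplicity, and a real system might prefer a slope or a moving average.

```python
from collections import deque


class QueueGrowthDetector:
    """Flags a queue whose size rose on every step of the last `window` samples."""

    def __init__(self, window: int = 5):
        if window < 2:
            raise ValueError("window must be at least 2 to detect a trend")
        self._samples = deque(maxlen=window)   # keeps only the most recent samples

    def add_sample(self, queue_size: int) -> bool:
        """Record a sample; return True when the whole window is strictly rising."""
        self._samples.append(queue_size)
        if len(self._samples) < self._samples.maxlen:
            return False                       # not enough history yet
        values = list(self._samples)
        return all(a < b for a, b in zip(values, values[1:]))
```

Sampling the queue size once per second and feeding it to `add_sample` is enough to turn the failure chain above into an early warning instead of a crash.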

4-4. Failure Handling

Failures will happen.

The system must:

  • handle errors gracefully
  • avoid crashing when possible
  • recover automatically

Example:

Frame processing fails → skip frame → continue

Never stop the whole system for one bad frame
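
The skip-and-continue rule amounts to catching errors per frame instead of around the whole loop. A minimal sketch, assuming a hypothetical `process_frame` callable supplied by the caller and a logger named `pipeline`:

```python
import logging

logger = logging.getLogger("pipeline")  # hypothetical logger name


def process_stream(frames, process_frame):
    """Process every frame; log and skip the ones that raise.

    Returns (results, skipped_count).
    """
    results = []
    skipped = 0
    for index, frame in enumerate(frames):
        try:
            results.append(process_frame(frame))
        except Exception:
            # One bad frame must not take the loop down: record it and move on.
            skipped += 1
            logger.exception("Frame %d failed; skipping", index)
    return results, skipped
```

Counting the skipped frames matters as much as skipping them. A rising skip count is a signal for the monitoring side, while a silent `except` would hide the problem.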

4-5. Continuous Operation (24/7)

Unlike test environments, production systems:

  • run indefinitely
  • must handle long-term stability

This connects directly to:

  • soak testing
  • memory leak prevention
  • resource cleanup
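
A soak test can be approximated by running one processing step many times and checking whether memory keeps growing. The sketch below uses Python's standard `tracemalloc` module. The helper name `memory_growth` is an assumption, and `tracemalloc` only sees allocations made through Python, so native code such as a C++ stage needs a separate tool.

```python
import tracemalloc


def memory_growth(step, iterations: int) -> int:
    """Run step() repeatedly and return the change in traced memory, in bytes."""
    step()                                   # warm-up: exclude one-time allocations
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        step()
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return after - before


# A leaky step keeps every buffer alive; a clean step lets it be freed.
retained = []
leaky_growth = memory_growth(lambda: retained.append(bytearray(1024)), 1000)
clean_growth = memory_growth(lambda: bytearray(1024), 1000)
```

In this example `leaky_growth` ends up far larger than `clean_growth`. Growth that scales with the iteration count is the pattern that turns into a crash after hours or days.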

4-6. Key Concepts

✔ Idempotency (important)

Running the same operation multiple times should not break the system.
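
Put more precisely, repeating an idempotent operation leaves the system in the same state as performing it once. This matters in operation because restarts and retries often replay work. A small sketch, assuming a hypothetical `ResultStore` keyed by frame ID:

```python
class ResultStore:
    """Stores one result per frame ID; recording the same ID again is a no-op."""

    def __init__(self):
        self._results = {}

    def record(self, frame_id: int, result) -> bool:
        """Store the result once. Returns False if the frame was already recorded."""
        if frame_id in self._results:
            return False                 # replay after a restart: state is unchanged
        self._results[frame_id] = result
        return True

    def get(self, frame_id: int):
        return self._results[frame_id]

    def __len__(self) -> int:
        return len(self._results)
```

With this shape, a stage that restarts and reprocesses its last few frames cannot produce duplicate output.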

✔ Fault tolerance

The system should continue running even when parts fail.

✔ Backpressure

If input arrives faster than it can be processed, relieve the pressure in one of these ways:

  • slow down input
  • drop frames
  • limit queue size
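
Two of these options, limiting the queue size and dropping frames, can be combined with a bounded `queue.Queue` from the standard library. The sketch below drops the newest frame when the queue is full. The helper name `offer` and the capacity of 2 are assumptions for the example.

```python
import queue


def offer(frame_queue: queue.Queue, frame) -> bool:
    """Try to enqueue a frame without blocking; return False if it was dropped."""
    try:
        frame_queue.put_nowait(frame)
        return True
    except queue.Full:
        return False          # queue is at its limit: drop the newest frame


frame_queue = queue.Queue(maxsize=2)     # hard cap on buffered frames
accepted = [offer(frame_queue, n) for n in range(4)]
```

Here the first two frames are accepted and the next two are dropped, so memory use stays bounded no matter how fast the producer runs. Calling `put` with a timeout instead of `put_nowait` would implement the "slow down input" option by making the producer wait.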

5. Real Example: Image Pipeline Operation

Camera
   ↓
Frame Queue
   ↓
Preprocessor (C++)
   ↓
Output

During operation, you must ensure:

  • queue does not overflow
  • processing stays within latency limits
  • system recovers from temporary failures
  • CPU usage remains stable
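
These four conditions can be written down as explicit thresholds and checked on every monitoring tick. Everything specific in the sketch is an assumption: the field names, the 90% queue threshold, the 33 ms latency budget (roughly one frame at 30 fps), the 80% CPU limit, and the failure count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineStats:
    queue_size: int
    queue_capacity: int
    latency_ms: float
    cpu_percent: float
    consecutive_failures: int


def find_violations(stats: PipelineStats, max_latency_ms: float = 33.0,
                    max_cpu_percent: float = 80.0,
                    max_failures: int = 3) -> List[str]:
    """Return a description of every operating condition that is out of bounds."""
    violations = []
    if stats.queue_size >= 0.9 * stats.queue_capacity:
        violations.append("queue near capacity")
    if stats.latency_ms > max_latency_ms:
        violations.append("latency over limit")
    if stats.consecutive_failures > max_failures:
        violations.append("not recovering from failures")
    if stats.cpu_percent > max_cpu_percent:
        violations.append("CPU usage too high")
    return violations
```

An empty list means the pipeline is within its operating envelope. Any entry is a reason to log a warning or trigger a recovery action.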

6. Common Problems in Operation

❌ Memory leaks

→ system crashes after hours or days

❌ Performance drift

→ latency increases over time

❌ Deadlocks

→ system freezes

❌ Resource exhaustion

→ no memory / threads left

❌ Silent failures

→ system runs but produces wrong output

7. Restart & Recovery Strategy

Operation must include recovery.

Example strategies

  • auto-restart process
  • watchdog monitoring
  • fallback modes
  • restart pipeline stage only

Crash → restart → resume processing

No manual intervention required
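
Auto-restart only covers crashes. A frozen process, for example one stuck in a deadlock, never exits, which is where watchdog monitoring comes in. A minimal heartbeat-based sketch follows. The class name, the timeout, and the injectable clock are assumptions for illustration.

```python
import time


class Watchdog:
    """Reports a stall when no heartbeat has arrived within the timeout."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock              # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the worker each time it completes a unit of work."""
        self._last_beat = self._clock()

    def is_stalled(self) -> bool:
        """True when the worker has been silent for longer than the timeout."""
        return self._clock() - self._last_beat > self._timeout
```

The processing loop calls `beat()` after each frame, and a separate supervisor thread or process polls `is_stalled()` and restarts the stage when it returns True.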

8. Automation in Operation

Manual operation is slow and error-prone. Automation should handle:

  • process restart
  • log collection
  • health checks
  • scaling (if needed)

Example:

Container crashes → auto-restart
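
Health checks are what let automation decide when a restart is needed. The sketch below runs a set of named checks and reports which ones failed. The function name and the idea of passing checks as callables are assumptions, not a specific tool's API.

```python
from typing import Callable, Dict, List, Tuple


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every check; return (healthy, names_of_failed_checks)."""
    failed = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False               # a check that raises counts as a failure
        if not passed:
            failed.append(name)
    return (not failed, failed)
```

A small script that calls this function and exits with status 0 when healthy and 1 otherwise can serve as a container health check command, since Docker's `HEALTHCHECK` treats exit code 0 as healthy and 1 as unhealthy.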

This post is licensed under CC BY 4.0 by the author.