07. DevOps about Operate
Prerequisites
1. What is Operate in DevOps?
The Operate phase is where a deployed system is continuously run, managed, and maintained in production to keep it stable, performant, and reliable.
Deployment is not the end: Operate covers everything that happens after the system goes live.
2. Why Operate Matters
A system can:
- build successfully
- pass all tests
- deploy correctly
and still fail in production.
❌ Without proper operation
- system crashes remain unnoticed
- performance gradually degrades
- memory leaks accumulate
- queues overflow silently
- users experience failures
✔ With proper operation
- issues are detected early
- performance remains stable
- failures are handled quickly
- system runs continuously (24/7)
3. Goals of the Operate Phase
The Operate phase ensures:
- the system keeps running without interruption
- performance remains within target limits
- failures are detected and handled quickly
- resources are used efficiently
- the system can recover when issues occur
4. What Happens During Operation?
4-1. Process Management
The system must:
- start correctly
- restart automatically if it crashes
- run continuously
Example
Process starts → runs loop → crash → auto-restart
Tools (examples):
- systemd (Linux)
- Docker restart policies
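The start → crash → auto-restart cycle can be sketched in a few lines. This is only an illustration of what a supervisor such as systemd or a Docker restart policy does for you, not a replacement for them; the names `supervise`, `flaky_worker`, `max_restarts`, and `base_delay` are hypothetical.

```python
import time
from typing import Callable


def supervise(run: Callable[[], None], max_restarts: int = 5,
              base_delay: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `run` until it returns cleanly; restart it after each crash.

    Waits with exponential backoff between restarts and returns the number
    of restarts performed. Re-raises the last error once the restart budget
    is exhausted, so a permanently broken worker does not loop forever.
    """
    restarts = 0
    while True:
        try:
            run()
            return restarts
        except Exception as exc:
            if restarts >= max_restarts:
                raise
            delay = base_delay * (2 ** restarts)
            print(f"[ERROR] worker crashed: {exc!r}; restarting in {delay:.1f}s")
            sleep(delay)
            restarts += 1


# Demo: a hypothetical worker that crashes twice, then runs cleanly.
attempts = {"n": 0}


def flaky_worker() -> None:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated crash")


delays = []  # record the backoff delays instead of actually sleeping
restarts = supervise(flaky_worker, sleep=delays.append)
print(restarts, delays)  # 2 [1.0, 2.0]
```

The backoff and the restart budget are design choices: without them, a worker that crashes immediately on startup would be restarted in a tight loop.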
4-2. Runtime Monitoring (Basic Level)
Even before a full monitoring system is in place, the application should:
- log important events
- report errors
- expose basic metrics
Example logs:
[INFO] Frame processed in 5ms
[WARN] Queue size increasing
[ERROR] Frame dropped
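Log lines in this shape can be produced with Python's standard `logging` module. The sketch below is one possible setup, assuming a logger named `pipeline` and a plain counter for basic metrics; `make_logger` and the metric names are hypothetical.

```python
import io
import logging
import sys
from collections import Counter


def make_logger(stream=sys.stderr) -> logging.Logger:
    """Return a logger that writes lines such as '[INFO] message'."""
    # The standard level name is "WARNING"; shorten it to match the sample logs.
    logging.addLevelName(logging.WARNING, "WARN")
    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("[%(levelname)s] %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


# Demo: write to an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
log = make_logger(buffer)
metrics = Counter()  # minimal in-process metrics

log.info("Frame processed in %dms", 5)
metrics["frames_processed"] += 1
log.warning("Queue size increasing")
log.error("Frame dropped")
metrics["frames_dropped"] += 1

print(buffer.getvalue(), end="")
print(dict(metrics))
```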
4-3. Resource Management
The system must not exhaust resources.
Monitor:
- CPU usage
- memory usage
- thread count
- queue size
❌ Example problem
Queue size keeps increasing → memory grows → crash
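A periodic snapshot of these four values is often enough to catch the problem above before it becomes a crash. The following is a minimal sketch using only the standard library; `resource_snapshot`, `over_limits`, and the limit values are assumptions for illustration, and `tracemalloc` only counts memory allocated by Python itself, not by native code.

```python
import queue
import threading
import time
import tracemalloc

tracemalloc.start()  # begin tracking Python-level allocations


def resource_snapshot(frame_queue: queue.Queue) -> dict:
    """Collect basic resource figures for the current process."""
    current_bytes, _peak_bytes = tracemalloc.get_traced_memory()
    return {
        "cpu_seconds": time.process_time(),   # CPU time used so far
        "traced_memory_bytes": current_bytes,
        "thread_count": threading.active_count(),
        "queue_size": frame_queue.qsize(),
    }


def over_limits(snapshot: dict, limits: dict) -> list:
    """Return the names of all metrics that exceed their limit."""
    return [name for name, limit in limits.items() if snapshot[name] > limit]


# Demo: three queued frames against a (hypothetical) limit of two.
frame_queue = queue.Queue()
for frame in range(3):
    frame_queue.put(frame)

snapshot = resource_snapshot(frame_queue)
breaches = over_limits(snapshot, {"queue_size": 2, "thread_count": 64})
print(breaches)  # ['queue_size']
```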
4-4. Failure Handling
Failures will happen.
The system must:
- handle errors gracefully
- avoid crashing when possible
- recover automatically
Example
Frame processing fails → skip frame → continue
Never stop the whole system because of one bad frame.
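The skip-and-continue rule can be sketched as a loop that isolates each frame in its own `try` block. `process_stream` and `preprocess` are hypothetical names, and the empty-frame check stands in for whatever can fail in a real pipeline.

```python
def process_stream(frames, process_frame, log=print):
    """Process every frame; log and skip the ones that fail.

    Returns a (processed, dropped) tuple so the caller can track drop rates.
    """
    processed, dropped = 0, 0
    for index, frame in enumerate(frames):
        try:
            process_frame(frame)
            processed += 1
        except Exception as exc:
            # One bad frame must not take the whole loop down.
            dropped += 1
            log(f"[ERROR] Frame {index} dropped: {exc}")
    return processed, dropped


def preprocess(frame: bytes) -> None:
    """Stand-in for real processing; rejects empty frames."""
    if not frame:
        raise ValueError("empty frame")


result = process_stream([b"ok", b"", b"ok"], preprocess)
print(result)  # (2, 1)
```

Counting dropped frames matters: a loop that silently skips every frame keeps running but produces nothing useful.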
4-5. Continuous Operation (24/7)
Unlike test environments, production systems:
- run indefinitely
- must handle long-term stability
This connects directly to:
- soak testing
- memory leak prevention
- resource cleanup
4-6. Key Concepts
✔ Idempotency (important)
Running the same operation multiple times should leave the system in the same state as running it once, and should never break it.
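As a small illustration, a start operation can be made idempotent by checking the current state first. The `Stage` class below is a hypothetical pipeline stage, not part of any library.

```python
class Stage:
    """A pipeline stage whose start() and stop() are safe to call repeatedly."""

    def __init__(self) -> None:
        self.running = False
        self.start_count = 0  # how many times resources were really acquired

    def start(self) -> bool:
        """Start the stage; return False if it was already running."""
        if self.running:
            return False
        self.running = True
        self.start_count += 1
        return True

    def stop(self) -> bool:
        """Stop the stage; return False if it was already stopped."""
        if not self.running:
            return False
        self.running = False
        return True


stage = Stage()
first = stage.start()
second = stage.start()  # repeated call: no second start, no error
print(first, second, stage.start_count)  # True False 1
```

This is what makes automated recovery safe: a restart script or watchdog can call `start()` without first knowing whether the stage is already up.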
✔ Fault tolerance
The system should continue running even when parts fail.
✔ Backpressure
If input arrives faster than it can be processed, apply one or more of:
- slow down input
- drop frames
- limit queue size
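Two of these options, limiting the queue size and dropping frames, can be combined in a bounded queue that discards the oldest frame when full. This is a sketch of one possible drop policy, assuming stale frames are less valuable than fresh ones; `offer_frame` is a hypothetical helper.

```python
import queue


def offer_frame(frame_queue: queue.Queue, frame) -> bool:
    """Enqueue `frame`; if the queue is full, discard the oldest frame first.

    Returns True if an older frame had to be dropped to make room.
    """
    dropped = False
    while True:
        try:
            frame_queue.put_nowait(frame)
            return dropped
        except queue.Full:
            try:
                frame_queue.get_nowait()  # drop the oldest frame
                dropped = True
            except queue.Empty:
                pass  # a consumer emptied the queue in the meantime; retry


# Demo: five frames offered to a queue that holds at most three.
frame_queue = queue.Queue(maxsize=3)
drops = sum(offer_frame(frame_queue, frame) for frame in range(5))

remaining = []
while not frame_queue.empty():
    remaining.append(frame_queue.get_nowait())

print(drops, remaining)  # 2 [2, 3, 4]
```

To slow down input instead, the producer can call the blocking `frame_queue.put(frame)`, which waits until the consumer frees a slot.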
5. Real Example: Image Pipeline Operation
Camera
↓
Frame Queue
↓
Preprocessor (C++)
↓
Output
During operation, you must ensure:
- queue does not overflow
- processing stays within latency limits
- system recovers from temporary failures
- CPU usage remains stable
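The pipeline above can be sketched with a bounded queue between a producer and a worker thread. Everything here is a simplified stand-in: `camera` generates integers instead of images, `preprocessor` doubles them instead of calling C++ code, and the queue size of 4 is an arbitrary assumption.

```python
import queue
import threading
import time

SENTINEL = None  # marks the end of the stream


def camera(frame_queue: queue.Queue, frame_count: int) -> None:
    """Producer: put() blocks while the queue is full, which slows the input."""
    for frame in range(frame_count):
        frame_queue.put(frame)
    frame_queue.put(SENTINEL)


def preprocessor(frame_queue: queue.Queue, output: list, latencies_ms: list) -> None:
    """Consumer: process frames until the sentinel arrives, timing each one."""
    while True:
        frame = frame_queue.get()
        if frame is SENTINEL:
            break
        start = time.perf_counter()
        output.append(frame * 2)  # stand-in for the real preprocessing step
        latencies_ms.append((time.perf_counter() - start) * 1000)


frame_queue = queue.Queue(maxsize=4)  # bounded, so it cannot overflow
output, latencies_ms = [], []

worker = threading.Thread(target=preprocessor,
                          args=(frame_queue, output, latencies_ms))
worker.start()
camera(frame_queue, 10)
worker.join(timeout=5)

print(output)
print(f"max latency: {max(latencies_ms):.3f} ms")
```

Recording per-frame latency makes it possible to compare the worst case against a latency limit during operation, instead of discovering drift from user reports.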
6. Common Problems in Operation
❌ Memory leaks
→ system crashes after hours or days
❌ Performance drift
→ latency increases over time
❌ Deadlocks
→ system freezes
❌ Resource exhaustion
→ no memory / threads left
❌ Silent failures
→ system runs but produces wrong output
7. Restart & Recovery Strategy
Operation must include recovery.
Example strategies
- auto-restart process
- watchdog monitoring
- fallback modes
- restart pipeline stage only
Crash → restart → resume processing
No manual intervention required
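Watchdog monitoring usually works through heartbeats: a stage reports that it is alive, and a supervisor restarts it when the reports stop. A minimal sketch follows, assuming a 2-second timeout; the `Watchdog` class is hypothetical, and the clock is injectable so the demo does not need to wait in real time.

```python
import time


class Watchdog:
    """Detects a stalled stage by the absence of heartbeats."""

    def __init__(self, timeout_s: float, clock=time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the monitored stage whenever it makes progress."""
        self._last_beat = self._clock()

    def expired(self) -> bool:
        """True if no heartbeat arrived within the timeout."""
        return self._clock() - self._last_beat > self.timeout_s


# Demo with a fake clock that is advanced by hand.
now = [0.0]
watchdog = Watchdog(timeout_s=2.0, clock=lambda: now[0])

now[0] = 1.0
healthy_at_1s = not watchdog.expired()  # still within the timeout

now[0] = 3.5
stalled_at_3s = watchdog.expired()      # no heartbeat for 3.5 s

watchdog.beat()                         # the stage was restarted and reports in
recovered = not watchdog.expired()

print(healthy_at_1s, stalled_at_3s, recovered)  # True True True
```

A watchdog also catches deadlocks, which a crash-based restart policy misses because a frozen process never exits.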
8. Automation in Operation
Manual operation is risky. Automation should handle:
- process restart
- log collection
- health checks
- scaling (if needed)
Example:
Container crashes → auto-restart
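Automated health checks need something to call. One common shape is a function that runs several small checks and reports an overall status, which an orchestrator or script can then poll. The sketch below is an assumption about how that could look; `health_check` and the check names are hypothetical.

```python
def health_check(checks: dict) -> dict:
    """Run each named check and summarise the result.

    A check passes if it returns a truthy value. A check that raises is
    reported as failed instead of crashing the health check itself.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return {"healthy": all(results.values()), "checks": results}


def broken_check() -> bool:
    """Simulates a check whose dependency is unavailable."""
    raise ConnectionError("sensor unreachable")


queue_size, queue_limit = 3, 100  # example values

status = health_check({
    "queue_below_limit": lambda: queue_size < queue_limit,
    "worker_alive": lambda: True,
    "camera_reachable": broken_check,
})
print(status["healthy"], status["checks"])
```

Reporting each check separately, rather than a single boolean, tells the automation which recovery action to take.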