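A codex crash bug found today: root cause analysis and fix
TL;DR:
codex has kept crashing and exiting over the past few days. After an analysis with Claude Code, the cause turned out to be a bug in codex itself.
Fix:
Back up the codex log database, then clear it. There has been no response from the official team so far.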
Bug Report: codex CLI crashes with zsh: trace trap due to SQLite WAL checkpoint contention
Metadata
| Field | Value |
|---|---|
| Bug title | CLI crashes with SIGTRAP (trace trap) when logs_2.sqlite exceeds ~200MB |
| Product | OpenAI Codex CLI (@openai/codex npm package) |
| Version | 0.141.0 |
| OS | macOS 14.5 (23F79) x86_64 |
| Hardware | MacBookPro16,1 — Intel Core i7-9750H |
| Severity | High — renders CLI unusable after extended usage |
| Date discovered | 2026-06-19 ~ 2026-06-20 (3 separate crash events) |
Summary
The codex CLI binary (Rust native executable spawned by the Node.js wrapper) crashes with SIGTRAP when the logs_2.sqlite database grows beyond ~200MB. The root cause is a Rust panic in a background task triggered by write contention between multiple concurrent threads writing to the same SQLite database under WAL mode, compounded by heavy WAL checkpoint I/O. The panic triggers abort() (because the binary is compiled with panic = "abort"), which on macOS generates SIGTRAP, killing the process and producing zsh: trace trap codex in the terminal.
Three crash events were recorded by macOS as .diag files in /Library/Logs/DiagnosticReports/, all triggered by the system disk writes resource monitor after exceeding 2,147 MB of file-backed memory writes within a single session.
Environment
System
ProductName: macOS
ProductVersion: 14.5
BuildVersion: 23F79
Architecture: x86_64 (Intel)
CPU: Intel Core i7-9750H @ 2.60GHz
Active CPUs: 12
codex installation
- Package: @openai/codex@0.141.0 (installed globally via npm)
- Node.js: v23.11.0 (Homebrew, /usr/local/Cellar/node/23.11.0/)
- Installation path: /usr/local/Cellar/node/23.11.0/lib/node_modules/@openai/codex/
- Architecture: two-layer
  - Wrapper: Node.js ESM script (bin/codex.js), a thin Node.js launcher
  - Native binary: Rust-compiled Mach-O 64-bit executable at node_modules/@openai/codex-darwin-x64/vendor/x86_64-apple-darwin/bin/codex
  - Rust toolchain: 1.95.0
  - Compilation mode: panic = "abort" (confirmed via embedded strings)
- Code signature:
  - Identifier: codex
  - Team: OpenAI OpCo, LLC (2DC432GLL2)
  - Format: Mach-O thin (x86_64), Hardened Runtime enabled
  - Runtime Version: 15.5.0
SQLite databases in ~/.codex/
| File | Size | Description |
|---|---|---|
| logs_2.sqlite | 229,408,768 bytes (~229 MB) | Log entries (92,060 rows) |
| logs_2.sqlite-wal | 16,463,552 bytes (~16 MB) | WAL journal |
| logs_2.sqlite-shm | 32,768 bytes | Shared memory index |
| state_5.sqlite | 1,335,296 bytes | Session/state data |
| state_5.sqlite-wal | 4,152,992 bytes | WAL journal |
| goals_1.sqlite | 24,576 bytes | Goals data |
| memories_1.sqlite | 40,960 bytes | Memories data |
Crash Events — Timeline
Event 1 (earliest, most severe)
| Field | Value |
|---|---|
| Time | 2026-06-19 05:21:06 ~ 06:19:29 (+0800) |
| Duration | 3,502 seconds (~58 minutes) |
| PID | 65952 |
| Path | /Applications/Codex.app/Contents/Resources/codex |
| Samples | 110 (100% in kernel-mode write) |
| Threads | 8 |
| Disk writes | 2,147.52 MB over 3,502s (613 KB/s avg) |
| CPU time | 52.853s |
| Memory | 60.20 MB → 130.43 MB (+70.23 MB) |
| Stack | 67% sqlite3_exec → sqlite3_step → pwrite; 36% inside sqlite3_wal_checkpoint_v2 → pwrite |
Key: WAL checkpoint consumed 36% of all CPU samples — checkpointing a 229MB database was the bottleneck.
Event 2
| Field | Value |
|---|---|
| Time | 2026-06-19 21:20:51 ~ 21:57:14 (+0800) |
| Duration | 2,184 seconds (~36 minutes) |
| PID | 19112 |
| Path | /Applications/Codex.app/Contents/MacOS/Codex (Desktop app) |
| Samples | 44 |
| Threads | 7 |
| Disk writes | 2,147.54 MB over 2,184s (983 KB/s avg) |
| Memory | 96.77 MB → 199.92 MB (+103 MB, max 486 MB) |
| Stack | 93% uv_cancel → uv__fs_post → write (libuv async filesystem) |
Key: Desktop app also triggered the same disk-write limit, suggesting the shared ~/.codex/ state directory was the common factor.
Event 3 (latest)
| Field | Value |
|---|---|
| Time | 2026-06-20 01:31:12 ~ 03:11:21 (+0800) |
| Duration | 6,306 seconds (~105 minutes) |
| PID | 1638 |
| Path | /Users/USER/*/codex (CLI npm binary) |
| Samples | 16 |
| Threads | 9 |
| Disk writes | 2,147.96 MB over 6,306s (340 KB/s avg) |
| Memory | 59.79 MB → 102.76 MB (+42.97 MB) |
| Stack | Multiple concurrent write paths (plugin cache, MCP cache, SQLite logs) |
Key: 9 threads all concurrently writing to the same underlying database file. Three distinct write paths identified.
Root Cause Analysis
1. Unlimited log accumulation
Codex writes all operational logs to ~/.codex/logs_2.sqlite with no rotation, size cap, or retention policy:
$ sqlite3 ~/.codex/logs_2.sqlite "SELECT COUNT(*) FROM logs;"
92060
After extended usage, the database grew to 229 MB with 92,060 log entries, making every write operation pathologically slow.
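The row count and file size quoted above can be gathered in one pass. The following is a minimal sketch that runs against a throwaway database rather than ~/.codex; the `id`/`message` columns are an assumed schema (the report only confirms a table named `logs`), and `db_stats` is a hypothetical helper name.

```python
import os
import sqlite3
import tempfile


def db_stats(path: str, table: str = "logs") -> dict:
    """Return row count and page-level size for one SQLite database file."""
    conn = sqlite3.connect(path)
    try:
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        conn.close()
    size = page_count * page_size
    return {"rows": rows, "bytes": size, "bytes_per_row": size / rows if rows else 0.0}


# Demo on a throwaway database so nothing under ~/.codex is touched.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_logs.sqlite")
demo = sqlite3.connect(demo_path)
demo.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")  # assumed schema
demo.executemany("INSERT INTO logs (message) VALUES (?)", [("entry",)] * 100)
demo.commit()
demo.close()

stats = db_stats(demo_path)
print(stats)

# Figures from the report: 229,408,768 bytes across 92,060 rows.
report_bytes_per_row = 229_408_768 / 92_060
print(round(report_bytes_per_row))  # 2492
```

By the report's own numbers, each log entry costs roughly 2.5 KB of database file on average.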
2. SQLite WAL checkpoint contention
Codex uses SQLite in WAL (Write-Ahead Logging) mode. Under WAL mode:
- All writes go to the WAL journal file first (fast)
- Periodically, SQLite triggers wal_checkpoint to merge the WAL back into the main database file (slow, especially with 229MB of data)
The wal_checkpoint_v2 call appeared in 36% of all sampled frames in Event 1:
sqlite3_step + 642
→ ??? (codex + ...)
→ sqlite3_wal_checkpoint_v2 + 840 ← checkpoint merge back to main DB
→ ??? (codex + ...)
→ pwrite + 10 (libsystem_kernel)
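The WAL-then-checkpoint cycle described above can be observed on a small throwaway database. This is only an illustration of SQLite's behavior through Python's standard `sqlite3` module, not codex's code; the table layout and file name are assumptions.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "wal_demo.sqlite")

conn = sqlite3.connect(db_path)
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
# Disable automatic checkpoints so the WAL visibly accumulates in this demo.
conn.execute("PRAGMA wal_autocheckpoint=0")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")  # assumed schema
conn.executemany(
    "INSERT INTO logs (message) VALUES (?)",
    [("x" * 200,) for _ in range(2000)],
)
conn.commit()

# All committed writes currently live in the -wal file.
wal_before = os.path.getsize(db_path + "-wal")

# Merge the WAL into the main database file and truncate it to zero bytes.
busy, _log_frames, _checkpointed = conn.execute(
    "PRAGMA wal_checkpoint(TRUNCATE)"
).fetchone()
wal_after = os.path.getsize(db_path + "-wal")
conn.close()

print(mode, wal_before > 0, wal_after, busy)
```

The checkpoint is the step that rewrites pages in the main file, which is where the sampled `pwrite` calls in Event 1 were observed.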
3. Multiple concurrent write paths without adequate coordination
From the sampled stacks, at least 3 distinct write paths were executing simultaneously on the same logs_2.sqlite:
| Path | Rust function | Operation |
|---|---|---|
| A: Plugin cache | write_cached_global_directory_plugins | Writes plugin catalog cache |
| B: MCP tools cache | write_cached_codex_apps_tools_if_needed | Writes MCP app tool definitions |
| C: Log writer | sqlite3_exec / sqlite3_step (via sqlx) | Inserts log entries into the logs table |
All three paths ultimately call pwrite() on the same file descriptor for logs_2.sqlite.
4. SQLITE_BUSY → Rust panic → abort → SIGTRAP
The complete call chain:
Multiple threads try to write to logs_2.sqlite
↓
SQLite WAL lock is held by thread C (checkpointing)
↓
Thread A/B receive SQLITE_BUSY or timeout
↓
Rust code calls .unwrap() on the error Result
(codex is compiled with panic = "abort")
↓
panic!("config persistence task panicked: ...")
↓
std::process::abort()
↓
__builtin_trap()
↓
Kernel delivers SIGTRAP (signal 5) to the process
↓
Node.js wrapper (bin/codex.js) detects child exited by signal,
kills itself with the same signal
↓
zsh captures: "zsh: trace trap codex"
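The first steps of this chain (one writer holds the lock, a second writer gets SQLITE_BUSY) are easy to reproduce in isolation. The sketch below uses Python's standard `sqlite3` module on a throwaway database; it shows the error being handled as an ordinary recoverable condition, which is the opposite of calling `.unwrap()` on it. Connection names and the schema are assumptions made for the demo.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "busy_demo.sqlite")

# isolation_level=None means autocommit, so BEGIN/COMMIT below are explicit.
holder = sqlite3.connect(db_path, isolation_level=None)
holder.execute("PRAGMA journal_mode=WAL")
holder.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")  # assumed schema

# timeout=0 disables the busy wait: a locked database fails immediately.
contender = sqlite3.connect(db_path, timeout=0, isolation_level=None)

holder.execute("BEGIN IMMEDIATE")  # take the single WAL write lock
holder.execute("INSERT INTO logs (message) VALUES ('from holder')")

try:
    contender.execute("INSERT INTO logs (message) VALUES ('from contender')")
    outcome = "written"
except sqlite3.OperationalError as exc:
    # Recoverable path: record the failure instead of aborting the process.
    outcome = f"skipped: {exc}"

holder.execute("COMMIT")

# Once the lock is released, the same statement succeeds.
contender.execute("INSERT INTO logs (message) VALUES ('from contender')")
total = contender.execute("SELECT COUNT(*) FROM logs").fetchone()[0]

holder.close()
contender.close()
print(outcome, total)
```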
Evidence for the panic string: The Rust binary contains these embedded panic messages:
$ strings codex-vendor-binary | grep panic
"login server thread panicked: "
"config persistence task panicked: " ← MATCH: exact crash source
"aborting due to panic at "
"thread panicked while processing panic. aborting."
5. macOS disk-write resource monitoring as detection mechanism
All three crash reports were generated by macOS’s I/O resource limit monitor. The system detected that codex exceeded the per-process write throttle:
Writes limit: 2,147.48 MB over 86,400 seconds (24.86 KB/s avg)
Codex actual: 2,147.96 MB over ~6,306 seconds (340 KB/s avg = 13.7× limit)
The diagnostic was triggered because codex’s write rate was 13.7× the allowed average.
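The 13.7× figure follows directly from the two rates. A short worked check, using only the numbers quoted from the Event 3 report (MB taken as 1,000 KB):

```python
# Values copied from the diagnostic report for Event 3.
limit_mb, limit_window_s = 2147.48, 86_400
actual_mb, actual_window_s = 2147.96, 6_306

limit_kb_per_s = limit_mb * 1000 / limit_window_s     # allowed average rate
actual_kb_per_s = actual_mb * 1000 / actual_window_s  # observed average rate
ratio = actual_kb_per_s / limit_kb_per_s

print(round(limit_kb_per_s, 2))  # 24.86
print(round(ratio, 1))           # 13.7

# The limit itself is 2**31 bytes expressed in decimal megabytes.
print(round(2**31 / 1e6, 2))     # 2147.48
```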
Reproduction
Natural reproduction
- Use codex CLI regularly for several weeks, allowing logs_2.sqlite to accumulate to >200MB
- Run a codex session with moderate activity (plugin installations, MCP interactions)
- Crash manifests as zsh: trace trap codex after 30–105 minutes
Simulated reproduction (faster)
- Manually inflate ~/.codex/logs_2.sqlite to >200MB with synthetic data
- Run codex and observe concurrent write contention during WAL checkpoint
- The panic occurs in config persistence task when unwrap() receives an error from SQLite
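For the inflation step, a sketch along the following lines could generate the synthetic data. It is written against a scratch file with an assumed two-column `logs` table; the real schema of logs_2.sqlite is not shown in this report, so the INSERT would need to be adapted before pointing it at a copy of the actual database. `inflate` is a hypothetical helper name.

```python
import os
import sqlite3
import tempfile


def inflate(path: str, target_bytes: int, batch_rows: int = 1000) -> int:
    """Insert filler rows until the database file reaches target_bytes.

    Returns the number of rows inserted.
    """
    conn = sqlite3.connect(path)
    # Assumed schema; adapt the columns to the real table before use.
    conn.execute("CREATE TABLE IF NOT EXISTS logs (id INTEGER PRIMARY KEY, message TEXT)")
    payload = "x" * 1000  # filler text; the size is arbitrary
    inserted = 0
    while os.path.getsize(path) < target_bytes:
        conn.executemany(
            "INSERT INTO logs (message) VALUES (?)",
            [(payload,)] * batch_rows,
        )
        conn.commit()  # each commit grows the main file in the default journal mode
        inserted += batch_rows
    conn.close()
    return inserted


# Small target so the demo finishes quickly; the report's threshold is >200MB.
scratch = os.path.join(tempfile.mkdtemp(), "inflate_demo.sqlite")
inserted = inflate(scratch, target_bytes=1_000_000, batch_rows=100)
final_size = os.path.getsize(scratch)
print(inserted, final_size)
```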
Impact
- User impact: CLI becomes non-functional after extended usage — sessions crash mid-operation
- Data risk: Crash during SQLite write could leave database in inconsistent state (though WAL mode provides some protection)
- Frequency: Once the database reaches ~200MB, crashes become inevitable within 30–105 minutes of active use
- Scope: Affects any codex CLI installation used for weeks/months without log cleanup
Suggested Fix
Immediate mitigations (code-level)
- Add SQLite write retry with exponential backoff instead of unwrap() on write errors in config persistence task, plugin cache, and MCP cache modules
- Replace unwrap()/expect() with proper error handling (match/?) in async tasks, so transient database errors don’t cause a process abort
- Use panic = "unwind" instead of panic = "abort" so panics can be caught in catch_unwind boundaries within tokio task spawns
Long-term fixes (architecture-level)
- Implement log rotation: set a maximum size (e.g., 50MB or 10,000 rows), delete the oldest entries once it is exceeded, and reclaim the freed pages (VACUUM or auto_vacuum)
- Use separate SQLite connection pools or databases so log writing, plugin cache, and MCP cache don’t contend on the same WAL lock
- Schedule WAL checkpoints carefully: use PRAGMA wal_autocheckpoint to control checkpoint frequency, or run checkpoint in a dedicated background thread with lower priority
- Consider switching logs to append-only format (JSONL files instead of SQLite) to avoid SQLite contention entirely
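Row-based rotation combined with an explicit checkpoint might look like the sketch below. It assumes a `logs` table with an increasing integer `id`, which the report does not confirm, and `rotate_logs` is a hypothetical helper; the row cap is arbitrary.

```python
import os
import sqlite3
import tempfile


def rotate_logs(conn: sqlite3.Connection, max_rows: int) -> int:
    """Keep only the newest max_rows entries; return how many rows were deleted."""
    cur = conn.execute(
        "DELETE FROM logs WHERE id NOT IN "
        "(SELECT id FROM logs ORDER BY id DESC LIMIT ?)",
        (max_rows,),
    )
    conn.commit()
    # Fold the WAL into the main file and reset it to zero bytes.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    return cur.rowcount


db_path = os.path.join(tempfile.mkdtemp(), "rotate_demo.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
# 1000 pages is SQLite's default auto-checkpoint threshold; set here to show the knob.
conn.execute("PRAGMA wal_autocheckpoint=1000")
conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")  # assumed schema
conn.executemany("INSERT INTO logs (message) VALUES (?)", [("entry",)] * 500)
conn.commit()

deleted = rotate_logs(conn, max_rows=100)
remaining, oldest_id = conn.execute("SELECT COUNT(*), MIN(id) FROM logs").fetchone()
conn.close()
print(deleted, remaining, oldest_id)  # 400 100 401
```

Deleting rows frees pages inside the file but does not shrink it; a periodic VACUUM would be needed to return that space to the filesystem.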
Workaround for users
# Quit every codex process first, then move the oversized log database aside as a backup
# (codex will recreate it)
mkdir -p ~/.codex/backup && mv ~/.codex/logs_2.sqlite* ~/.codex/backup/
# Keep WAL files small afterwards (add to crontab). A checkpoint folds the WAL back into
# the database safely; deleting a live -wal file by hand can lose committed writes.
# 0 0 * * 0 find ~/.codex -name "*.sqlite" -size +100M -exec sqlite3 {} "PRAGMA wal_checkpoint(TRUNCATE);" \;
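The backup-then-clear approach from the TL;DR can also be done through SQLite's online backup API, which copies a consistent snapshot even when a WAL file is present. A minimal sketch, demonstrated on scratch files; `backup_then_clear` is a hypothetical helper, the `logs` table name comes from the query shown earlier in this report, and codex should not be running while the real database is cleared.

```python
import os
import sqlite3
import tempfile


def backup_then_clear(db_path: str, backup_path: str, table: str = "logs") -> None:
    """Copy the whole database to backup_path, then empty the log table."""
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)  # online backup: copies a consistent snapshot
        src.execute(f'DELETE FROM "{table}"')
        src.commit()
        src.execute("VACUUM")  # return the freed pages to the filesystem
    finally:
        dst.close()
        src.close()


# Demo on scratch files so nothing under ~/.codex is touched.
workdir = tempfile.mkdtemp()
live = os.path.join(workdir, "live.sqlite")
saved = os.path.join(workdir, "saved.sqlite")

seed = sqlite3.connect(live)
seed.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")  # assumed schema
seed.executemany("INSERT INTO logs (message) VALUES (?)", [("entry",)] * 50)
seed.commit()
seed.close()

backup_then_clear(live, saved)

check_live = sqlite3.connect(live)
check_saved = sqlite3.connect(saved)
live_rows = check_live.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
saved_rows = check_saved.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
check_live.close()
check_saved.close()
print(live_rows, saved_rows)  # 0 50
```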
Appendix A: Native Binary Details
Path: vendor/x86_64-apple-darwin/bin/codex
Type: Mach-O 64-bit executable x86_64
Min OS: macOS 10.12
SDK: 15.5
Rustc: 1.95.0 (59807616e1fa2540724bfbac14d7976d7e4a3860)
Linked Frameworks
AppKit, CoreGraphics, IOKit, CoreFoundation, CoreServices,
SystemConfiguration, Foundation, Security,
libSystem.B.dylib, libobjc.A.dylib, libz.1.dylib, libiconv.2.dylib
Bundled Tools
vendor/x86_64-apple-darwin/codex-path/rg ← ripgrep (Mach-O x86_64)
vendor/x86_64-apple-darwin/codex-resources/zsh/bin/zsh ← bundled zsh (Mach-O x86_64)
Appendix B: macOS Diagnostic Report Summary
All three .diag files are Microstackshots triggered by Event: disk writes with the same write limit (2,147.48 MB). The consistent pattern across all three reports:
- All threads blocked in write()/pwrite() syscalls from libsystem_kernel.dylib
- Write paths converging on SQLite operations (sqlite3_exec, sqlite3_step, sqlite3_wal_checkpoint_v2)
- Rust async runtime (tokio::runtime::scheduler::multi_thread::worker::run) spawning multiple write-heavy tasks
- Process footprint growing by 42~103MB over the session duration
Report generated on 2026-06-20