A codex crash bug I found today: root-cause analysis and fix

2401_86255357 · Published 2026-06-21 00:13:41

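TL;DR:
codex has kept crashing and exiting over the past few days. After analyzing it with Claude Code, I found a bug in codex.
Fix:
Back up codex's log database, then clear it. I have not yet received a response from the official team.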

Bug Report: codex CLI crashes with `zsh: trace trap` due to SQLite WAL checkpoint contention

Metadata

Field	Value
Bug title	CLI crashes with SIGTRAP (`trace trap`) when `logs_2.sqlite` exceeds ~200MB
Product	OpenAI Codex CLI (`@openai/codex` npm package)
Version	`0.141.0`
OS	macOS 14.5 (23F79) x86_64
Hardware	MacBookPro16,1 — Intel Core i7-9750H
Severity	High — renders CLI unusable after extended usage
Date discovered	2026-06-19 ~ 2026-06-20 (3 separate crash events)

Summary

The codex CLI binary (a Rust native executable spawned by the Node.js wrapper) crashes with SIGTRAP once the logs_2.sqlite database grows beyond ~200MB. The root cause appears to be a Rust panic in a background task, triggered by write contention between multiple threads writing to the same SQLite database in WAL mode and compounded by heavy WAL checkpoint I/O. Because the binary is compiled with panic = "abort", the panic aborts the process instead of unwinding; on macOS this surfaces as SIGTRAP, and the terminal prints zsh: trace trap codex.

Three crash events were recorded by macOS as .diag files in /Library/Logs/DiagnosticReports/, all triggered by the system disk writes resource monitor after exceeding 2,147 MB of file-backed memory writes within a single session.

Environment

System

ProductName:            macOS
ProductVersion:         14.5
BuildVersion:           23F79
Architecture:           x86_64 (Intel)
CPU:                    Intel Core i7-9750H @ 2.60GHz
Active CPUs:            12

codex installation

Package: @openai/codex@0.141.0 (installed globally via npm)
Node.js: v23.11.0 (Homebrew, /usr/local/Cellar/node/23.11.0/)
Installation path: /usr/local/Cellar/node/23.11.0/lib/node_modules/@openai/codex/
Architecture: Two-layer —
- Wrapper: Node.js ESM script (bin/codex.js) — thin Node.js launcher
- Native binary: Rust-compiled Mach-O 64-bit executable at node_modules/@openai/codex-darwin-x64/vendor/x86_64-apple-darwin/bin/codex
- Rust toolchain: 1.95.0
- Compilation mode: panic = "abort" (confirmed via embedded strings)
Code signature:
- Identifier: codex
- Team: OpenAI OpCo, LLC (2DC432GLL2)
- Format: Mach-O thin (x86_64), Hardened Runtime enabled
- Runtime Version: 15.5.0

SQLite databases in `~/.codex/`

File	Size	Description
`logs_2.sqlite`	229,408,768 bytes (~229 MB)	Log entries (92,060 rows)
`logs_2.sqlite-wal`	16,463,552 bytes (~16 MB)	WAL journal
`logs_2.sqlite-shm`	32,768 bytes	Shared memory index
`state_5.sqlite`	1,335,296 bytes	Session/state data
`state_5.sqlite-wal`	4,152,992 bytes	WAL journal
`goals_1.sqlite`	24,576 bytes	Goals data
`memories_1.sqlite`	40,960 bytes	Memories data

Crash Events — Timeline

Event 1 (earliest, most severe)

Field	Value
Time	2026-06-19 05:21:06 ~ 06:19:29 (+0800)
Duration	3,502 seconds (~58 minutes)
PID	65952
Path	`/Applications/Codex.app/Contents/Resources/codex`
Samples	110 (100% in kernel-mode write)
Threads	8
Disk writes	2,147.52 MB over 3,502s (613 KB/s avg)
CPU time	52.853s
Memory	60.20 MB → 130.43 MB (+70.23 MB)
Stack	67% `sqlite3_exec → sqlite3_step → pwrite`; 36% inside `sqlite3_wal_checkpoint_v2 → pwrite`

Key: WAL checkpoint consumed 36% of all CPU samples — checkpointing a 229MB database was the bottleneck.

Event 2

Field	Value
Time	2026-06-19 21:20:51 ~ 21:57:14 (+0800)
Duration	2,184 seconds (~36 minutes)
PID	19112
Path	`/Applications/Codex.app/Contents/MacOS/Codex` (Desktop app)
Samples	44
Threads	7
Disk writes	2,147.54 MB over 2,184s (983 KB/s avg)
Memory	96.77 MB → 199.92 MB (+103 MB, max 486 MB)
Stack	93% `uv_cancel → uv__fs_post → write` (libuv async filesystem)

Key: Desktop app also triggered the same disk-write limit, suggesting the shared ~/.codex/ state directory was the common factor.

Event 3 (latest)

Field	Value
Time	2026-06-20 01:31:12 ~ 03:11:21 (+0800)
Duration	6,306 seconds (~105 minutes)
PID	1638
Path	`/Users/USER/*/codex` (CLI npm binary)
Samples	16
Threads	9
Disk writes	2,147.96 MB over 6,306s (340 KB/s avg)
Memory	59.79 MB → 102.76 MB (+42.97 MB)
Stack	Multiple concurrent write paths (plugin cache, MCP cache, SQLite logs)

Key: 9 threads all concurrently writing to the same underlying database file. Three distinct write paths identified.

Root Cause Analysis

1. Unlimited log accumulation

Codex writes all operational logs to ~/.codex/logs_2.sqlite with no rotation, size cap, or retention policy:

$ sqlite3 ~/.codex/logs_2.sqlite "SELECT COUNT(*) FROM logs;"
92060

After extended usage, the database had grown to 229 MB with 92,060 log entries (roughly 2.5 KB per entry), and write operations had become very slow.
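
The row count alone does not show how much of the file is live data. As a rough sketch, the snippet below uses Python's standard sqlite3 module (chosen purely for illustration; it is not part of codex) to read the row count, the page-level file size, and the size of the free pages. It assumes the table is named logs, as in the query above; log_db_stats is a hypothetical helper name.

```python
import os
import sqlite3


def log_db_stats(db_path: str) -> dict[str, int]:
    """Return the row count and page-level size figures for a log database."""
    if not os.path.exists(db_path):
        # sqlite3.connect() would silently create an empty file otherwise.
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
        # Table name `logs` is taken from the query shown in the report.
        rows = conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
    finally:
        conn.close()
    return {
        "rows": rows,
        "db_bytes": page_size * page_count,
        "free_bytes": page_size * free_pages,
    }
```

It is safer to run this against a copy of the database than against the live file while codex is running.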

2. SQLite WAL checkpoint contention

Codex uses SQLite in WAL (Write-Ahead Logging) mode. Under WAL mode:

- All writes go to the WAL journal file first (fast)
- Periodically, SQLite runs a WAL checkpoint to merge the WAL back into the main database file (slow, especially with 229MB of data)

The wal_checkpoint_v2 call appeared in 36% of all sampled frames in Event 1:

sqlite3_step + 642
  → ??? (codex + ...)
    → sqlite3_wal_checkpoint_v2 + 840   ← checkpoint merge back to main DB
      → ??? (codex + ...)
        → pwrite + 10 (libsystem_kernel)
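
To make the WAL behaviour described in this section concrete, here is a minimal sketch using Python's standard sqlite3 module on a throwaway database. It is an illustration only, not codex's code: the table layout and the wal_demo name are assumptions. It writes rows in WAL mode, measures the -wal file, then forces a checkpoint with PRAGMA wal_checkpoint(TRUNCATE) and measures again.

```python
import os
import sqlite3
import tempfile


def wal_demo(rows: int = 1000) -> tuple[int, int, int]:
    """Write rows in WAL mode, then checkpoint.

    Returns (wal_bytes_before, wal_bytes_after, busy_flag).
    """
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.sqlite")
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
            conn.executemany(
                "INSERT INTO logs (message) VALUES (?)",
                (("x" * 200,) for _ in range(rows)),
            )
            conn.commit()  # committed data now sits in the -wal file
            wal_before = os.path.getsize(db_path + "-wal")
            # The pragma returns one row: (busy, wal_frames, checkpointed_frames).
            busy, _frames, _done = conn.execute(
                "PRAGMA wal_checkpoint(TRUNCATE)"
            ).fetchone()
            wal_after = os.path.getsize(db_path + "-wal")
            return wal_before, wal_after, busy
        finally:
            conn.close()
```

The checkpoint is the step that copies pages from the WAL into the main database file, which is where the pwrite calls in the sampled stack come from. A non-zero busy flag would mean the checkpoint could not complete because another connection was in the way.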

3. Multiple concurrent write paths without adequate coordination

From the sampled stacks, at least 3 distinct write paths were executing simultaneously on the same logs_2.sqlite:

Path	Rust function	Operation
A: Plugin cache	`write_cached_global_directory_plugins`	Writes plugin catalog cache
B: MCP tools cache	`write_cached_codex_apps_tools_if_needed`	Writes MCP app tool definitions
C: Log writer	`sqlite3_exec` / `sqlite3_step` (via `sqlx`)	Inserts log entries into the `logs` table

All three paths ultimately call pwrite() on the same file descriptor for logs_2.sqlite.
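
SQLite allows only one writer at a time per database, even in WAL mode. The sketch below reproduces that in isolation with Python's standard sqlite3 module: one connection holds the write lock while a second one, configured with no busy timeout, tries to insert. The connection names and the table are hypothetical and only stand in for the competing write paths listed above.

```python
import os
import sqlite3
import tempfile


def second_writer_is_rejected() -> str:
    """Return the error message the second writer receives."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "demo.sqlite")
        # isolation_level=None gives autocommit, so BEGIN/COMMIT are explicit.
        # timeout=0 disables the busy handler: a locked database fails at once.
        holder = sqlite3.connect(db_path, timeout=0, isolation_level=None)
        other = sqlite3.connect(db_path, timeout=0, isolation_level=None)
        try:
            holder.execute("PRAGMA journal_mode=WAL")
            holder.execute("CREATE TABLE logs (message TEXT)")
            holder.execute("BEGIN IMMEDIATE")  # take the write lock
            holder.execute("INSERT INTO logs VALUES ('from holder')")
            try:
                other.execute("INSERT INTO logs VALUES ('from other')")
            except sqlite3.OperationalError as exc:
                return str(exc)  # SQLITE_BUSY surfaces here
            finally:
                holder.execute("COMMIT")
            return "no contention"
        finally:
            other.close()
            holder.close()


if __name__ == "__main__":
    print(second_writer_is_rejected())
```

In Python the SQLITE_BUSY condition arrives as an exception that the caller can handle. The equivalent in Rust is an Err value, which only becomes fatal if the caller unwraps it.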

4. SQLITE_BUSY → Rust panic → abort → SIGTRAP

The call chain, as reconstructed from the sampled stacks and the embedded panic strings:

Multiple threads try to write to logs_2.sqlite
  ↓
SQLite WAL lock is held by thread C (checkpointing)
  ↓
Thread A/B receive SQLITE_BUSY or timeout
  ↓
Rust code calls .unwrap() on the error Result
(codex is compiled with panic = "abort")
  ↓
panic!("config persistence task panicked: ...")
  ↓
std::process::abort()
  ↓
__builtin_trap()
  ↓
Kernel delivers SIGTRAP (signal 5) to the process
  ↓
Node.js wrapper (bin/codex.js) detects child exited by signal,
kills itself with the same signal
  ↓
zsh captures: "zsh: trace trap  codex"
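
The last two steps of the chain (a parent process noticing that its child died from a signal) can be observed in isolation. The following sketch is a generic POSIX illustration in Python, not the actual logic of bin/codex.js: a child process sends itself SIGTRAP, and the parent reads the result. run_child_that_traps is a hypothetical name.

```python
import signal
import subprocess
import sys


def run_child_that_traps() -> int:
    """Start a child that sends itself SIGTRAP; return the child's returncode."""
    child_code = "import os, signal; os.kill(os.getpid(), signal.SIGTRAP)"
    completed = subprocess.run([sys.executable, "-c", child_code])
    return completed.returncode


if __name__ == "__main__":
    code = run_child_that_traps()
    if code < 0:
        # subprocess reports death-by-signal as the negated signal number.
        print(f"child killed by {signal.Signals(-code).name} (signal {-code})")
```

Shells such as zsh and bash report the same event as exit status 128 + 5 = 133 and print the signal's description, which is where the "trace trap" text comes from.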

Evidence for the panic string: The Rust binary contains these embedded panic messages:

$ strings codex-vendor-binary | grep panic
"login server thread panicked: "
"config persistence task panicked: "        ← MATCH: exact crash source
"aborting due to panic at "
"thread panicked while processing panic. aborting."

5. macOS disk-write resource monitoring as detection mechanism

All three crash reports were generated by macOS’s I/O resource limit monitor. The system detected that codex exceeded the per-process write throttle:

Writes limit:  2,147.48 MB over 86,400 seconds (24.86 KB/s avg)
Codex actual:  2,147.96 MB over ~6,306 seconds (340 KB/s avg = 13.7× limit)

The diagnostic was triggered because codex’s write rate was 13.7× the allowed average.
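
The figures above can be rechecked with a few lines of arithmetic. This sketch assumes decimal units (1 MB = 1000 KB), which is what makes the reported numbers line up; note that 2,147.48 MB corresponds to 2^31 bytes (2,147,483,648).

```python
def kb_per_second(megabytes: float, seconds: float) -> float:
    """Average write rate in KB/s, using decimal units (1 MB = 1000 KB)."""
    return megabytes * 1000 / seconds


limit_rate = kb_per_second(2147.48, 86_400)  # macOS limit window: 24 hours
actual_rate = kb_per_second(2147.96, 6_306)  # Event 3, duration as reported
ratio = actual_rate / limit_rate

print(f"limit  {limit_rate:.2f} KB/s")
print(f"actual {actual_rate:.2f} KB/s")
print(f"ratio  {ratio:.1f}x")
```

The same function applied to Events 1 and 2 (3,502 s and 2,184 s) gives the 613 KB/s and 983 KB/s averages listed in the timeline.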

Reproduction

Natural reproduction

- Use codex CLI regularly for several weeks, allowing logs_2.sqlite to grow beyond 200MB
- Run a codex session with moderate activity (plugin installations, MCP interactions)
- The crash manifests as zsh: trace trap codex after 30–105 minutes

Simulated reproduction (faster)

- Manually inflate ~/.codex/logs_2.sqlite to >200MB with synthetic data
- Run codex and observe concurrent write contention during WAL checkpoint
- The panic occurs in the config persistence task when unwrap() receives an error from SQLite
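
One possible way to produce the synthetic data for the first step is sketched below in Python. The schema here (id, ts, message) is an assumption made for illustration; the real layout of the logs table is not documented in this report, so a faithful reproduction would need to match whatever schema codex actually creates. inflate_log_db is a hypothetical helper, and it should be pointed at a scratch file, not at a database you want to keep.

```python
import os
import sqlite3
import time


def inflate_log_db(db_path: str, target_bytes: int, batch_rows: int = 1000) -> int:
    """Insert ~1 KB rows until the file reaches target_bytes; return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        # Hypothetical schema, used only to generate bulk data.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS logs "
            "(id INTEGER PRIMARY KEY, ts REAL, message TEXT)"
        )
        payload = "x" * 1024
        while os.path.getsize(db_path) < target_bytes:
            conn.executemany(
                "INSERT INTO logs (ts, message) VALUES (?, ?)",
                ((time.time(), payload) for _ in range(batch_rows)),
            )
            conn.commit()  # flush the batch so the file size is up to date
        return conn.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
    finally:
        conn.close()
```

For example, inflate_log_db("/tmp/logs_test.sqlite", 200 * 1000 * 1000) would build a file of roughly 200 MB; the /tmp path is only a placeholder.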

Impact

- User impact: CLI becomes non-functional after extended usage, with sessions crashing mid-operation
- Data risk: A crash during a SQLite write could leave the database in an inconsistent state (though WAL mode provides some protection)
- Frequency: Once the database reaches ~200MB, crashes become inevitable within 30–105 minutes of active use
- Scope: Affects any codex CLI installation used for weeks/months without log cleanup

Suggested Fix

Immediate mitigations (code-level)

- Add SQLite write retry with exponential backoff instead of unwrap() on write errors in the config persistence task, plugin cache, and MCP cache modules
- Replace unwrap() / expect() with proper error handling (match / ?) in async tasks, so transient database errors don’t cause a process abort
- Use panic = "unwind" instead of panic = "abort" so that a panic inside a spawned tokio task unwinds and is reported through the task's JoinHandle instead of aborting the whole process
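
The retry-with-backoff idea from the first mitigation can be sketched as follows. This is a Python analogue for illustration, not the Rust change itself: in Rust the same shape would be a loop that matches on the Err value instead of unwrapping it. The helper name with_retry, the attempt count, and the delays are all assumptions.

```python
import sqlite3
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(), retrying busy/locked errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except sqlite3.OperationalError as exc:
            transient = "locked" in str(exc) or "busy" in str(exc)
            if not transient or attempt == attempts - 1:
                raise  # not retryable, or out of attempts: let the caller decide
            sleep(base_delay * (2 ** attempt))  # 0.05s, 0.1s, 0.2s, ...
    raise ValueError("attempts must be at least 1")
```

A call site would look like with_retry(lambda: conn.execute("INSERT INTO logs (message) VALUES (?)", ("hello",))). SQLite also has a built-in busy timeout (the timeout argument of sqlite3.connect in Python, or PRAGMA busy_timeout), which covers the simplest cases without any application-level loop.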

Long-term fixes (architecture-level)

- Implement log rotation: set a maximum size (e.g., 50MB or 10,000 rows), delete the oldest entries beyond it, and reclaim the freed pages with VACUUM or auto_vacuum
- Use separate SQLite connection pools or databases so log writing, plugin cache, and MCP cache don’t contend on the same WAL lock
- Schedule WAL checkpoints carefully: use PRAGMA wal_autocheckpoint to control checkpoint frequency, or run checkpoints in a dedicated background thread with lower priority
- Consider switching logs to an append-only format (JSONL files instead of SQLite) to avoid SQLite contention entirely
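
A row-count based rotation, as suggested in the first item, could look roughly like this. The sketch uses Python's sqlite3 module and assumes a logs table with a monotonically increasing integer id column; both the schema and the rotate_logs name are hypothetical.

```python
import sqlite3


def rotate_logs(conn: sqlite3.Connection, max_rows: int) -> int:
    """Delete all but the newest max_rows rows; return how many were deleted."""
    cur = conn.execute(
        "DELETE FROM logs WHERE id NOT IN "
        "(SELECT id FROM logs ORDER BY id DESC LIMIT ?)",
        (max_rows,),
    )
    conn.commit()
    # DELETE only marks pages as free. VACUUM rewrites the file so the space
    # goes back to the filesystem; it must run outside a transaction.
    conn.execute("VACUUM")
    return cur.rowcount
```

VACUUM rewrites the whole database, so on a file of this size it is itself a heavy write. Running rotation often enough that the file never gets large is the cheaper design.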

Workaround for users

# Quit every running codex process first, then move the oversized log
# database into a backup directory (codex will recreate it)
mkdir -p ~/codex-logs-backup
mv ~/.codex/logs_2.sqlite ~/.codex/logs_2.sqlite-shm ~/.codex/logs_2.sqlite-wal ~/codex-logs-backup/

# Keep WAL files from growing (add to crontab). Do not delete a -wal file
# directly: it can hold committed data that has not been checkpointed yet.
# 0 0 * * 0 find ~/.codex -name "*.sqlite" -size +100M -exec sqlite3 {} "PRAGMA wal_checkpoint(TRUNCATE);" \;

Appendix A: Native Binary Details

Path:    vendor/x86_64-apple-darwin/bin/codex
Type:    Mach-O 64-bit executable x86_64
Min OS:  macOS 10.12
SDK:     15.5
Rustc:   1.95.0 (59807616e1fa2540724bfbac14d7976d7e4a3860)

Linked Frameworks

AppKit, CoreGraphics, IOKit, CoreFoundation, CoreServices,
SystemConfiguration, Foundation, Security,
libSystem.B.dylib, libobjc.A.dylib, libz.1.dylib, libiconv.2.dylib

Bundled Tools

vendor/x86_64-apple-darwin/codex-path/rg           ← ripgrep (Mach-O x86_64)
vendor/x86_64-apple-darwin/codex-resources/zsh/bin/zsh  ← bundled zsh (Mach-O x86_64)

Appendix B: macOS Diagnostic Report Summary

All three .diag files are Microstackshots triggered by Event: disk writes with the same write limit (2,147.48 MB). The consistent pattern across all three reports:

- All threads blocked in write() / pwrite() syscalls from libsystem_kernel.dylib
- Write paths converging on SQLite operations (sqlite3_exec, sqlite3_step, sqlite3_wal_checkpoint_v2)
- Rust async runtime (tokio::runtime::scheduler::multi_thread::worker::run) spawning multiple write-heavy tasks
- Process footprint growing by 42–103 MB over the session duration