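TL;DR:
Codex has kept crashing and exiting over the past few days. After an analysis with Claude Code, the cause turned out to be a bug in codex.
Workaround:
Back up codex's log database, then clear it. There has been no official response so far.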

Bug Report: codex CLI crashes with zsh: trace trap due to SQLite WAL checkpoint contention

Metadata

Field            Value
Bug title        CLI crashes with SIGTRAP (trace trap) when logs_2.sqlite exceeds ~200MB
Product          OpenAI Codex CLI (@openai/codex npm package)
Version          0.141.0
OS               macOS 14.5 (23F79) x86_64
Hardware         MacBookPro16,1 (Intel Core i7-9750H)
Severity         High: renders CLI unusable after extended usage
Date discovered  2026-06-19 ~ 2026-06-20 (3 separate crash events)

Summary

The codex CLI binary (a Rust native executable spawned by the Node.js wrapper) crashes with SIGTRAP once the logs_2.sqlite database grows beyond ~200MB. The root cause is a Rust panic in a background task, triggered by write contention between multiple threads writing to the same SQLite database in WAL mode and compounded by heavy WAL checkpoint I/O. Because the binary is compiled with panic = "abort", the panic terminates the process instead of unwinding; on this machine the termination surfaces as SIGTRAP, and the shell prints zsh: trace trap codex.

Three crash events were recorded by macOS as .diag files in /Library/Logs/DiagnosticReports/. All three were triggered by the system's disk-writes resource monitor after the process wrote more than 2,147 MB to disk within a single session.


Environment

System

ProductName:            macOS
ProductVersion:         14.5
BuildVersion:           23F79
Architecture:           x86_64 (Intel)
CPU:                    Intel Core i7-9750H @ 2.60GHz
Active CPUs:            12

codex installation

  • Package: @openai/codex@0.141.0 (installed globally via npm)
  • Node.js: v23.11.0 (Homebrew, /usr/local/Cellar/node/23.11.0/)
  • Installation path: /usr/local/Cellar/node/23.11.0/lib/node_modules/@openai/codex/
  • Architecture: Two-layer —
    • Wrapper: Node.js ESM script (bin/codex.js) — thin Node.js launcher
    • Native binary: Rust-compiled Mach-O 64-bit executable at node_modules/@openai/codex-darwin-x64/vendor/x86_64-apple-darwin/bin/codex
    • Rust toolchain: 1.95.0
    • Compilation mode: panic = "abort" (confirmed via embedded strings)
  • Code signature:
    • Identifier: codex
    • Team: OpenAI OpCo, LLC (2DC432GLL2)
    • Format: Mach-O thin (x86_64), Hardened Runtime enabled
    • Runtime Version: 15.5.0

SQLite databases in ~/.codex/

File                 Size                          Description
logs_2.sqlite        229,408,768 bytes (~229 MB)   Log entries (92,060 rows)
logs_2.sqlite-wal    16,463,552 bytes (~16 MB)     WAL journal
logs_2.sqlite-shm    32,768 bytes                  Shared memory index
state_5.sqlite       1,335,296 bytes               Session/state data
state_5.sqlite-wal   4,152,992 bytes               WAL journal
goals_1.sqlite       24,576 bytes                  Goals data
memories_1.sqlite    40,960 bytes                  Memories data

Crash Events — Timeline

Event 1 (earliest, most severe)

Field        Value
Time         2026-06-19 05:21:06 ~ 06:19:29 (+0800)
Duration     3,502 seconds (~58 minutes)
PID          65952
Path         /Applications/Codex.app/Contents/Resources/codex
Samples      110 (100% in kernel-mode write)
Threads      8
Disk writes  2,147.52 MB over 3,502s (613 KB/s avg)
CPU time     52.853s
Memory       60.20 MB → 130.43 MB (+70.23 MB)
Stack        67% sqlite3_exec → sqlite3_step → pwrite
             36% inside sqlite3_wal_checkpoint_v2 → pwrite

Key: WAL checkpoint consumed 36% of all CPU samples — checkpointing a 229MB database was the bottleneck.

Event 2

Field        Value
Time         2026-06-19 21:20:51 ~ 21:57:14 (+0800)
Duration     2,184 seconds (~36 minutes)
PID          19112
Path         /Applications/Codex.app/Contents/MacOS/Codex (Desktop app)
Samples      44
Threads      7
Disk writes  2,147.54 MB over 2,184s (983 KB/s avg)
Memory       96.77 MB → 199.92 MB (+103 MB, max 486 MB)
Stack        93% uv_cancel → uv__fs_post → write (libuv async filesystem)

Key: Desktop app also triggered the same disk-write limit, suggesting the shared ~/.codex/ state directory was the common factor.

Event 3 (latest)

Field        Value
Time         2026-06-20 01:31:12 ~ 03:11:21 (+0800)
Duration     6,306 seconds (~105 minutes)
PID          1638
Path         /Users/USER/*/codex (CLI npm binary)
Samples      16
Threads      9
Disk writes  2,147.96 MB over 6,306s (340 KB/s avg)
Memory       59.79 MB → 102.76 MB (+42.97 MB)
Stack        Multiple concurrent write paths (plugin cache, MCP cache, SQLite logs)

Key: 9 threads all concurrently writing to the same underlying database file. Three distinct write paths identified.


Root Cause Analysis

1. Unlimited log accumulation

Codex writes all operational logs to ~/.codex/logs_2.sqlite with no rotation, size cap, or retention policy:

$ sqlite3 ~/.codex/logs_2.sqlite "SELECT COUNT(*) FROM logs;"
92060

After extended usage, the database had grown to 229 MB with 92,060 log entries (about 2.5 KB per entry on average), and writes against it had become slow enough to dominate the sampled stacks.
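
The same check can be done without the sqlite3 CLI. The following is a minimal Python sketch, not part of codex: it reads the page count and page size to estimate the file size and counts the rows in the logs table (the table name comes from the query above). The function names db_stats and demo, the scratch file, and its two-column schema are made up for illustration; the real columns of codex's logs table are not known here.

```python
import sqlite3
import tempfile
from pathlib import Path


def db_stats(db_path: Path, table: str = "logs") -> dict:
    """Return the row count and size (page_count * page_size) of a SQLite file."""
    # Open read-only so the inspection itself cannot add to the write load.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        # Table names cannot be bound as parameters; `table` is trusted input here.
        rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        conn.close()
    return {"rows": rows, "bytes": page_count * page_size}


def demo() -> dict:
    """Build a small scratch database (hypothetical schema) and inspect it."""
    db = Path(tempfile.mkdtemp()) / "scratch.sqlite"
    conn = sqlite3.connect(db)
    conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
    with conn:
        conn.executemany("INSERT INTO logs (message) VALUES (?)", [("entry",)] * 100)
    conn.close()
    return db_stats(db)


print(demo())
```

Pointing db_stats at Path.home() / ".codex" / "logs_2.sqlite" would report the figures for the real log database, assuming it still uses a table named logs.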

2. SQLite WAL checkpoint contention

Codex uses SQLite in WAL (Write-Ahead Logging) mode. Under WAL mode:

  • All writes go to the WAL file first (fast)
  • Periodically, SQLite runs a checkpoint to copy the WAL pages back into the main database file (slow, especially with 229MB of data). By default this happens automatically once the WAL reaches 1,000 pages

The wal_checkpoint_v2 call appeared in 36% of all sampled frames in Event 1:

sqlite3_step + 642
  → ??? (codex + ...)
    → sqlite3_wal_checkpoint_v2 + 840   ← checkpoint merge back to main DB
      → ??? (codex + ...)
        → pwrite + 10 (libsystem_kernel)
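
To make the WAL-then-checkpoint sequence concrete, here is a small Python sketch on a scratch database. It is an illustration of SQLite's behaviour, not of codex's code: the file name scratch.sqlite, the two-column logs schema, and the function name wal_demo are assumptions. Auto-checkpointing is switched off so that the checkpoint step is visible on its own.

```python
import sqlite3
import tempfile
from pathlib import Path


def wal_demo(workdir: Path, rows: int = 2000) -> tuple:
    """Write rows in WAL mode, checkpoint, and return (wal_before, wal_after, busy)."""
    db = workdir / "scratch.sqlite"        # hypothetical scratch file
    wal = workdir / "scratch.sqlite-wal"   # SQLite names the WAL after the database
    conn = sqlite3.connect(db)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA wal_autocheckpoint=0")  # disable automatic checkpoints
    conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
    with conn:  # one transaction; the changed pages are appended to the WAL
        conn.executemany(
            "INSERT INTO logs (message) VALUES (?)",
            (("x" * 200,) for _ in range(rows)),
        )
    wal_before = wal.stat().st_size
    # TRUNCATE copies the WAL pages into the main file, then truncates the WAL.
    # The first result column is 1 if the checkpoint was blocked, else 0.
    busy = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()[0]
    wal_after = wal.stat().st_size
    conn.close()
    return wal_before, wal_after, busy


before, after, busy = wal_demo(Path(tempfile.mkdtemp()))
print(f"WAL before checkpoint: {before} bytes, after: {after} bytes, busy={busy}")
```

The checkpoint is the step that issues the pwrite calls against the main database file, which is where the sampled stacks above spend their time.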

3. Multiple concurrent write paths without adequate coordination

From the sampled stacks, at least 3 distinct write paths were executing simultaneously on the same logs_2.sqlite:

Path                  Rust function                             Operation
A (plugin cache)      write_cached_global_directory_plugins     Writes plugin catalog cache
B (MCP tools cache)   write_cached_codex_apps_tools_if_needed   Writes MCP app tool definitions
C (log writer)        sqlite3_exec / sqlite3_step (via sqlx)    Inserts log entries into the logs table

All three paths ultimately call pwrite() on the same file, logs_2.sqlite.
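
SQLite allows only one writer at a time, even in WAL mode, so a second writer that does not wait gets SQLITE_BUSY. The Python sketch below provokes that error on a scratch database with two connections. It only illustrates the lock behaviour; the names provoke_busy and scratch.sqlite and the logs schema are hypothetical, and nothing here is taken from codex's source.

```python
import sqlite3
import tempfile
from pathlib import Path


def provoke_busy(workdir: Path) -> str:
    """Hold the write lock on one connection and attempt a write on another."""
    db = workdir / "scratch.sqlite"  # hypothetical scratch file
    # isolation_level=None gives explicit control over BEGIN/COMMIT.
    holder = sqlite3.connect(db, isolation_level=None)
    holder.execute("PRAGMA journal_mode=WAL")
    holder.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
    # timeout=0 makes the second connection fail at once instead of waiting.
    blocked = sqlite3.connect(db, timeout=0, isolation_level=None)
    holder.execute("BEGIN IMMEDIATE")  # takes the single write lock
    holder.execute("INSERT INTO logs (message) VALUES ('from holder')")
    try:
        blocked.execute("INSERT INTO logs (message) VALUES ('from blocked')")
        outcome = "written"
    except sqlite3.OperationalError as exc:  # SQLITE_BUSY surfaces here
        outcome = str(exc)
    finally:
        holder.execute("COMMIT")
        holder.close()
        blocked.close()
    return outcome


print(provoke_busy(Path(tempfile.mkdtemp())))
```

In Python the error arrives as an exception that the caller can handle. The equivalent in Rust is an Err value, which only becomes fatal if the code unwraps it.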

4. SQLITE_BUSY → Rust panic → abort → SIGTRAP

The complete call chain:

Multiple threads try to write to logs_2.sqlite
  ↓
SQLite WAL lock is held by thread C (checkpointing)
  ↓
Thread A/B receive SQLITE_BUSY or timeout
  ↓
Rust code calls .unwrap() on the error Result
(codex is compiled with panic = "abort")
  ↓
panic!("config persistence task panicked: ...")
  ↓
std::process::abort()
  ↓
__builtin_trap()
  ↓
Kernel delivers SIGTRAP (signal 5) to the process
  ↓
Node.js wrapper (bin/codex.js) detects child exited by signal,
kills itself with the same signal
  ↓
zsh captures: "zsh: trace trap  codex"
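
The last steps of this chain (a child killed by a signal, and a parent observing it) can be reproduced in isolation. The Python sketch below is not codex's wrapper: a throwaway Python child stands in for the native binary and sends itself SIGTRAP, and the names CHILD_CODE, run_child, and describe are made up for illustration. It is POSIX only.

```python
import signal
import subprocess
import sys

# Stand-in for the native binary: a child process that dies from SIGTRAP.
CHILD_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGTRAP)"


def run_child() -> int:
    """Run the stand-in child and return its status as seen by the parent."""
    proc = subprocess.run([sys.executable, "-c", CHILD_CODE])
    return proc.returncode


def describe(returncode: int) -> str:
    """Translate a POSIX returncode into 'exit N' or 'signal NAME'."""
    # subprocess reports death by signal N as the negative value -N.
    if returncode < 0:
        return f"signal {signal.Signals(-returncode).name}"
    return f"exit {returncode}"


print(describe(run_child()))
```

A wrapper that wants the shell to report the same signal can re-raise it on itself after seeing the negative return code, which matches the behaviour described for bin/codex.js above.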

Evidence for the panic string: The Rust binary contains these embedded panic messages:

$ strings codex-vendor-binary | grep panic
"login server thread panicked: "
"config persistence task panicked: "        ← MATCH: exact crash source
"aborting due to panic at "
"thread panicked while processing panic. aborting."

5. macOS disk-write resource monitoring as detection mechanism

All three crash reports were generated by macOS’s I/O resource limit monitor. The system detected that codex exceeded the per-process write throttle:

Writes limit:  2,147.48 MB over 86,400 seconds (24.86 KB/s avg)
Codex actual:  2,147.96 MB over ~6,306 seconds (340 KB/s avg = 13.7× limit)

The diagnostic for Event 3 was triggered because codex's average write rate was 13.7× the allowed average. Events 1 and 2 averaged even higher rates (613 KB/s and 983 KB/s).
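
The rates quoted here can be re-derived from the totals and durations in the three event tables. The short Python calculation below uses only those figures; the names kb_per_s, EVENTS, limit_rate, and ratios are illustrative, and the conversion assumes 1 MB = 1000 KB, which is what the report's own averages imply.

```python
LIMIT_MB, LIMIT_WINDOW_S = 2147.48, 86_400  # limit quoted in the .diag reports

# (MB written, seconds) for the three events in this report
EVENTS = {
    "event 1": (2147.52, 3502),
    "event 2": (2147.54, 2184),
    "event 3": (2147.96, 6306),
}


def kb_per_s(megabytes: float, seconds: float) -> float:
    """Average write rate in KB/s, using 1 MB = 1000 KB."""
    return megabytes * 1000 / seconds


limit_rate = kb_per_s(LIMIT_MB, LIMIT_WINDOW_S)
ratios = {name: kb_per_s(mb, s) / limit_rate for name, (mb, s) in EVENTS.items()}

for name, ratio in ratios.items():
    rate = kb_per_s(*EVENTS[name])
    print(f"{name}: {rate:.0f} KB/s, {ratio:.1f}x the limit rate")
```

The limit of 2,147.48 MB is very close to 2^31 bytes (2,147,483,648), which suggests the threshold is a fixed byte budget per 24 hours rather than a rate cap. That reading is an inference from the number, not something the reports state.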


Reproduction

Natural reproduction

  1. Use codex CLI regularly for several weeks, allowing logs_2.sqlite to accumulate to >200MB
  2. Run a codex session with moderate activity (plugin installations, MCP interactions)
  3. Crash manifests as zsh: trace trap codex after 30–105 minutes

Simulated reproduction (faster)

  1. Manually inflate ~/.codex/logs_2.sqlite to >200MB with synthetic data
  2. Run codex and observe concurrent write contention during WAL checkpoint
  3. The panic occurs in config persistence task when unwrap() receives an error from SQLite
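
For step 1 of the simulated reproduction, a filler script along these lines could produce a large database. This is a sketch under loud assumptions: the real schema of codex's logs table is not known, so the two-column table below is hypothetical, and the script should only ever be run against a scratch file, not against the live ~/.codex/logs_2.sqlite. The function name inflate and its parameters are made up for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def inflate(db_path: Path, target_bytes: int, batch: int = 500) -> int:
    """Insert filler rows until the database file reaches target_bytes."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema: codex's real `logs` columns are not known here.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS logs (id INTEGER PRIMARY KEY, message TEXT)"
    )
    filler = "x" * 1024  # roughly 1 KB of payload per row
    inserted = 0
    while db_path.stat().st_size < target_bytes:
        with conn:  # commit each batch so the size on disk advances
            conn.executemany(
                "INSERT INTO logs (message) VALUES (?)", [(filler,)] * batch
            )
        inserted += batch
    conn.close()
    return inserted


scratch = Path(tempfile.mkdtemp()) / "scratch.sqlite"
rows = inflate(scratch, target_bytes=2_000_000)  # small target for a quick run
print(f"{rows} rows, {scratch.stat().st_size} bytes")
```

Raising target_bytes to 200_000_000 would approach the size at which the crashes were observed. Whether codex accepts a database with a different schema is untested, so copying the real database and inflating the copy may be the more faithful route.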

Impact

  • User impact: CLI becomes non-functional after extended usage — sessions crash mid-operation
  • Data risk: Crash during SQLite write could leave database in inconsistent state (though WAL mode provides some protection)
  • Frequency: Once the database reaches ~200MB, crashes become inevitable within 30–105 minutes of active use
  • Scope: Affects any codex CLI installation used for weeks/months without log cleanup

Suggested Fix

Immediate mitigations (code-level)

  1. Add SQLite write retry with exponential backoff instead of unwrap() on write errors in config persistence task, plugin cache, and MCP cache modules
  2. Replace unwrap() / expect() with proper error handling (match / ?) in async tasks, so transient database errors don’t cause a process abort
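  3. Use panic = "unwind" instead of panic = "abort" so that a panic inside a spawned tokio task stays contained (tokio reports it to the caller as a JoinError) or can be caught at catch_unwind boundaries, instead of aborting the whole process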
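
The retry-with-backoff idea in item 1 is language-independent. The sketch below shows it in Python against a scratch SQLite file, since codex's Rust source is not available here; it is not a patch for codex. The names write_with_retry and demo, the default of 6 attempts, and the 50 ms base delay are all illustrative assumptions.

```python
import random
import sqlite3
import tempfile
import threading
import time
from pathlib import Path


def write_with_retry(conn, sql, params=(), attempts=6, base_delay=0.05):
    """Run one write, retrying with exponential backoff while the DB is locked."""
    for attempt in range(attempts):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return attempt  # number of retries that were needed
        except sqlite3.OperationalError as exc:
            # Only lock contention is retried; other errors propagate unchanged.
            if "locked" not in str(exc) and "busy" not in str(exc):
                raise
            if attempt == attempts - 1:
                raise  # hand the error to the caller instead of crashing
            # Delay doubles each round; jitter keeps writers out of lockstep.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


def demo() -> int:
    """Hold the write lock for about 0.2 s, then write from a second connection."""
    db = Path(tempfile.mkdtemp()) / "scratch.sqlite"  # hypothetical scratch file
    holder = sqlite3.connect(db, isolation_level=None, check_same_thread=False)
    holder.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
    holder.execute("BEGIN IMMEDIATE")  # take the write lock
    release = threading.Timer(0.2, lambda: holder.execute("COMMIT"))
    release.start()
    writer = sqlite3.connect(db, timeout=0)  # timeout=0: no built-in waiting
    retries = write_with_retry(
        writer, "INSERT INTO logs (message) VALUES (?)", ("hello",)
    )
    release.join()
    writer.close()
    holder.close()
    return retries


print(f"write succeeded after {demo()} retries")
```

When the final attempt also fails, the error is re-raised to the caller. That is the point of items 1 and 2: the failure stays an ordinary error value that a background task can log and skip.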

Long-term fixes (architecture-level)

  1. Implement log rotation: set a maximum size (e.g., 50MB or 10,000 rows), delete the oldest entries beyond it, and reclaim the freed pages with VACUUM or auto-vacuum
  2. Use separate SQLite connection pools or databases so log writing, plugin cache, and MCP cache don’t contend on the same WAL lock
  3. Schedule WAL checkpoints carefully — use PRAGMA wal_autocheckpoint to control checkpoint frequency, or run checkpoint in a dedicated background thread with lower priority
  4. Consider switching logs to append-only format (JSONL files instead of SQLite) to avoid SQLite contention entirely
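
As an illustration of item 1, a row-count cap can be enforced with a single DELETE. The Python sketch below assumes a logs table with a monotonically increasing integer id column, which is a guess and not codex's confirmed schema; rotate_logs and demo are hypothetical names.

```python
import sqlite3


def rotate_logs(conn: sqlite3.Connection, max_rows: int) -> int:
    """Delete the oldest rows so at most max_rows remain; return rows deleted."""
    with conn:
        # Assumes `id` grows with insertion order (hypothetical schema).
        cur = conn.execute(
            "DELETE FROM logs WHERE id NOT IN "
            "(SELECT id FROM logs ORDER BY id DESC LIMIT ?)",
            (max_rows,),
        )
    return cur.rowcount


def demo() -> tuple:
    """Insert 25 rows, keep the newest 10, and report what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
    with conn:
        conn.executemany(
            "INSERT INTO logs (message) VALUES (?)",
            ((f"entry {i}",) for i in range(25)),
        )
    deleted = rotate_logs(conn, max_rows=10)
    remaining = conn.execute("SELECT COUNT(*), MIN(id) FROM logs").fetchone()
    conn.close()
    return deleted, remaining


print(demo())  # (15, (10, 16)): 15 rows deleted, 10 left, oldest surviving id is 16
```

Deleting rows frees pages inside the file but does not shrink it on disk. Returning the space to the filesystem needs a VACUUM, or auto-vacuum enabled on the database.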

Workaround for users

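# Quit all codex processes, then back up the log database and its sidecar files
# (~/codex-logs-backup is an arbitrary example directory)
mkdir -p ~/codex-logs-backup && cp ~/.codex/logs_2.sqlite* ~/codex-logs-backup/

# Delete the oversized log database (codex will recreate it)
rm -f ~/.codex/logs_2.sqlite ~/.codex/logs_2.sqlite-shm ~/.codex/logs_2.sqlite-wal

# Keep WAL files small (add to crontab). Do not delete -wal files directly:
# they can hold committed data that has not reached the main file yet.
# 0 0 * * 0 find ~/.codex -name "*.sqlite" -size +100M -exec sqlite3 {} "PRAGMA wal_checkpoint(TRUNCATE);" \;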
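
If codex cannot be stopped before taking the backup, SQLite's online backup API is a safer way to copy the database than copying the files, because it produces a consistent snapshot that includes content still sitting in the -wal file. The Python sketch below uses sqlite3.Connection.backup from the standard library on a scratch database; backup_db, demo, and the file names are hypothetical, as is the two-column logs schema.

```python
import sqlite3
import tempfile
from pathlib import Path


def backup_db(src_path: Path, dst_path: Path) -> int:
    """Copy a SQLite database with the online backup API; return rows in logs."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Copies a consistent snapshot, including pages still in the WAL.
        src.backup(dst)
        rows = dst.execute("SELECT COUNT(*) FROM logs").fetchone()[0]
    finally:
        src.close()
        dst.close()
    return rows


def demo() -> int:
    """Back up a scratch WAL-mode database while a writer connection is open."""
    workdir = Path(tempfile.mkdtemp())
    src_path = workdir / "logs_scratch.sqlite"  # hypothetical scratch file
    writer = sqlite3.connect(src_path)
    writer.execute("PRAGMA journal_mode=WAL")
    writer.execute("CREATE TABLE logs (id INTEGER PRIMARY KEY, message TEXT)")
    with writer:
        writer.executemany("INSERT INTO logs (message) VALUES (?)", [("entry",)] * 50)
    rows = backup_db(src_path, workdir / "logs_scratch.backup.sqlite")
    writer.close()
    return rows


print(f"backup contains {demo()} rows")
```

For the real database, src_path would be Path.home() / ".codex" / "logs_2.sqlite" and dst_path any location outside ~/.codex.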

Appendix A: Native Binary Details

Path:    vendor/x86_64-apple-darwin/bin/codex
Type:    Mach-O 64-bit executable x86_64
Min OS:  macOS 10.12
SDK:     15.5
Rustc:   1.95.0 (59807616e1fa2540724bfbac14d7976d7e4a3860)

Linked Frameworks

AppKit, CoreGraphics, IOKit, CoreFoundation, CoreServices,
SystemConfiguration, Foundation, Security,
libSystem.B.dylib, libobjc.A.dylib, libz.1.dylib, libiconv.2.dylib

Bundled Tools

vendor/x86_64-apple-darwin/codex-path/rg           ← ripgrep (Mach-O x86_64)
vendor/x86_64-apple-darwin/codex-resources/zsh/bin/zsh  ← bundled zsh (Mach-O x86_64)

Appendix B: macOS Diagnostic Report Summary

All three .diag files are Microstackshots triggered by Event: disk writes with the same write limit (2,147.48 MB). The consistent pattern across all three reports:

  1. All threads blocked in write() / pwrite() syscalls from libsystem_kernel.dylib
  2. Write paths converging on SQLite operations (sqlite3_exec, sqlite3_step, sqlite3_wal_checkpoint_v2)
  3. Rust async runtime (tokio::runtime::scheduler::multi_thread::worker::run) spawning multiple write-heavy tasks
  4. Process footprint growing by 42 to 103 MB over the session duration

Report generated on 2026-06-20
