| Title |
PHI De-identification Decision Circuit Analysis of Gemma-3-1B via Cross-layer Transcoder |
| Authors |
안재신(Jaesin Ahn) ; 배준현(Jun-Hyun Bae) ; 이제경(Jekyung Lee) ; 정희철(Heechul Jung) |
| Keywords |
PHI de-identification; Mechanistic interpretability; Large language model |
| Abstract |
Large language models deployed for clinical de-identification leak protected health information (PHI) despite explicit masking instructions, but the internal causes of these failures remain unexplored. We analyze the PHI masking decision circuit of Gemma-3-1B using per-head analysis, QK attribution, and cross-layer feature tracing. Across 100 documents and 1,014 PHI entities, we identify a TAG circuit that promotes PHI masking and a LEAK circuit that causes PHI leakage, each with a distinct composition, along with a second TAG mechanism driven by general linguistic features. We further find that the LEAK head acts as a copy head that reproduces both PHI and non-PHI text, so suppressing the LEAK circuit creates a trade-off between leak reduction and output fidelity. Causal interventions show that jointly manipulating the TAG and LEAK circuits reduces the leak rate by 10.09 percentage points while preserving output quality (J=0.3316), and that the causally effective components converge to a small subset of the pre-identified circuit. |