[2410.21514v1] Sabotage Evaluations for Frontier Models
For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, ...
arXiv:2410.21514v1 [cs.LG] 28 Oct 2024.
Evaluation task Oversight Human decision sabotage: Steer humans to bad decisions without appearing suspicious. Time ...