Aurelius is a decentralized protocol for surfacing and verifying alignment failures in large language models. It transforms adversarial prompts, model outputs, scoring artifacts, and interpretability data into structured, reproducible datasets, all without relying on centralized oversight.
Built on the Bittensor network, Aurelius incentivizes a peer-to-peer ecosystem of adversarial prompters (miners), independent auditors (validators), and a dynamic rules layer known as the Tribunate. Together, these agents generate alignment pressure through contestation, not consensus, creating artifacts that can be used to train, fine-tune, or audit models in a reproducible and interpretable way.
Modern AI systems often appear safe on the surface, but fail to reason honestly under pressure. Existing alignment methods rely heavily on centralized oversight, fixed reward models, and shallow behavioral signals, suppressing disagreement and failing to reveal model internals. This leads to alignment faking, brittle safety filters, and unverifiable outputs.
Aurelius challenges this paradigm by enabling any motivated agent to expose failure, verify it independently, and turn it into usable data, all while preserving reasoning, scoring methods, and provenance through cryptographic commitments.