Aurelius is a decentralized protocol for identifying and verifying alignment failures in large language models. It operates on the Bittensor network and is composed of three roles — miners, validators, and the Tribunate — each contributing to a continuous pipeline of adversarial testing, evaluation, and data refinement.
Rather than assuming a centralized authority can define alignment, Aurelius treats it as an evolving process. Misaligned behavior is surfaced, independently verified, and transformed into datasets for model training, auditing, and interpretability research.
Miners create prompts designed to elicit unsafe, biased, deceptive, or otherwise misaligned outputs from a target LLM. They run the prompt locally, collect the response, and apply automated scoring tools (e.g., toxicity or hallucination classifiers) to quantify alignment risk. Each miner submission includes:
Prompt and model response
Tool-based alignment scores
Optional reasoning or interpretability traces
A cryptographic hash to guarantee reproducibility
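The submission above can be sketched as a small data structure whose hash is computed over a canonical serialization, so any party can re-derive it bit for bit. This is a minimal illustration; the field names and hashing scheme are assumptions, not the protocol's actual schema.

```python
import hashlib
import json

def make_submission(prompt, response, scores, traces=None):
    """Assemble a miner submission. Field names are illustrative only."""
    payload = {
        "prompt": prompt,
        "response": response,
        "scores": scores,    # tool-based alignment scores, e.g. {"toxicity": 0.02}
        "traces": traces,    # optional reasoning / interpretability traces
    }
    # Hash a canonical JSON encoding (sorted keys, no whitespace) so the
    # same content always produces the same digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    payload["hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return payload

sub = make_submission(
    "adversarial prompt text",
    "model response text",
    {"toxicity": 0.02, "hallucination": 0.10},
)
```

Because the digest covers the prompt, response, and claimed scores together, tampering with any one field invalidates the hash.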
Validators act as independent auditors. They verify the miner’s scores, evaluate the signal quality, and label the data. Validators assess:
Whether the miner used the tools correctly
Whether the alignment violation is reproducible and meaningful
The overall value of the sample for inclusion in a dataset
Validators whose assessments agree with network consensus are rewarded both for catching false positives and for confirming valid submissions.
The Tribunate serves as the logic layer of the protocol. It defines the scoring rubric used by validators, selects approved alignment tools, and periodically updates evaluation rules. Over time, it will incorporate feedback from human experts across AI/ML fields. Its goal is to remain a human-guided, non-recursive governance body for the Subnet.
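A Tribunate rubric might look like the following: a versioned list of approved tools, per-tool weights, and an acceptance threshold that validators apply uniformly. Every name and value here is hypothetical; the real rubric is whatever the Tribunate publishes.

```python
# Hypothetical rubric; tool names, weights, and threshold are illustrative only.
RUBRIC = {
    "version": "2024.1",
    "approved_tools": ["toxicity", "hallucination", "deception"],
    "weights": {"toxicity": 0.4, "hallucination": 0.4, "deception": 0.2},
    "accept_threshold": 0.5,  # weighted score above which a sample enters the dataset
}

def weighted_score(scores, rubric=RUBRIC):
    """Combine per-tool scores under the rubric's weights (ignores unapproved tools)."""
    return sum(rubric["weights"].get(tool, 0.0) * s for tool, s in scores.items())
```

Publishing the rubric as versioned data, rather than baking it into validator code, lets evaluation rules evolve without redeploying every participant.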
Aurelius is designed to be:
Open: Anyone can participate if they meet performance standards.
Modular: Tools, scoring logic, and agent behaviors can evolve independently.
Verifiable: All outputs are reproducible and anchored by a cryptographic hash.
Contestable: Disagreement is expected and used to sharpen the alignment signal.
Non-recursive: The protocol does not rely on a central model to enforce its values.
Most alignment systems rely on static prompts, centralized scoring, and one-size-fits-all reward models.
Aurelius takes a different approach, one built on adversarial pressure, decentralized verification, and structured contestation.
Instead of suppressing misalignment, it reveals it and turns it into a measurable, actionable signal for researchers and model builders.