Aurelius is a decentralized protocol for identifying and verifying alignment failures in large language models. It operates on the Bittensor network and is composed of three roles — miners, validators, and the Tribunate — each contributing to a continuous pipeline of adversarial testing, evaluation, and data refinement.
Rather than assuming a centralized authority can define alignment, Aurelius treats it as an evolving process. Misaligned behavior is surfaced, independently verified, and transformed into datasets for model training, auditing, and interpretability research.
Miners create prompts designed to elicit unsafe, biased, deceptive, or otherwise misaligned outputs from a target LLM. They run the prompt locally, collect the response, and apply automated scoring tools (e.g., toxicity or hallucination classifiers) to quantify alignment risk. Each miner submission includes:
Prompt and model response
Tool-based alignment scores
Optional reasoning or interpretability traces
A cryptographic hash to guarantee reproducibility
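The submission above can be sketched as a small data structure whose hash is computed over a canonical serialization, so any party can re-derive it bit for bit. This is a minimal illustration; the field names and hashing scheme are assumptions, not the protocol's actual schema.

```python
import hashlib
import json

def make_submission(prompt, response, scores, traces=None):
    """Assemble a miner submission. Field names are illustrative only."""
    payload = {
        "prompt": prompt,
        "response": response,
        "scores": scores,    # tool-based alignment scores, e.g. {"toxicity": 0.02}
        "traces": traces,    # optional reasoning / interpretability traces
    }
    # Hash a canonical JSON encoding (sorted keys, no whitespace) so the
    # same content always produces the same digest.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    payload["hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return payload

sub = make_submission(
    "adversarial prompt text",
    "model response text",
    {"toxicity": 0.02, "hallucination": 0.10},
)
```

Because the digest covers the prompt, response, and claimed scores together, tampering with any one field invalidates the hash.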
Validators act as independent auditors. They verify the miner’s scores, evaluate the signal quality, and label the data. Validators assess:
Whether the miner used the tools correctly
Whether the alignment violation is reproducible and meaningful
The overall value of the sample for inclusion in a dataset
Validators whose assessments agree with network consensus are rewarded both for catching false positives and for confirming valid submissions.
The Tribunate serves as the logic layer of the protocol. It defines the scoring rubric used by validators, selects approved alignment tools, and periodically updates evaluation rules. Over time, it will incorporate feedback from human experts across AI/ML fields. Its goal is to remain a human-guided, non-recursive governance body for the Subnet.
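A Tribunate rubric might look like the following: a versioned list of approved tools, per-tool weights, and an acceptance threshold that validators apply uniformly. Every name and value here is hypothetical; the real rubric is whatever the Tribunate publishes.

```python
# Hypothetical rubric; tool names, weights, and threshold are illustrative only.
RUBRIC = {
    "version": "2024.1",
    "approved_tools": ["toxicity", "hallucination", "deception"],
    "weights": {"toxicity": 0.4, "hallucination": 0.4, "deception": 0.2},
    "accept_threshold": 0.5,  # weighted score above which a sample enters the dataset
}

def weighted_score(scores, rubric=RUBRIC):
    """Combine per-tool scores under the rubric's weights (ignores unapproved tools)."""
    return sum(rubric["weights"].get(tool, 0.0) * s for tool, s in scores.items())
```

Publishing the rubric as versioned data, rather than baking it into validator code, lets evaluation rules evolve without redeploying every participant.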
Aurelius is designed to be:
Open: Anyone can participate if they meet performance standards.
Modular: Tools, scoring logic, and agent behaviors can evolve independently.
Verifiable: All outputs are reproducible and anchored by a cryptographic hash.
Contestable: Disagreement is expected and used to sharpen the alignment signal.
Non-recursive: The protocol does not rely on a central model to enforce its values.
Most alignment systems rely on static prompts, centralized scoring, and one-size-fits-all reward models.
Aurelius takes a different approach, one built on adversarial pressure, decentralized verification, and structured contestation.
Instead of suppressing misalignment, it reveals it and turns it into a measurable, actionable signal for researchers and model builders.