Fallax
LLM Evaluation

Multi-step reasoning evaluation and benchmark suite for language models. Surfaces failure modes that single-turn benchmarks miss.

Multi-step Reasoning Tasks
Step-level Correctness Scoring
4+ Reasoning Domains
Python 3.10+

Evaluation Domains

Logical Deduction

Chained inference tasks requiring models to maintain and apply intermediate conclusions across multiple reasoning steps.
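
To make "chained inference" concrete, here is a minimal sketch of how a reference chain for such a task could be derived by forward chaining. The function `forward_chain`, the rule format, and the example facts are illustrative assumptions, not part of Fallax's API.

```python
def forward_chain(facts, rules):
    """Apply rules until no new conclusion appears; return derivations in order."""
    known = set(facts)
    derivation = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires only when every premise is already established.
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                derivation.append((tuple(premises), conclusion))
                changed = True
    return derivation


# Hypothetical three-step chain: each rule needs the previous conclusion.
rules = [
    (("rainy",), "wet_ground"),
    (("wet_ground", "cold"), "icy_ground"),
    (("icy_ground",), "slippery"),
]
steps = forward_chain({"rainy", "cold"}, rules)
for premises, conclusion in steps:
    print(" and ".join(premises), "->", conclusion)
```

A model that drops the intermediate conclusion `wet_ground` cannot reach `slippery`, which is the kind of failure a single final-answer check would report without locating.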

Mathematical Proof

Step-by-step proof tasks that score each derivation step independently, revealing where reasoning breaks down.
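
As an illustration of scoring each step independently, the sketch below compares a predicted derivation to a reference one position by position and reports the first step that diverges. The names `StepReport`, `normalize`, and `score_steps`, and the exact-match comparison, are assumptions made for this example; Fallax's actual scoring may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepReport:
    per_step: List[bool]         # correctness of each derivation step
    step_accuracy: float         # fraction of reference steps matched
    first_error: Optional[int]   # index of the first wrong step, if any


def normalize(step: str) -> str:
    # Hypothetical normalizer: ignore case and extra whitespace.
    return " ".join(step.lower().split())


def score_steps(predicted: List[str], reference: List[str]) -> StepReport:
    # Compare position by position; missing predicted steps count as wrong.
    per_step = [
        i < len(predicted) and normalize(predicted[i]) == normalize(reference[i])
        for i in range(len(reference))
    ]
    accuracy = sum(per_step) / len(per_step) if per_step else 0.0
    first_error = next((i for i, ok in enumerate(per_step) if not ok), None)
    return StepReport(per_step, accuracy, first_error)


reference = ["x + 2 = 5", "x = 5 - 2", "x = 3"]
predicted = ["x + 2 = 5", "x = 5 + 2", "x = 7"]
report = score_steps(predicted, reference)
print(report.per_step, report.first_error)  # [True, False, False] 1
```

A final-answer check would only say that `x = 7` is wrong; the per-step view shows the sign error was introduced at step index 1.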

Causal Inference

Counterfactual and interventional reasoning tasks designed to distinguish causal from correlational thinking.
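
The gap between causal and correlational answers can be shown with a tiny confounded model, sketched here under assumed names (`P_Z`, `x_given`, `y_given`). A hidden variable Z drives both X and Y, and X has no effect on Y, so conditioning on X and intervening on X give different answers.

```python
from fractions import Fraction

# Hypothetical structural model: Z is a fair coin that causes both X and Y.
P_Z = {0: Fraction(1, 2), 1: Fraction(1, 2)}


def x_given(z):
    return z  # X copies Z


def y_given(z, x):
    return z  # Y depends on Z only; X has no causal effect on Y


def p_y1_given_x1():
    # Observational: restrict to worlds where X happened to equal 1.
    num = sum(p for z, p in P_Z.items()
              if x_given(z) == 1 and y_given(z, x_given(z)) == 1)
    den = sum(p for z, p in P_Z.items() if x_given(z) == 1)
    return num / den


def p_y1_do_x1():
    # Interventional: force X = 1 in every world, leaving Z untouched.
    return sum(p for z, p in P_Z.items() if y_given(z, 1) == 1)


print(p_y1_given_x1())  # 1
print(p_y1_do_x1())     # 1/2
```

A model reasoning correlationally would answer 1 to the interventional question; the causal answer is 1/2.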

Compositional Planning

Tasks requiring multi-stage planning where later steps depend on earlier choices, exposing lookahead failures.
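
A lookahead failure can be illustrated with a two-stage decision in which the locally best first move leads to a worse total. The decision tree `TREE` and both planners below are hypothetical examples, not tasks taken from Fallax.

```python
# state -> {action: (immediate reward, next state or None when finished)}
TREE = {
    "start": {"A": (5, "after_A"), "B": (3, "after_B")},
    "after_A": {"C": (1, None)},
    "after_B": {"D": (10, None)},
}


def greedy_plan(state="start"):
    """Pick the highest immediate reward at every stage (no lookahead)."""
    total, plan = 0, []
    while state is not None:
        action, (reward, nxt) = max(TREE[state].items(), key=lambda kv: kv[1][0])
        total += reward
        plan.append(action)
        state = nxt
    return total, plan


def best_plan(state="start"):
    """Search every branch and return the plan with the highest total reward."""
    if state is None:
        return 0, []
    options = []
    for action, (reward, nxt) in TREE[state].items():
        sub_total, sub_plan = best_plan(nxt)
        options.append((reward + sub_total, [action] + sub_plan))
    return max(options, key=lambda option: option[0])


print(greedy_plan())  # (6, ['A', 'C'])
print(best_plan())    # (13, ['B', 'D'])
```

The greedy planner commits to `A` for its higher immediate reward and ends at 6, while full lookahead accepts a smaller first reward to reach 13.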

Quick Start

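# Shell: run from the repository root to install the package in editable mode
pip install -e .

# Python: load a benchmark, score a model on it, and print step-level accuracy
from fallax import Benchmark, Scorer

bench = Benchmark.load("logical_deduction")  # the Logical Deduction domain
scorer = Scorer(model="gpt-4o")              # model to evaluate
results = scorer.score(bench)
print(results.step_accuracy)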