LLM Evaluation
Fallax
Multi-step reasoning evaluation and benchmark suite for language models. Surfaces failure modes that single-turn benchmarks miss.
Multi-step Reasoning Tasks
Step-level Correctness Scoring
4+ Reasoning Domains
Python 3.10+
Evaluation Domains
Logical Deduction
Chained inference tasks requiring models to maintain and apply intermediate conclusions across multiple reasoning steps.
Mathematical Proof
Step-by-step proof tasks that score each derivation step independently, revealing where reasoning breaks down.
Causal Inference
Counterfactual and interventional reasoning tasks designed to distinguish causal from correlational thinking.
Compositional Planning
Tasks requiring multi-stage planning in which later steps depend on earlier choices, exposing lookahead failures.
Quick Start
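# Shell: install in editable mode from the repository root
pip install -e .

# Python: load one domain, score it, and print step-level accuracy
from fallax import Benchmark, Scorer

bench = Benchmark.load("logical_deduction")
scorer = Scorer(model="gpt-4o")
results = scorer.score(bench)
print(results.step_accuracy)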