LLM Evaluation

Fallax

Multi-step reasoning evaluation and benchmark suite for language models. Surfaces failure modes that single-turn benchmarks miss.


Multi-step Reasoning Tasks

Step-level Correctness Scoring

Reasoning Domains

Python 3.10+

Evaluation Domains

Logical Deduction

Chained inference tasks requiring models to maintain and apply intermediate conclusions across multiple reasoning steps.
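As a rough illustration of what "chained inference" means here, the sketch below derives conclusions by forward chaining over a small rule set. The rule format and the `forward_chain` helper are hypothetical and are not part of the Fallax API; they only show why a later conclusion (`E`) is unreachable unless the intermediate ones (`B`, `D`) are carried forward.

```python
def forward_chain(facts, rules):
    """Return newly derived conclusions, in the order they were derived.

    `rules` is a list of (premises, conclusion) pairs (hypothetical format).
    """
    known = set(facts)
    derived = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires only when every premise is already known.
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                derived.append(conclusion)
                changed = True
    return derived


rules = [
    (("A",), "B"),       # step 1: A implies B
    (("B", "C"), "D"),   # step 2: needs the intermediate conclusion B
    (("D",), "E"),       # step 3: needs the intermediate conclusion D
]
chain = forward_chain({"A", "C"}, rules)
print(chain)  # ['B', 'D', 'E']
```

A model that drops `B` after the first step cannot justify `D` or `E`, which is the kind of failure a multi-step task is meant to expose.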

Mathematical Proof

Step-by-step proof tasks that score each derivation step independently, revealing where reasoning breaks down.
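One way to picture step-level scoring is the minimal sketch below. It is not Fallax's scorer: the `score_steps`, `step_accuracy`, and `first_failure` helpers are hypothetical, and exact string matching stands in for whatever step-checking the real suite performs.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    index: int
    correct: bool


def score_steps(predicted, reference):
    """Score each reference step independently (exact match is an assumption)."""
    results = []
    for i, ref in enumerate(reference):
        pred = predicted[i] if i < len(predicted) else None
        ok = pred is not None and pred.strip() == ref.strip()
        results.append(StepResult(index=i, correct=ok))
    return results


def step_accuracy(results):
    """Fraction of steps marked correct; 0.0 for an empty derivation."""
    return sum(r.correct for r in results) / len(results) if results else 0.0


def first_failure(results):
    """Index of the first incorrect step, or None if every step is correct."""
    return next((r.index for r in results if not r.correct), None)


reference = ["x + 2 = 5", "x = 5 - 2", "x = 3"]
predicted = ["x + 2 = 5", "x = 5 + 2", "x = 7"]
results = score_steps(predicted, reference)
print(first_failure(results))  # 1
```

A final-answer-only metric would record this derivation as simply wrong; the per-step view shows that the error enters at step 1 and propagates from there.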

Causal Inference

Counterfactual and interventional reasoning tasks designed to distinguish causal from correlational thinking.
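The gap between correlational and causal answers can be shown with a toy structural model, sketched below. The model is an assumption chosen for illustration and is not taken from the Fallax task set: a hidden variable `Z` drives both `X` and `Y`, while `X` has no effect on `Y`.

```python
# Toy model (hypothetical): Z -> X and Z -> Y, with no edge from X to Y.
P_Z = {0: 0.5, 1: 0.5}


def x_from(z):
    return z  # X copies the confounder


def y_from(z, x):
    return z  # Y copies the confounder and ignores X


def p_y1_given_x1():
    """Observational P(Y=1 | X=1): condition on worlds where X happens to be 1."""
    num = sum(p for z, p in P_Z.items()
              if x_from(z) == 1 and y_from(z, x_from(z)) == 1)
    den = sum(p for z, p in P_Z.items() if x_from(z) == 1)
    return num / den


def p_y1_do_x1():
    """Interventional P(Y=1 | do(X=1)): force X to 1, leave Z untouched."""
    return sum(p for z, p in P_Z.items() if y_from(z, 1) == 1)


observational = p_y1_given_x1()
interventional = p_y1_do_x1()
print(observational, interventional)  # 1.0 0.5
```

Observing `X = 1` perfectly predicts `Y = 1`, yet setting `X = 1` changes nothing about `Y`. A model that answers the interventional question with 1.0 is reasoning from correlation.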

Compositional Planning

Tasks requiring multi-stage planning where later steps depend on earlier choices, exposing lookahead failures.
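A lookahead failure can be sketched with a two-stage decision problem. The reward table and both planners below are hypothetical and are not drawn from Fallax; they contrast a greedy planner, which commits to the best immediate reward, with an exhaustive one that accounts for what each early choice makes available later.

```python
# Hypothetical two-stage task: rewards for moving from a state to a next state.
REWARDS = {
    "start": {"a": 5, "b": 1},
    "a": {"a1": 0, "a2": 1},
    "b": {"b1": 10, "b2": 2},
}


def greedy_plan(state):
    """Pick the highest immediate reward at each step (no lookahead)."""
    plan, total = [], 0
    while state in REWARDS:
        nxt = max(REWARDS[state], key=REWARDS[state].get)
        total += REWARDS[state][nxt]
        plan.append(nxt)
        state = nxt
    return plan, total


def best_plan(state):
    """Exhaustively search all continuations and return the best total reward."""
    if state not in REWARDS:
        return [], 0
    best = None
    for nxt, reward in REWARDS[state].items():
        sub_plan, sub_total = best_plan(nxt)
        candidate = ([nxt] + sub_plan, reward + sub_total)
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best


greedy = greedy_plan("start")
optimal = best_plan("start")
print(greedy)   # (['a', 'a2'], 6)
print(optimal)  # (['b', 'b1'], 11)
```

The greedy plan takes the larger first reward and is then locked out of the high-value branch, which is the pattern a compositional planning task is designed to surface.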

Quick Start

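Install in editable mode from the root of a local clone of the repository:

pip install -e .

Then load a benchmark and score it:

from fallax import Benchmark, Scorer

# Load the tasks for one evaluation domain.
bench = Benchmark.load("logical_deduction")
scorer = Scorer(model="gpt-4o")
results = scorer.score(bench)
# Report accuracy at the level of individual reasoning steps.
print(results.step_accuracy)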