Document Type
Restricted
Advisor
Gary Parker
Publication Date
2026
Abstract
NetHack is among the most demanding testbeds in modern reinforcement-learning research: a partially observed, procedurally generated environment with a 121-way discrete action space, a sparse and heavy-tailed reward signal, and a state space so large that any two episodes share almost no states. We present a curriculum-and-consensus approach to the problem. We pretrain thirteen recurrent actor-critic experts, each on a handcrafted difficulty ladder of MiniHack tasks that isolates a single competence (navigation, melee combat, item use, exploration, multi-room traversal, staircase descent, prayer, and so on), and we propose a Hierarchical Options Mixture-of-Experts (HO-MoE) consensus model that composes a subset of the trained experts into a single policy on the full NetHackScore-v0 environment. The consensus fuses K = 8 frozen experts through a learned option router, a set of per-expert feature adapters, and a state-dependent mixing coefficient λ_t that trades off an independently trained consensus head against the router-weighted expert mixture.
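As a reading aid, the fusion described above can be sketched as a convex combination of two action distributions. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not say whether mixing happens in probability or logit space, nor which side the mixing coefficient weights, so the sketch assumes probability-space mixing with the coefficient (lam in the code) weighting the consensus head. The names softmax and mix_policies are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_policies(head_probs, expert_probs, router_logits, lam):
    """Blend a consensus-head policy with a router-weighted expert mixture.

    head_probs:    action distribution from the consensus head
    expert_probs:  one action distribution per frozen expert
    router_logits: one unnormalized router score per expert
    lam:           mixing coefficient in [0, 1]; ASSUMPTION: lam weights
                   the head and (1 - lam) weights the expert mixture
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(expert_probs) != len(router_logits):
        raise ValueError("need one router logit per expert")

    # Router posterior over experts.
    weights = softmax(router_logits)

    # Router-weighted mixture of the expert action distributions.
    n_actions = len(head_probs)
    mixture = [
        sum(w * probs[a] for w, probs in zip(weights, expert_probs))
        for a in range(n_actions)
    ]

    # Convex combination of head and mixture (ASSUMPTION: probability space).
    return [lam * h + (1.0 - lam) * m for h, m in zip(head_probs, mixture)]
```

Because both inputs are probability distributions and the weights are convex, the blended output is itself a valid distribution, whatever the value of the mixing coefficient.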
We identify catastrophic forgetting as the dominant failure mode of a naive sequential curriculum, and we introduce a scheduler that mixes previously mastered levels into the live training distribution with a tunable review probability, demotes the frontier when a mastered level regresses, and advances on a bounded-time fallback when mastery stalls. We treat training throughout as a synchronous on-policy problem and learn both experts and consensus with Proximal Policy Optimization (Schulman et al., 2017) and Generalized Advantage Estimation (Schulman et al., 2016), matching the recent PPO-based NLE baselines (Hambro et al., 2022b; Petrenko et al., 2020). Our consensus loss combines the standard policy-gradient, value, and action-entropy terms of PPO with a router-z stabilizer (Zoph et al., 2022) and an option-stickiness loss that pulls the router toward temporal commitment. We report expert success rates per curriculum level, consensus reward trajectories on NetHackScore-v0, and a post-training analysis of the option-router posterior and the state-dependent mixing coefficient on the final trained model.
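The three scheduler behaviors named above (review sampling, demotion on regression, and a bounded-time fallback) can be sketched roughly as follows. This is an illustrative sketch only: the class name ReviewScheduler, the thresholds, the uniform choice of review level, and the rule that demotion resets the frontier to the regressed level are all assumptions, since the abstract does not specify them.

```python
import random


class ReviewScheduler:
    """Hypothetical curriculum scheduler with review, demotion, and fallback."""

    def __init__(self, num_levels, review_prob=0.2, mastery=0.8,
                 regress=0.6, max_evals=1000, seed=0):
        self.num_levels = num_levels
        self.review_prob = review_prob  # chance of replaying a mastered level
        self.mastery = mastery          # success rate needed to advance (assumed)
        self.regress = regress          # success rate that triggers demotion (assumed)
        self.max_evals = max_evals      # evaluations allowed before forced advance
        self.frontier = 0               # index of the level currently being learned
        self.evals_at_frontier = 0
        self.rng = random.Random(seed)

    def sample_level(self):
        """Pick the next training level: usually the frontier, sometimes a review."""
        if self.frontier > 0 and self.rng.random() < self.review_prob:
            # ASSUMPTION: review levels are drawn uniformly from mastered ones.
            return self.rng.randrange(self.frontier)
        return self.frontier

    def update(self, level, success_rate):
        """Report a measured success rate for a level and adjust the frontier."""
        if level < self.frontier:
            # A previously mastered level regressed: demote the frontier to it.
            if success_rate < self.regress:
                self.frontier = level
                self.evals_at_frontier = 0
        elif level == self.frontier:
            self.evals_at_frontier += 1
            mastered = success_rate >= self.mastery
            timed_out = self.evals_at_frontier >= self.max_evals
            # Advance on mastery, or on the bounded-time fallback when stalled.
            if (mastered or timed_out) and self.frontier < self.num_levels - 1:
                self.frontier += 1
                self.evals_at_frontier = 0
```

A training loop would call sample_level() to choose each rollout's task and update() after each evaluation window. Setting the regression threshold below the mastery threshold, as in the defaults here, gives a hysteresis band that keeps the frontier from oscillating on noisy success estimates.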
Recommended Citation
Nash, Jay B., "Curriculum Learning for NetHack via a Hierarchical Mixture-of-Experts Consensus" (2026). Computer Science Honors Papers. 15.
https://digitalcommons.conncoll.edu/comscihp/15
The views expressed in this paper are solely those of the author.
Comments
This paper is restricted to the Connecticut College campus until December 8, 2026.