Document Type

Restricted

Advisor

Gary Parker

Publication Date

2026

Comments

This paper is restricted to the Connecticut College campus until December 8, 2026.

Abstract

NetHack is among the most demanding testbeds in modern reinforcement-learning research: a partially observed, procedurally generated environment with a 121-way discrete action space, a sparse and heavy-tailed reward signal, and a state space so large that any two episodes have near-zero overlap. We present a curriculum-and-consensus approach to the problem. We pretrain thirteen recurrent actor-critic experts, each on a handcrafted difficulty ladder of MiniHack tasks that isolates a single competence (navigation, melee combat, item use, exploration, multi-room traversal, staircase descent, prayer, and so on), and we propose a Hierarchical Options Mixture-of-Experts (HO-MoE) consensus model that composes a subset of the trained experts into a single policy on the full NetHackScore-v0 environment. The consensus fuses K = 8 frozen experts through a learned option router, a set of per-expert feature adapters, and a state-dependent mixing coefficient λ_t that trades off an independently trained consensus head against the router-weighted expert mixture.
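The abstract does not give the fusion formula, so the following is only a minimal sketch of one plausible reading: the final action distribution is a convex combination of the consensus head's distribution and the router-weighted average of the expert distributions. The function name `mix_policies`, the softmax over router scores, and the convention that the coefficient weights the head (rather than the mixture) are all assumptions made for illustration, not the paper's stated method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_policies(head_probs, expert_probs, router_logits, lam):
    """Blend a consensus-head distribution with a router-weighted expert mixture.

    head_probs:    action probabilities from the consensus head
    expert_probs:  list of K action distributions, one per frozen expert
    router_logits: K unnormalized router scores (hypothetical router output)
    lam:           mixing coefficient in [0, 1]; ASSUMED here to weight the head
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    weights = softmax(router_logits)
    n_actions = len(head_probs)
    # Router-weighted average of the expert action distributions.
    mixture = [
        sum(w * probs[a] for w, probs in zip(weights, expert_probs))
        for a in range(n_actions)
    ]
    # Convex combination of head and mixture; the result is still a distribution.
    return [lam * head_probs[a] + (1.0 - lam) * mixture[a] for a in range(n_actions)]
```

Because both inputs are probability distributions and the combination is convex, the output sums to one for any value of the coefficient in [0, 1]; a coefficient of 0 under this convention recovers the pure expert mixture.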

We identify catastrophic forgetting as the dominant failure mode of a naive sequential curriculum, and we introduce a scheduler that mixes previously mastered levels into the live training distribution with a tunable review probability, demotes the frontier when a mastered level regresses, and advances on a bounded-time fallback when mastery stalls. We treat training throughout as a synchronous on-policy problem and learn both experts and consensus with Proximal Policy Optimization (Schulman et al., 2017) and Generalized Advantage Estimation (Schulman et al., 2016), matching the recent PPO-based NLE baselines (Hambro et al., 2022b; Petrenko et al., 2020). Our consensus loss combines the standard policy-gradient, value, and action-entropy terms of PPO with a router-z stabilizer (Zoph et al., 2022) and an option-stickiness loss that pulls the router toward temporal commitment. We report expert success rates per curriculum level, consensus reward trajectories on NetHackScore-v0, and a post-training analysis of the option-router posterior and the state-dependent mixing coefficient on the final trained model.
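The three scheduler behaviors named above (review sampling, frontier demotion, and a bounded-time fallback) can be sketched as follows. This is an illustrative reconstruction from the abstract alone: the class name `ReviewScheduler`, the threshold values, the uniform choice among mastered levels, and the use of update counts as the time bound are all assumptions, not details taken from the paper.

```python
import random


class ReviewScheduler:
    """Curriculum scheduler with review sampling, demotion, and a timed fallback.

    All thresholds below are hypothetical defaults chosen for illustration.
    """

    def __init__(self, n_levels, review_prob=0.2, mastery=0.8, regress=0.6,
                 max_steps_per_level=1000, rng=None):
        self.n_levels = n_levels
        self.review_prob = review_prob      # chance of replaying a mastered level
        self.mastery = mastery              # success rate needed to advance
        self.regress = regress              # success rate below which we demote
        self.max_steps = max_steps_per_level
        self.rng = rng or random.Random()
        self.frontier = 0                   # hardest level currently in training
        self.steps_at_frontier = 0          # frontier updates since last move

    def sample_level(self):
        """Pick the level for the next rollout."""
        if self.frontier > 0 and self.rng.random() < self.review_prob:
            # Review: replay one previously mastered level, chosen uniformly.
            return self.rng.randrange(self.frontier)
        return self.frontier

    def update(self, level, success_rate):
        """Report a measured success rate for a level and move the frontier."""
        if level != self.frontier:
            # Demote the frontier if a mastered level has regressed.
            if level < self.frontier and success_rate < self.regress:
                self.frontier = level
                self.steps_at_frontier = 0
            return
        self.steps_at_frontier += 1
        stalled = self.steps_at_frontier >= self.max_steps
        # Advance on mastery, or on the bounded-time fallback when stalled.
        if (success_rate >= self.mastery or stalled) and self.frontier < self.n_levels - 1:
            self.frontier += 1
            self.steps_at_frontier = 0
```

A training loop would call `sample_level()` before each rollout batch and `update(level, success_rate)` after evaluating it; raising `review_prob` spends more of the training budget on retention at the expense of progress on the frontier.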


The views expressed in this paper are solely those of the author.