
Kuaishou Open-Sources GoLongRL, Overcoming Artificial Intelligence Bottlenecks in Long-Context Reinforcement Learning


Kuaishou Technology’s large language model team, in collaboration with the University of Chinese Academy of Sciences, has open-sourced GoLongRL, a comprehensive post-training framework designed to address significant performance degradation in artificial intelligence models processing exceptionally long text sequences.

Current long-context reinforcement learning techniques suffer from highly homogeneous training data that focuses almost entirely on locating specific data points within long texts. This narrow approach leaves models unequipped to handle complex structural text tasks like sorting, abstractive summarization, or multi-hop logical reasoning. To address this limitation, the Chinese research team released a fully open-source system that includes a high-utility dataset of nearly 23,000 heterogeneous samples, complete training source code, and a specialized machine learning optimization algorithm.

The dataset features 22,965 precisely cataloged samples structured across nine distinct core task types to comprehensively train long-context comprehension. Rather than relying on synthetic templated inputs, which often teach models to rely on superficial paragraph boundaries, the dataset prioritizes genuine source materials, including literature from Project Gutenberg, academic preprints, legal documents, and corporate financial filings. For domains lacking labeled data, the pipeline synthesizes only the question-and-answer pairs based on the raw inputs, leaving the source texts themselves untouched.

To maximize learning efficiency across varied tasks like sorting, data extraction, and summary generation, the researchers abandoned traditional single-metric reward functions in favor of task-specific evaluation scripts. Since text summaries depend on semantic overlap while ordering sequences depends on ranking coefficients, the system assigns distinct evaluation criteria tailored to each task’s structural objective.
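The idea of routing each task type to its own scoring function can be sketched as follows. This is a minimal illustration, not GoLongRL's actual evaluation code: the task names, the `compute_reward` dispatcher, and the choice of token-overlap F1 (a crude stand-in for semantic overlap) and Kendall's tau (one common ranking coefficient) are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 in [0, 1]; a crude stand-in for a semantic-overlap metric."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def kendall_tau(predicted: list, reference: list) -> float:
    """Kendall rank correlation in [-1, 1] between two orderings of the same unique items."""
    if len(predicted) != len(reference) or set(predicted) != set(reference):
        return -1.0  # a malformed ordering gets the worst score
    if len(reference) < 2:
        return 1.0
    position = {item: i for i, item in enumerate(predicted)}
    concordant = discordant = 0
    for a, b in combinations(reference, 2):  # a precedes b in the reference
        if position[a] < position[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


# Hypothetical task names; the real dataset defines nine task types.
REWARD_FUNCTIONS = {
    "summarization": token_f1,
    "sorting": kendall_tau,
    "extraction": exact_match,
}


def compute_reward(task_type: str, prediction, reference) -> float:
    """Dispatch to the scoring function registered for this task type."""
    return REWARD_FUNCTIONS[task_type](prediction, reference)
```

Note that the three functions return values on different ranges ([0, 1] for the overlap and match scores, [-1, 1] for the ranking score), which is the kind of scale mismatch a mixed-task training run has to reconcile.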

Managing these varied reward functions introduced complex differences in numerical scale across standard training runs. To stabilize the optimization process, the researchers developed an algorithm named TMN-Reweight. This mathematical framework decouples numerical reward scaling from task difficulty corrections, preventing high-variance signals from disrupting model training.
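The article does not give the TMN-Reweight formulas, so the following is only a generic illustration of what decoupling scale from difficulty can look like, not the published algorithm. It assumes a two-step scheme: rewards are first standardized within each task type so that every task contributes on a comparable scale, and only then is a separate difficulty weight applied. The sample fields (`task`, `reward`, `difficulty`), the `alpha` parameter, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_per_task(samples, eps=1e-6):
    """Step 1 (scale): z-score each reward against its own task type's mean and std."""
    by_task = defaultdict(list)
    for s in samples:
        by_task[s["task"]].append(s["reward"])
    stats = {task: (mean(r), pstdev(r)) for task, r in by_task.items()}
    normalized = []
    for s in samples:
        mu, sigma = stats[s["task"]]
        normalized.append((s["reward"] - mu) / (sigma + eps))
    return normalized


def reweight(samples, alpha=0.5):
    """Step 2 (difficulty): scale the normalized signal by a per-sample difficulty weight.

    'difficulty' is a hypothetical value in [0, 1], e.g. one minus an observed success rate.
    Because it is applied after normalization, it never interacts with the raw reward scale.
    """
    normalized = normalize_per_task(samples)
    return [n * (1.0 + alpha * s["difficulty"]) for n, s in zip(normalized, samples)]
```

For example, a sorting task scored in [-1, 1] and a task scored in [-100, 100] end up with comparable magnitudes after step 1, so the difficulty weight in step 2 affects both in the same way.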

The framework demonstrated quick efficiency positive factors throughout empirical analysis. When utilized to a small four-billion-parameter base, the info and algorithm configuration surpassed specialised competing long-context fashions by a notable margin.

Scaling the framework up to a larger thirty-billion-parameter architecture yielded even more substantial results. The resulting model achieved a top score on the evaluation metric, outperforming leading foundation models including DeepSeek-R1, Alibaba’s large-scale Qwen reasoning models, and Google’s Gemini Flash.

Importantly, the reinforcement learning process did not trigger negative capability transfer with respect to general analytical reasoning. The models demonstrated minor, steady improvements on mainstream intelligence benchmarks while showing strong capability transfer into entirely novel fields, such as agentic memory and recall across multi-turn conversations.

The framework also displayed significant sequence length extrapolation capabilities. Although the model was trained with a hard maximum limit of 160,000 text tokens, its core synthesis and data retrieval capabilities successfully generalized to inputs containing up to one million text tokens, suggesting that the learned processing strategies are length-agnostic.