This blog walks through an end-to-end alignment stack for an open-ended dialogue model: from low-level training stability in Mixture-of-Experts (MoE) to FP8 serving, supervised fine-tuning (SFT) on user chats, and finally reinforcement learning (RL) driven by engagement and retention signals.
Feel free to connect with us on LinkedIn for discussions about RL, AI assistants, and ML infrastructure!
This blog covers:
- training stability for Mixture-of-Experts (MoE) models,
- FP8 training and serving,
- supervised fine-tuning (SFT) and distillation on user chats, and
- reinforcement learning (RL) driven by engagement and retention signals.
When using expert parallelism in a MoE architecture, some parameters are replicated within an expert parallel group. This introduces a subtle issue: their gradients can diverge if not explicitly synchronized.
To fix this, explicit gradient reduction hooks are added in the backward pass of the MoE layer. These hooks ensure that for parameters replicated across the expert parallel group, gradients are synced (e.g., via an all-reduce) every backward step.
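As a rough illustration, here is a minimal sketch of such a hook in PyTorch, assuming a torch.distributed process group for the expert-parallel ranks; the function name and the way replicated parameters are collected are illustrative, not the exact production code.

```python
import torch.distributed as dist

def add_replicated_grad_sync_hooks(replicated_params, ep_group):
    """Keep gradients of parameters that are replicated across the
    expert-parallel group in sync by averaging them every backward pass."""
    world_size = dist.get_world_size(group=ep_group)

    def make_hook():
        def hook(grad):
            # Average this parameter's gradient over the expert-parallel ranks
            # so every replica applies the same update.
            grad = grad / world_size
            dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=ep_group)
            return grad
        return hook

    for p in replicated_params:
        if p.requires_grad:
            p.register_hook(make_hook())
```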
Internal experiments compare training curves with proper gradient synchronization (for positional embeddings and parameters replicated within the expert parallel group) against runs without it. Without gradient synchronization, the replicated copies drift apart and training becomes unstable.
This simple but critical fix—adding backward hooks for replicated parameters—was necessary to make MoE training stable.
A major practical issue is that a pure 16-bit (BF16/FP16) training and serving stack can be extremely expensive at inference time.
In contrast, an FP8 version of the model can fit on a single node, dramatically improving serving efficiency.
Exporting fine-tuned checkpoints to FP8 can be done using a block-wise quantization algorithm [1, 2].
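As an illustration, here is a minimal sketch of block-wise weight quantization to FP8, assuming a 2-D weight matrix, the float8_e4m3fn format, and a 128×128 block size; the exact block shape and format used in [1, 2] may differ.

```python
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude of float8_e4m3fn

def quantize_weight_blockwise(w: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight matrix to FP8 with one scale per (block x block)
    tile, returning the FP8 tensor plus the per-block scales needed to
    dequantize at serving time."""
    rows, cols = w.shape
    n_row_blocks = (rows + block - 1) // block
    n_col_blocks = (cols + block - 1) // block
    scales = torch.empty(n_row_blocks, n_col_blocks, dtype=torch.float32)
    w_fp8 = torch.empty_like(w, dtype=torch.float8_e4m3fn)

    for i in range(n_row_blocks):
        for j in range(n_col_blocks):
            rs, cs = i * block, j * block
            tile = w[rs:rs + block, cs:cs + block].float()
            # One scale per tile: map the tile's max magnitude to the FP8 range.
            scale = tile.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
            scales[i, j] = scale
            w_fp8[rs:rs + block, cs:cs + block] = (tile / scale).to(torch.float8_e4m3fn)
    return w_fp8, scales
```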
A key observation is that quantizing only at the end (post-training) can noticeably degrade quality relative to the BF16 checkpoint.
To address this, FP8 mixed-precision preserving training can be implemented, following the strategy described in the DeepSeek-V3 Technical Report [3], as shown in the figure. The core idea is to train with FP8 in the loop instead of quantizing only at the end. The key setup in mixed-precision AdamW is: master weights in FP32, with forward/backward passes and gradients in BF16, whereas native PyTorch Adam would keep optimizer states in the same precision as the parameter dtype, so the master weights and optimizer states must be handled explicitly.
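A minimal sketch of this master-weight setup is shown below, assuming BF16 model parameters; the helper names are illustrative, and the FP8 GEMMs would plug into the forward/backward pass separately.

```python
import torch

def make_master_copies(model):
    """Pair each trainable (BF16) parameter with an FP32 master copy. AdamW is
    run on the masters, so its states stay in FP32 even though forward and
    backward run in BF16."""
    pairs = []
    for p in model.parameters():
        if p.requires_grad:
            master = p.detach().clone().float().requires_grad_(True)
            pairs.append((master, p))
    return pairs

def step(optimizer, pairs):
    """Copy BF16 grads into the FP32 masters, update, and write weights back."""
    for master, p in pairs:
        master.grad = None if p.grad is None else p.grad.float()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    with torch.no_grad():
        for master, p in pairs:
            p.copy_(master.to(p.dtype))  # refresh BF16 weights from FP32 masters
            p.grad = None
```

Usage would be, for example, `pairs = make_master_copies(model)` and `opt = torch.optim.AdamW([m for m, _ in pairs], lr=1e-5)`, then calling `step(opt, pairs)` after each backward pass.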

To support FP8 training efficiently, an FP8 GEMM library [4] can be used.
However, this library only accelerates the matrix multiplication kernels themselves.
A full FP8 linear layer does more than just a GEMM: the activations and weights must be quantized to FP8 with their scaling factors, and the output must be dequantized/rescaled afterwards.
In a naive implementation, where these steps run as separate ops, the FP8 linear layer can be roughly 10× slower than the native PyTorch bf16 Linear layer. To fix this, a custom fused FP8 linear layer can be implemented, combining quantization, dequantization, and matrix multiplication more tightly so FP8 training and serving become practical, not just numerically feasible.
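To make the overhead concrete, here is a reference (non-fused) FP8 linear forward. It is a simulation sketch: the matmul is done in BF16 after casting, standing in for a real FP8 GEMM kernel such as [4], and per-tensor scaling is used for brevity. Every call pays for separate quantize, matmul, and rescale steps, which is exactly what a fused kernel avoids.

```python
import torch

FP8_MAX = 448.0  # float8_e4m3fn max magnitude

def quantize_fp8(t: torch.Tensor):
    """Per-tensor symmetric quantization to FP8 (one of the extra steps a
    naive FP8 linear pays for on every call)."""
    scale = t.abs().max().clamp(min=1e-12) / FP8_MAX
    return (t / scale).to(torch.float8_e4m3fn), scale

def naive_fp8_linear(x: torch.Tensor, weight: torch.Tensor, bias=None):
    """Quantize -> matmul -> dequantize as separate, unfused ops."""
    x_fp8, sx = quantize_fp8(x)
    w_fp8, sw = quantize_fp8(weight)
    # Stand-in for the FP8 GEMM: upcast and multiply, then re-apply the scales.
    y = (x_fp8.to(torch.bfloat16) @ w_fp8.to(torch.bfloat16).t()) * (sx * sw)
    if bias is not None:
        y = y + bias
    return y.to(x.dtype)
```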
One concrete use case in the alignment stack is supervised fine-tuning (SFT) on user chats.
One key interaction pattern in dialogue systems is swipes or votes, similar to the regenerate-and-vote interactions seen in ChatGPT or Gemini. Each swipe/vote yields a preference between candidate responses, which enables DPO training.
This creates a highly engaged user base, which in turn generates a huge amount of swipe/preference data—on the order of hundreds of billions of tokens of swipe/vote interactions per day. With some quality filtering, this data can be used directly for training.
This acts like user-guided rejection sampling: the user implicitly samples multiple candidate responses by swiping, then keeps the one they prefer.
Rejection sampling is known to improve model performance across many tasks; here, user swipes/votes implement a kind of implicit rejection sampling at scale. The “choose after swipe/vote” event is also an important reward signal that can be reused in RL.
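As a sketch of how swipe/vote logs can be turned into DPO preference pairs, assuming a simplified, hypothetical log schema (the prompt, the candidates shown while swiping, and the response the user finally chose):

```python
from dataclasses import dataclass

@dataclass
class SwipeEvent:
    # Hypothetical log schema: prompt context, all candidates the user swiped
    # through, and the one they finally chose to continue with.
    prompt: str
    candidates: list[str]
    chosen: str

def build_dpo_pairs(events: list[SwipeEvent]):
    """Turn swipe/choose logs into (prompt, chosen, rejected) preference pairs:
    the selected response is preferred over every candidate swiped away."""
    pairs = []
    for e in events:
        for cand in e.candidates:
            if cand != e.chosen:
                pairs.append({"prompt": e.prompt, "chosen": e.chosen, "rejected": cand})
    return pairs
```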
Another major use case for user chat data is distillation:
When using user chat data for this distillation, key findings include:
In addition to SFT, user data can be leveraged for PPO-style RL training.
The signals include:
- User-selected data
- User preference data
- Synthetic preference data
However, user preference data is inherently noisy, so substantial filtering is needed before using it.
Some of the filtering strategies used include:
- Limiting individual user influence: avoid letting a small set of highly active users dominate the training signal.
- Filtering over-repeated conversations: avoid conversations that users trigger “too often” or that look bot-like.
- Requiring minimum reading time: ensure users spend enough time reading messages before making a choice; otherwise, their signal may not reflect genuine preference.
These heuristics are combined with other filters to construct a cleaner preference dataset for PPO.
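A sketch of how these three heuristics could be applied to raw preference records is shown below; the field names and thresholds are illustrative placeholders, not the production values.

```python
import random
from collections import Counter, defaultdict

def filter_preferences(records,
                       max_per_user=200,        # cap each user's influence
                       max_repeats=5,           # drop over-repeated conversations
                       min_read_seconds=3.0):   # require minimal reading time
    """Apply the three heuristics: per-user cap, repetition filter, and a
    minimum reading-time requirement. `records` are dicts with hypothetical
    fields: user_id, conversation_hash, read_seconds."""
    conv_counts = Counter(r["conversation_hash"] for r in records)
    by_user = defaultdict(list)
    for r in records:
        if conv_counts[r["conversation_hash"]] > max_repeats:
            continue  # looks bot-like / over-triggered
        if r["read_seconds"] < min_read_seconds:
            continue  # user likely did not read before choosing
        by_user[r["user_id"]].append(r)

    kept = []
    for user_records in by_user.values():
        random.shuffle(user_records)
        kept.extend(user_records[:max_per_user])  # limit individual user influence
    return kept
```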
At this point, the fine-tuning stack has been described: SFT on user chats, distillation, and preference data preparation, all powered by user interactions.
Next, the focus moves to RL work.
The core challenge is aligning models for open-ended dialogue, where there is no single “correct” response and quality is inherently subjective.
Reinforcement learning can be used to optimize directly for the signals that matter in this setting (user preference, engagement, and retention) rather than only imitating reference responses. This is done with a PPO-based RL training framework, described in more detail below, with reward models and user-feedback pipelines built on top of it.
A core part of this RL work is reward modeling, because deciding which response is “better” is highly subjective in the dialogue domain.
Before diving deeper into reward modeling, it’s worth asking: how can models be evaluated at all?
Professional writers and annotators can be used to evaluate generations, scoring them along creative-writing dimensions as well as general quality metrics.
Reward model scores can also be used as an offline proxy (more on those below).
Ultimately, however, what really matters is online A/B testing, where the key metrics are engagement and retention.
An important caveat:
If offline metrics are optimized in isolation, the model that looks better offline might not actually improve engagement or retention when deployed in a real A/B test.
Offline metrics and reward model scores are treated as proxies, but online experiments are the final judge.
For reward modeling, approaches can go beyond standard “quality” and “preference” models.
In addition to these standard quality and preference reward models, engagement-based reward models can be built.
These models try to predict:
- Future number of turns: a regression model that predicts how many future turns a conversation will have.
- Whether the user continues the conversation: a classifier predicting if the user will send another message.
- Time to next response: a signal about the immediacy of continued engagement.
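One natural way to implement these three predictions is a shared encoder with separate heads, as in the sketch below; the encoder, pooling, and hidden size are placeholder assumptions, not the actual architecture.

```python
import torch
import torch.nn as nn

class EngagementRewardModel(nn.Module):
    """Shared encoder with three heads: future-turn count (regression),
    will-the-user-continue (classification), and time to next response
    (regression). The encoder is a placeholder for a transformer backbone."""
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder
        self.future_turns_head = nn.Linear(hidden_size, 1)   # predicted future turns
        self.continue_head = nn.Linear(hidden_size, 1)       # logit: user continues?
        self.time_to_reply_head = nn.Linear(hidden_size, 1)  # e.g. log-seconds

    def forward(self, conversation_tokens):
        h = self.encoder(conversation_tokens)  # assume (batch, hidden) pooled state
        return {
            "future_turns": self.future_turns_head(h).squeeze(-1),
            "p_continue": torch.sigmoid(self.continue_head(h)).squeeze(-1),
            "time_to_reply": self.time_to_reply_head(h).squeeze(-1),
        }
```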
Models can also be trained to predict longer-term retention, such as whether a user returns within a given window (e.g., 7-day retention).
Empirically, industry research reports quite high accuracies for such predictors, especially for retention prediction, which suggests the models may be learning signals beyond the quality of the individual response. This risk should be considered; however, using these predictors as rewards can still be beneficial. That said, retention-based rewarding is still experimental, and A/B test results for such rewards may not be finalized yet.
For some of the more objective quality dimensions (e.g., novel contribution), supervision can be obtained from LLM-judged data—using large models to annotate or compare responses.
Direct user feedback signals can also be used, such as the swipe/vote and post-swipe “choose” events described earlier.
There are other user signals that may not be prioritized yet if strong correlations with key KPIs are not found in data analysis.
A simple baseline for combining all these signals is a fixed weighted sum of the individual reward model scores.
These weights can also be learned to directly maximize a true business KPI, such as 7-day retention, using historical logs.
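One simple way to learn such weights from historical logs is a logistic regression from per-response reward-model scores to a binary KPI label such as 7-day retention. The sketch below assumes scikit-learn and is only one possible instantiation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_reward_weights(scores: np.ndarray, retained_7d: np.ndarray) -> np.ndarray:
    """scores: (N, K) matrix of K reward-model scores per logged response.
    retained_7d: (N,) binary labels (did the user come back within 7 days?).
    Returns one weight per reward model for a linear combined reward."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(scores, retained_7d)
    return clf.coef_.ravel()

def combined_reward(score_vector: np.ndarray, weights: np.ndarray) -> float:
    # Weighted sum of reward-model scores: the simple baseline described above.
    return float(np.dot(weights, score_vector))
```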
A more sophisticated reward function is multi-objective with constraints.
For example: maximize the engagement-based reward while requiring the quality and preference scores to stay above minimum thresholds, as sketched below.
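A toy version of this constrained formulation, implemented as a penalized scalar reward (the thresholds and penalty weight are made-up illustrative values):

```python
def constrained_reward(scores: dict,
                       quality_floor: float = 0.5,
                       preference_floor: float = 0.5,
                       penalty: float = 5.0) -> float:
    """Maximize engagement subject to quality and preference scores staying
    above thresholds, expressed as a penalized scalar reward for PPO."""
    r = scores["engagement"]
    r -= penalty * max(0.0, quality_floor - scores["quality"])
    r -= penalty * max(0.0, preference_floor - scores["preference"])
    return r
```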
On the RL side, PPO (Proximal Policy Optimization) can be used, similar to the original InstructGPT / RLHF setup: the policy is initialized from the SFT model, responses are scored by the reward model(s), and a KL penalty against the frozen SFT reference keeps the policy from drifting too far.
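A minimal sketch of the InstructGPT-style per-token reward shaping (reward-model score on the final token minus a KL penalty against the SFT reference); the coefficient and tensor layout are illustrative assumptions.

```python
import torch

def ppo_rewards(rm_score: torch.Tensor,
                policy_logprobs: torch.Tensor,
                ref_logprobs: torch.Tensor,
                beta: float = 0.05) -> torch.Tensor:
    """Shaped rewards for one response of T tokens.
    rm_score: scalar reward-model score for the full response.
    policy_logprobs / ref_logprobs: (T,) per-token log-probs under the current
    policy and the frozen SFT reference. The KL penalty keeps the policy close
    to the reference; the RM score is added on the final token."""
    kl = policy_logprobs - ref_logprobs   # per-token KL estimate
    rewards = -beta * kl
    rewards[-1] = rewards[-1] + rm_score
    return rewards
```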
A common question is: “You collect metrics online—are you doing online RL, updating weights in real time?”
The answer is: No, the training itself is offline.
A typical workflow is: deploy a model, collect user interactions and feedback online, refresh the reward models and preference data, run offline PPO updates, and then deploy the improved model.
This is a continuous but batched process, not fully online RL.
Another common question: “Conversations can be long, multi-turn, with delayed rewards. How can that be handled?”
This can be addressed in two ways:
Long-horizon reward models
Some reward models are explicitly designed to capture longer-term signals, such as the future-turn and retention predictors described above.
These provide more trajectory-level feedback rather than per-turn myopic rewards.
Multi-turn RL experiments
This combination helps deal with very long, multi-turn conversations where rewards are inherently delayed.
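One simple way to propagate such a delayed, conversation-level reward back to individual turns is a discounted return, as in the sketch below; the discount factor is an illustrative assumption.

```python
def per_turn_returns(turn_rewards: list[float],
                     trajectory_reward: float,
                     gamma: float = 0.95) -> list[float]:
    """Add the delayed conversation-level reward to the last turn, then
    compute discounted returns so earlier turns receive credit too."""
    rewards = list(turn_rewards)
    rewards[-1] += trajectory_reward
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))
```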
To summarize, an alignment stack for open-ended dialogue models combines:
- stable MoE training (gradient synchronization for replicated parameters),
- an efficient FP8 training and serving path,
- SFT and distillation on user chats,
- carefully filtered preference data, and
- PPO-based RL, grounded in quality, preference, and engagement/retention reward models.
This runs with an offline RL loop, iterating between deployment, data collection, reward model training, and policy updates. Handling noisy user feedback and long-horizon rewards is still an active area of experimentation, especially around retention-based rewards and multi-turn RL. Work is ongoing—both on the systems side (better FP8 kernels, more efficient training) and on the alignment side (richer reward models, better correlation with online KPIs, and safer, more engaging user experiences).