Event-Driven Representation Learning in Sparse Financial Time Series

A Macro-Contextual Conceptual Framework and Methodology

Niran Pravithana

IV. Training Objective, Optimization Strategy & Learning Constraints

This section describes how the model from Section 3 is trained under temporal constraints, event imbalance, and data leakage risks, while establishing clear and verifiable statistical objectives.

Section objectives:

  • Define the formal objective function
  • Describe pipeline-level leakage prevention
  • Address class imbalance and outcome sparsity
  • Explain regularization for robustness
  • Establish the regime-aware training framework

4.1 Learning Setting — Time-Consistent Supervised Learning

Let the sequence for asset $a$ at evaluation time $t$ be:

$$\tilde{\mathcal{S}}^a_{(-\infty,t)} = \{z_i^a \mid t_i^a < t\}$$

and define the outcome event (from Section 1):

$$y_t^a \in \{-1, 0, 1\}$$

The objective is to estimate

$$\hat{p} = \hat{p}(y_t^a \mid \tilde{\mathcal{S}}^a_{(-\infty,t)})$$

using only information available strictly before time $t$.

This constitutes a causally valid learning setting, distinct from time-agnostic forecasting.
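The causal restriction above reduces to a simple filter over timestamped events. A minimal sketch, in which the `Event` record and its field names are illustrative assumptions rather than the framework's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Event:
    t: float   # event timestamp t_i
    z: dict    # token payload z_i

def causal_sequence(events, t_eval):
    """Return {z_i | t_i < t_eval}: only events strictly before the
    evaluation time, sorted chronologically."""
    return [e for e in sorted(events, key=lambda e: e.t) if e.t < t_eval]
```

The strict inequality matters: an event stamped exactly at $t$ is excluded, because it could encode the outcome being predicted.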

4.2 Time-Based Dataset Splitting (No Random Shuffle)

To prevent leakage, dataset splitting is performed temporally:

$$\text{Train} < \text{Validation} < \text{Test}$$

with intervals in chronological order only:

$$[0, T_{train}] < [T_{val}^{start}, T_{val}^{end}] < [T_{test}^{start}, T_{test}^{end}]$$

Random splitting is prohibited because:

  • Future events may indirectly appear in training
  • Macro-regime patterns may leak across splits

Pipeline enforcement principles:

  1. Sequence builder restricts input events to times strictly before the cutoff
  2. Label builder references only future outcomes after cutoff
  3. Model loader verifies time-consistency
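A temporal split under these principles is straightforward to enforce in code. A sketch with hypothetical boundary parameters, indexing samples by their evaluation timestamps:

```python
def temporal_split(timestamps, t_train_end, t_val_end):
    """Chronological split with no shuffling: samples at or before
    t_train_end train, those up to t_val_end validate, later ones test."""
    train, val, test = [], [], []
    for i, t in enumerate(timestamps):
        if t <= t_train_end:
            train.append(i)
        elif t <= t_val_end:
            val.append(i)
        else:
            test.append(i)
    return train, val, test
```

Because membership depends only on the timestamp, no future sample can land in an earlier split, which is exactly what random shuffling fails to guarantee.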

4.3 Training Objective Function

Let the final sequence representation be $u_t^a$.

Probability estimation function:

$$\hat{p}(y_t^a = k) = \frac{\exp(w_k^\top u_t^a)}{\sum_j \exp(w_j^\top u_t^a)}$$

Define the cross-entropy loss:

$$\mathcal{L}_{CE} = -\sum_{(a,t)} \sum_{k} \mathbb{1}[y_t^a = k] \log \hat{p}(y_t^a = k)$$
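The softmax head and one cross-entropy term can be sketched directly from the two formulas above; the function names are illustrative, and a max-logit subtraction is added for numerical stability:

```python
import math

def softmax_probs(u, W):
    """p(y = k) = exp(w_k . u) / sum_j exp(w_j . u); W is a list of
    per-class weight vectors w_k."""
    logits = [sum(wk * uk for wk, uk in zip(w, u)) for w in W]
    m = max(logits)                      # stabilize before exponentiating
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(u, W, y_index):
    """-log p(y) for the true class: one summand of L_CE."""
    return -math.log(softmax_probs(u, W)[y_index])
```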

4.4 Class Imbalance & Event-Sparsity Handling

Since outcome events are typically rare, we define a weighted cross-entropy loss:

$$\mathcal{L}_{WCE} = -\sum_{(a,t)} \omega_{y_t^a} \log \hat{p}(y_t^a)$$

where

  • $\omega_k \propto 1/\text{freq}(k)$
  • Alternatively, use focal loss to emphasize hard examples:

$$\mathcal{L}_{FL} = -\sum_{(a,t)} (1 - \hat{p}(y_t^a))^\gamma \log \hat{p}(y_t^a)$$

Research rationale: the goal is not artificial class balancing, but preventing the model from "ignoring rare but important events."
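Both devices can be sketched in a few lines. The normalization of the inverse-frequency weights (rescaling so the mean weight is 1) is an assumption for illustration, not prescribed by the text:

```python
import math
from collections import Counter

def inverse_freq_weights(labels):
    """omega_k proportional to 1/freq(k), rescaled so the mean weight is 1."""
    counts = Counter(labels)
    raw = {k: 1.0 / c for k, c in counts.items()}
    scale = len(raw) / sum(raw.values())
    return {k: w * scale for k, w in raw.items()}

def focal_term(p_true, gamma=2.0):
    """(1 - p)^gamma * (-log p): one summand of L_FL. Confident correct
    predictions contribute almost nothing, so hard examples dominate."""
    return (1.0 - p_true) ** gamma * -math.log(p_true)
```

Rarer outcome classes receive larger weights, and the focal factor suppresses easy examples rather than resampling the data, consistent with the rationale above.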

4.5 Regularization for Robust Pattern Learning

To avoid overfitting to short-term noise, we add regularization components.

(1) Feature-Gating Sparsity

$$\Omega_{gate} = \lambda_g \|\alpha\|_1$$

Encourages the representation to select only necessary features.

(2) Token Dropout / Event Masking

At the sequence level:

$$z'_i = \begin{cases} z_i, & \text{with prob } (1-p)\\ \varnothing, & \text{with prob } p \end{cases}$$

Semantic effects:

  • Forces the model to learn patterns not dependent on any single token
  • Increases robustness to missing signals
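The masking rule above amounts to an independent Bernoulli keep/drop per token. A minimal sketch, assuming masking is applied only at training time:

```python
import random

def token_dropout(sequence, p, rng=None):
    """Drop each token independently with probability p (keep with 1 - p).
    Evaluation should always see the full, unmasked sequence."""
    rng = rng or random.Random()
    return [z for z in sequence if rng.random() >= p]
```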

(3) Temporal Smoothing Penalty

To discourage representations that fluctuate abnormally between consecutive steps:

$$\Omega_{temp} = \lambda_t \sum_i \|h_i - h_{i-1}\|_2^2$$

This encodes the prior that patterns should accumulate gradually, not shift erratically due to isolated noise.

(4) Total Objective

$$\mathcal{L}_{total} = \mathcal{L}_{task} + \Omega_{gate} + \Omega_{temp} + \text{(augmentation penalty)}$$

where $\mathcal{L}_{task}$ is the weighted or focal classification loss from Section 4.4, and the augmentation penalty denotes any additional loss contribution induced by stochastic augmentations such as token masking.
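The two explicit penalties compose with the task loss as a plain sum. A sketch with assumed default coefficients (the paper does not fix $\lambda_g$ or $\lambda_t$); token masking acts through the task loss rather than as a separate term:

```python
import numpy as np

def gate_sparsity_penalty(alpha, lam_g):
    """Omega_gate = lambda_g * ||alpha||_1 over the feature-gate vector."""
    return lam_g * np.abs(alpha).sum()

def temporal_smoothing_penalty(H, lam_t):
    """Omega_temp = lambda_t * sum_i ||h_i - h_{i-1}||_2^2 for hidden
    states H of shape (num_steps, dim)."""
    diffs = np.diff(H, axis=0)
    return lam_t * (diffs ** 2).sum()

def total_loss(task_loss, alpha, H, lam_g=1e-3, lam_t=1e-2):
    """L_total = L_task + Omega_gate + Omega_temp."""
    return (task_loss
            + gate_sparsity_penalty(alpha, lam_g)
            + temporal_smoothing_penalty(H, lam_t))
```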

4.6 Macro-Aware Training as Context, Not Filter

Critically, regimes are not used to "split models"; they serve as context within a unified sequence.

Therefore, the task loss is pooled across all eras rather than fit per era:

$$\mathcal{L}_{task} = \sum_{\text{all eras}} \mathcal{L}_{era}$$

while the representation $u_t^a$ remains conditioned on the macro tokens in the sequence.

This enables post-hoc testing, without discarding any data, of:

  • Which patterns persist across eras
  • Which patterns are era-specific

4.7 Leakage Prevention — Formal Checklist

A model is considered invalid if any of the following occur:

  1. Future events are present in the input sequence
  2. Macro-events are defined using hindsight
  3. Price-derived features indirectly use future information
  4. Validation/test sets overlap with training regime data

We define a verification function:

$$\text{CheckLeakage}(\mathcal{S}) = \begin{cases} \text{True} & \text{if time-causality violated}\\ \text{False} & \text{otherwise} \end{cases}$$

This check is applied in the pipeline before every training run.
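A minimal executable form of the CheckLeakage predicate, assuming each training sample carries its event timestamps and its label's evaluation time:

```python
def check_leakage(event_times, label_time):
    """Returns True when time-causality is violated: some input event
    occurs at or after the label's evaluation time."""
    return any(t >= label_time for t in event_times)
```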

4.8 Training as Representation Learning — Not Trading System

Research position: the learning objective is representation for pattern investigation, not performance optimization for trading.

Therefore, results are interpreted within this framework:

  • Representation stability across eras
  • Sequence-to-outcome relationships
  • Not direct economic returns

4.9 Implementation View — For Development Teams

Training Pipeline

build causal sequence
 → encode tokens
 → transformer sequence encoder
 → attention pooling
 → output head
 → weighted/focal loss
 → add sparsity + temporal penalties
 → optimize

Validation Loop (time-aware)

  • Fixed forward-rolling windows
  • Prohibit look-ahead
  • Explicit macro-context in sequence
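The fixed forward-rolling windows can be generated deterministically; the span and step parameters below are illustrative assumptions:

```python
def forward_rolling_windows(t_start, t_end, train_span, val_span, step):
    """Enumerate (train, validation) interval pairs that only move forward
    in time, so each validation window lies strictly after its training
    window and look-ahead is impossible by construction."""
    windows = []
    t = t_start
    while t + train_span + val_span <= t_end:
        windows.append(((t, t + train_span),
                        (t + train_span, t + train_span + val_span)))
        t += step
    return windows
```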

4.10 Connection to Next Section

This section established:

  • Learning objectives
  • Temporal constraints
  • Imbalance handling
  • The role of macro context in training

The next section covers: Regime-Conditioned Analysis & Post-Training Diagnostics (post-training pattern analysis, per-era behavior, explainability).