III. Model Architecture & Sequence Learning Framework
This section describes how the model learns sequential structure from token embeddings (Section 2), given a unified sequence containing both macro-events and asset-events.
We present:
- Mathematical definitions of the architecture
- The role of each layer
- Design rationale
- Connection to research assumptions
- Implementation-ready perspectives for development teams
3.1 Sequential Representation Learning — Core Concept
The fundamental hypothesis of this research is that meaningful relationships do not reside in isolated feature values, but emerge from the ordering of events and the temporal rhythm of their accumulation.
Therefore, the core model is not a regression model, but a sequence representation model.
Let the embedding set for asset $a$ be
$$E_a = (e_1, e_2, \dots, e_n), \qquad e_i \in \mathbb{R}^d,$$
where the $e_i$ are the asset-event and macro-event token embeddings from Section 2, ordered by occurrence time.
The sequence model's task is to find a function
$$f_\theta : (e_1, \dots, e_n) \longmapsto (h_1, \dots, h_n),$$
where
- $h_i$ = representation after the model processes the sequence up to position $i$
- $h_i$ incorporates the event's own characteristics, its relationships with preceding events, and the macro events present in the sequence
Semantically: $h_i$ represents "the model's understanding at that point in time," reflecting patterns currently forming.
3.2 Backbone Candidates: Why Sequence Models
A suitable model for this problem must support:
- Irregular and sparse sequences
- Long-horizon memory
- Multi-level contextual events (asset + macro)
- Non-linear interactions
Candidate architectures include:
- Transformer-based Event Sequence Model
- Temporal Convolutional Network (TCN)
- GRU / LSTM (as baselines)
This research selects the Transformer as the primary model, with TCN / GRU for experimental comparison.
Theoretical rationale:
- Transformer supports long-range dependencies via self-attention
- Well-suited for tokens comprising both micro and macro events
- Architecture aligns with the conceptual idea: "let events explain each other through attention"
3.3 Transformer-Based Event Sequence Model (Formal)
Given input $(e_1, \dots, e_n)$, stacked row-wise into $E \in \mathbb{R}^{n \times d}$,
we apply linear projections
$$Q = E W^Q, \qquad K = E W^K, \qquad V = E W^V,$$
then pass through scaled dot-product self-attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$
where $d_k$ is the key dimension of each head.
The output for head $h$ is
$$\mathrm{head}_h = \mathrm{Attention}\big(E W_h^Q,\; E W_h^K,\; E W_h^V\big).$$
Concatenating multiple heads:
$$\mathrm{MultiHead}(E) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^O.$$
This is followed by a position-wise feed-forward layer
$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2,$$
with residual connections and layer normalization around each sub-layer:
$$z_i = \mathrm{LayerNorm}\big(x_i + \mathrm{MultiHead}(x)_i\big), \qquad h_i = \mathrm{LayerNorm}\big(z_i + \mathrm{FFN}(z_i)\big).$$
Interpretation:
- The model does not read events sequentially one-by-one
- Instead, all events attend to and relate to each other
- Macro event tokens serve as "context anchors" that give meaning to the sequence
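As a concrete reference point, here is a minimal PyTorch sketch of the encoder described in Section 3.3, assuming pre-computed event embeddings of size `d_model` and a boolean padding mask; all class and argument names are illustrative, not a fixed interface.

```python
import torch
import torch.nn as nn

class EventSequenceEncoder(nn.Module):
    """Self-attention encoder over a unified (asset-event + macro-event) sequence."""

    def __init__(self, d_model: int = 128, n_heads: int = 4,
                 n_layers: int = 2, d_ff: int = 256, dropout: float = 0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=dropout, batch_first=True,
        )  # multi-head self-attention + FFN + residuals + LayerNorm, as in Section 3.3
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, event_emb: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # event_emb: (batch, seq_len, d_model), the embeddings e_1..e_n from Section 2
        # pad_mask:  (batch, seq_len) boolean, True at padded positions
        return self.encoder(event_emb, src_key_padding_mask=pad_mask)  # h_1..h_n
```

Stacking these layers yields the hidden states $h_1, \dots, h_n$ used throughout the rest of this section.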
3.4 Why Self-Attention Suits Macro-Event Tokens
When a macro event is interleaved in the sequence, for example:
[t1] MACRO_QE_START
[t2] FEATURE_A
[t3] FEATURE_B
[t4] FEATURE_C
The attention matrix automatically learns that events at $t_2, t_3, t_4$ should "bind" to the macro token at $t_1$.
Mathematically: if the attention weight between asset-event $i$ and macro-event $j$,
$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i^\top k_j}{\sqrt{d_k}}\right),$$
is high, then $h_i$ receives a large contribution $\alpha_{ij} v_j$ from the macro token.
This means the asset-side event representation is directly conditioned on macro context.
This is the heart of the methodology: rather than writing rules manually, we let the model discover "which macro events relate to which patterns."
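For post-hoc inspection of this binding, one practical (assumed, not prescribed) approach is to read out the attention weights of a standalone `nn.MultiheadAttention` layer, constructed with `batch_first=True`, and measure how much mass asset-event positions place on macro-token positions; the built-in `nn.TransformerEncoder` does not expose per-layer weights, which is why the sketch uses a separate layer.

```python
import torch
import torch.nn as nn

def macro_attention_mass(event_emb: torch.Tensor, is_macro: torch.Tensor,
                         attn: nn.MultiheadAttention) -> torch.Tensor:
    """Average attention mass that asset-event positions place on macro-token positions.

    event_emb: (batch, seq_len, d_model); is_macro: (batch, seq_len) boolean;
    attn: e.g. nn.MultiheadAttention(d_model, n_heads, batch_first=True).
    """
    # Self-attention over the whole sequence; weights are averaged over heads.
    _, weights = attn(event_emb, event_emb, event_emb,
                      need_weights=True, average_attn_weights=True)
    # weights[b, i, j] = alpha_ij: how much event i attends to event j
    mass_to_macro = (weights * is_macro.unsqueeze(1).float()).sum(dim=-1)  # (batch, seq_len)
    asset_rows = (~is_macro).float()
    # Average over asset-event rows only.
    return (mass_to_macro * asset_rows).sum(dim=-1) / asset_rows.sum(dim=-1).clamp(min=1.0)
```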
3.5 Positional & Temporal Conditioning
In real data, absolute position matters less than the time gaps between events.
Therefore, we use the time-gap embedding from Section 2.
This gives event representations richer meaning:
- Not merely "an event occurred"
- But "an event occurred X time units after the previous event"
This enables the model to learn distinctions such as:
- Same pattern, different pace (fast vs. slow) → potentially different meaning
- Macro shock + short-interval events → possibly more significant than macro + slow pattern
In other words: the model learns the "geometry of time," not just flat sequential order.
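One plausible realization of this time-gap conditioning, assuming (as an illustration, not a commitment of Section 2) that the gap embedding is added to the token embedding after log-scale bucketing of $\Delta t_i = t_i - t_{i-1}$:

```python
import torch
import torch.nn as nn

class TimeGapEmbedding(nn.Module):
    """Add an embedding of the elapsed time since the previous event to each token embedding."""

    def __init__(self, d_model: int = 128, n_buckets: int = 32):
        super().__init__()
        self.n_buckets = n_buckets
        self.gap_emb = nn.Embedding(n_buckets, d_model)

    def forward(self, event_emb: torch.Tensor, delta_t: torch.Tensor) -> torch.Tensor:
        # event_emb: (batch, seq_len, d_model); delta_t: (batch, seq_len), non-negative time gaps
        # Log-scale bucketing: short gaps keep fine resolution, long gaps are grouped coarsely.
        bucket = torch.log1p(delta_t).long().clamp(min=0, max=self.n_buckets - 1)
        return event_emb + self.gap_emb(bucket)
```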
3.6 Sequence-Level Representation
To assess future trends, we need to extract a representation for the entire sequence.
Given $h_1, \dots, h_n$, we construct a summary vector
$$u = \mathrm{Pool}(h_1, \dots, h_n).$$
Options include mean pooling or attention pooling (recommended).
Attention pooling:
$$\beta_i = \frac{\exp(w^\top h_i)}{\sum_{j=1}^{n} \exp(w^\top h_j)}, \qquad u = \sum_{i=1}^{n} \beta_i h_i.$$
The weights $\beta_i$ indicate which events are most important to the overall pattern, enabling post-hoc interpretation (explainability).
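A minimal attention-pooling module matching the formula above, using a single learned scoring vector $w$ (a small MLP scorer is an equally valid choice); the returned weights correspond to $\beta_i$.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Summarize per-event states h_1..h_n into one vector u with learned weights beta_i."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.score = nn.Linear(d_model, 1, bias=False)  # computes w^T h_i

    def forward(self, h: torch.Tensor, pad_mask: torch.Tensor):
        # h: (batch, seq_len, d_model); pad_mask: (batch, seq_len) boolean, True at padding
        scores = self.score(h).squeeze(-1)                     # (batch, seq_len)
        scores = scores.masked_fill(pad_mask, float("-inf"))   # padding gets zero weight
        beta = torch.softmax(scores, dim=-1)                   # beta_i, also usable for explanation
        u = torch.einsum("bs,bsd->bd", beta, h)                # u = sum_i beta_i h_i
        return u, beta
```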
3.7 Output Head — Probability of Outcome Event
Given the final representation $u$, we define
$$\hat{p} = \sigma\!\left(w_o^\top u + b_o\right),$$
the predicted probability of the outcome event.
Or, for multi-class classification,
$$\hat{\mathbf{p}} = \mathrm{softmax}(W_o u + \mathbf{b}_o).$$
Loss function (binary cross-entropy; categorical cross-entropy in the multi-class case):
$$\mathcal{L} = -\big[\, y \log \hat{p} + (1 - y) \log (1 - \hat{p}) \,\big].$$
Key implications:
- The model learns how strongly preceding sequences relate to outcome events
- This is not continuous price prediction
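A matching output head and loss for the binary case, sketched on logits; the `pos_weight` value is purely illustrative and anticipates the class-imbalance handling discussed in the next section.

```python
import torch
import torch.nn as nn

class OutcomeHead(nn.Module):
    """Map the sequence summary u to a logit for the outcome event."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        return self.proj(u).squeeze(-1)  # logit; sigmoid gives the probability p_hat

# Binary cross-entropy on logits; pos_weight (illustrative value) can counter rare outcome events.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
# loss = criterion(model_logits, outcome_labels.float())
```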
3.8 Regime-Conditioned Representation (Implicit Conditioning)
Since macro events reside within the same sequence, conditioning occurs implicitly:
We can express this semantically as
$$h_i = f_\theta\big(e_i \,\big|\, \text{asset events and macro events in the same sequence}\big).$$
This constitutes Regime-Aware Representation Learning without requiring separate models for each era.
3.9 Feature-Gated Attention (Optional Enhancement)
Reflecting the hypothesis that only certain features matter, and their importance depends on sequence context:
We modulate the attention weights with a per-event gate:
$$\tilde{\alpha}_{ij} = g_j\, \alpha_{ij}, \qquad g_j = \sigma\!\left(w_g^\top e_j\right),$$
so each event $j$ learns a gate $g_j \in (0, 1)$ controlling how strongly it can be attended to.
With a sparsity penalty on the gates:
$$\mathcal{L}_{\text{sparse}} = \lambda \sum_{j} |g_j|.$$
Semantically:
- The model selects "groups of important events"
- Without requiring manual specification
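A hedged single-head sketch of this gating idea; the re-normalization of the gated weights and the penalty weight `lam` are design assumptions rather than fixed choices.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Single-head self-attention whose weights are modulated by per-event gates g_j."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 1)   # w_g^T e_j
        self.scale = d_model ** -0.5

    def forward(self, e: torch.Tensor):
        # e: (batch, seq_len, d_model)
        alpha = torch.softmax(self.q(e) @ self.k(e).transpose(-2, -1) * self.scale, dim=-1)
        g = torch.sigmoid(self.gate(e)).squeeze(-1)        # (batch, seq_len), gates g_j
        gated = alpha * g.unsqueeze(1)                     # alpha~_ij = g_j * alpha_ij
        gated = gated / gated.sum(dim=-1, keepdim=True).clamp(min=1e-8)  # re-normalize rows
        return gated @ self.v(e), g

def gate_sparsity_loss(g: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    """L1 penalty pushing most gates toward zero; added to the main training loss."""
    return lam * g.abs().mean()
```

Re-normalizing the gated rows keeps each event's outgoing attention a proper distribution; dropping that step and letting the gates shrink the total contribution is an equally defensible variant.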
3.10 Implementation View — For Development Teams
Teams can view the structure as a pipeline:
raw events
→ build unified sequence (asset + macro)
→ event embedding
→ transformer sequence encoder
→ sequence summary (attention pooling)
→ output head
→ training objective
In frameworks such as PyTorch / JAX:
- Event = token
- Macro event = a token type
- Time gap = positional input
- Sequence model = standard Transformer layer
- Interpretability = attention map + pooling weights
No component requires hand-crafted rules — everything emerges from statistical learning.
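Composing the sketches from Sections 3.3 to 3.7 (EventSequenceEncoder, TimeGapEmbedding, AttentionPooling, OutcomeHead), the pipeline above reads roughly as follows; `vocab_size` and the `padding_idx` convention are assumptions about the tokenization in Section 2.

```python
import torch.nn as nn

class EventOutcomeModel(nn.Module):
    """raw events -> embedding -> transformer encoder -> attention pooling -> outcome logit."""

    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)  # asset + macro token types
        self.time_gap = TimeGapEmbedding(d_model)      # sketched in Section 3.5
        self.encoder = EventSequenceEncoder(d_model)   # sketched in Section 3.3
        self.pool = AttentionPooling(d_model)          # sketched in Section 3.6
        self.head = OutcomeHead(d_model)               # sketched in Section 3.7

    def forward(self, token_ids, delta_t, pad_mask):
        # token_ids: (batch, seq_len) int ids; delta_t: (batch, seq_len) time gaps;
        # pad_mask: (batch, seq_len) boolean, True at padding
        e = self.time_gap(self.token_emb(token_ids), delta_t)
        h = self.encoder(e, pad_mask)
        u, beta = self.pool(h, pad_mask)   # beta doubles as the explanation signal of Section 3.6
        return self.head(u), beta
```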
3.11 Research Rationale for This Architecture
This architecture addresses the core hypotheses:
- Supports long-horizon pattern accumulation
- Bridges micro ↔ macro levels
- Maintains a unified representation across all eras
- Allows some patterns to be era-specific while others persist across eras
- Enables post-hoc analysis: per-era attention maps, per-feature gating, per-context behavior
3.12 Connection to Next Section
This section explained "how the model learns sequences and context."
The next section covers Training Objective, Regularization, and Regime-Aware Evaluation, diving deeper into:
- Time-consistent training methodology
- Leakage prevention
- Strategies for outcome imbalance
- Post-training representation interpretation