III. Model Architecture & Sequence Learning Framework
This section describes how the model learns sequential structure from token embeddings (Section 2), given a unified sequence containing both macro-events and asset-events.
We present:
- Mathematical definitions of the architecture
- The role of each layer
- Design rationale
- Connection to research assumptions
- Implementation-ready perspectives for development teams
3.1 Sequential Representation Learning — Core Concept
The fundamental hypothesis of this research is that meaningful relationships do not reside in isolated feature values, but emerge from the ordering of events and the temporal rhythm of their accumulation.
Therefore, the core model is not a regression model, but a sequence representation model.
Let the embedding set for asset $a$ be
$$E_a = (e_1, e_2, \dots, e_n), \qquad e_i \in \mathbb{R}^d,$$
where the $e_i$ are the asset-event and macro-event token embeddings from Section 2, ordered by occurrence time.
The sequence model's task is to find a function
$$f_\theta : (e_1, \dots, e_n) \longmapsto (h_1, \dots, h_n),$$
where
- $h_i$ = representation after the model processes the sequence up to position $i$
- $h_i$ incorporates the event's own characteristics, its relationships with preceding events, and the macro events present in the sequence
Semantically: $h_i$ represents "the model's understanding at that point in time," reflecting patterns currently forming.
3.2 Backbone Candidates: Why Sequence Models
A suitable model for this problem must support:
- Irregular and sparse sequences
- Long-horizon memory
- Multi-level contextual events (asset + macro)
- Non-linear interactions
Candidate architectures include:
- Transformer-based Event Sequence Model
- Temporal Convolutional Network (TCN)
- GRU / LSTM (as baselines)
This research selects the Transformer as the primary model, with TCN / GRU for experimental comparison.
Theoretical rationale:
- Transformer supports long-range dependencies via self-attention
- Well-suited for tokens comprising both micro and macro events
- Architecture aligns with the conceptual idea: "let events explain each other through attention"
3.3 Transformer-Based Event Sequence Model (Formal)
Given input $(e_1, \dots, e_n)$, stacked row-wise into $E \in \mathbb{R}^{n \times d}$,
we apply linear projections
$$Q = E W^Q, \qquad K = E W^K, \qquad V = E W^V,$$
then pass through scaled dot-product self-attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V,$$
where $d_k$ is the key dimension of each head.
The output for head $h$ is
$$\mathrm{head}_h = \mathrm{Attention}\big(E W_h^Q,\; E W_h^K,\; E W_h^V\big).$$
Concatenating multiple heads:
$$\mathrm{MultiHead}(E) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^O.$$
This is followed by a position-wise feed-forward layer
$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2,$$
with residual connections and layer normalization around each sub-layer:
$$z_i = \mathrm{LayerNorm}\big(x_i + \mathrm{MultiHead}(x)_i\big), \qquad h_i = \mathrm{LayerNorm}\big(z_i + \mathrm{FFN}(z_i)\big).$$
Interpretation:
- The model does not read events sequentially one-by-one
- Instead, all events attend to and relate to each other
- Macro event tokens serve as "context anchors" that give meaning to the sequence
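As a concrete reference point, here is a minimal PyTorch sketch of the encoder described in Section 3.3, assuming pre-computed event embeddings of size `d_model` and a boolean padding mask; all class and argument names are illustrative, not a fixed interface.

```python
import torch
import torch.nn as nn

class EventSequenceEncoder(nn.Module):
    """Self-attention encoder over a unified (asset-event + macro-event) sequence."""

    def __init__(self, d_model: int = 128, n_heads: int = 4,
                 n_layers: int = 2, d_ff: int = 256, dropout: float = 0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_ff,
            dropout=dropout, batch_first=True,
        )  # multi-head self-attention + FFN + residuals + LayerNorm, as in Section 3.3
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, event_emb: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # event_emb: (batch, seq_len, d_model), the embeddings e_1..e_n from Section 2
        # pad_mask:  (batch, seq_len) boolean, True at padded positions
        return self.encoder(event_emb, src_key_padding_mask=pad_mask)  # h_1..h_n
```

Stacking these layers yields the hidden states $h_1, \dots, h_n$ used throughout the rest of this section.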
3.4 Why Self-Attention Suits Macro-Event Tokens
When a macro event is interleaved in the sequence, for example:
[t1] MACRO_QE_START
[t2] FEATURE_A
[t3] FEATURE_B
[t4] FEATURE_C
The attention matrix automatically learns that events at $t_2, t_3, t_4$ should "bind" to the macro token at $t_1$.
Mathematically: if the attention weight between asset-event $i$ and macro-event $j$,
$$\alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{q_i^\top k_j}{\sqrt{d_k}}\right),$$
is high, then $h_i$ receives a large contribution $\alpha_{ij} v_j$ from the macro token.
This means the asset-side event representation is directly conditioned on macro context.
This is the heart of the methodology: rather than writing rules manually, we let the model discover "which macro events relate to which patterns."
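For post-hoc inspection of this binding, one practical (assumed, not prescribed) approach is to read out the attention weights of a standalone `nn.MultiheadAttention` layer, constructed with `batch_first=True`, and measure how much mass asset-event positions place on macro-token positions; the built-in `nn.TransformerEncoder` does not expose per-layer weights, which is why the sketch uses a separate layer.

```python
import torch
import torch.nn as nn

def macro_attention_mass(event_emb: torch.Tensor, is_macro: torch.Tensor,
                         attn: nn.MultiheadAttention) -> torch.Tensor:
    """Average attention mass that asset-event positions place on macro-token positions.

    event_emb: (batch, seq_len, d_model); is_macro: (batch, seq_len) boolean;
    attn: e.g. nn.MultiheadAttention(d_model, n_heads, batch_first=True).
    """
    # Self-attention over the whole sequence; weights are averaged over heads.
    _, weights = attn(event_emb, event_emb, event_emb,
                      need_weights=True, average_attn_weights=True)
    # weights[b, i, j] = alpha_ij: how much event i attends to event j
    mass_to_macro = (weights * is_macro.unsqueeze(1).float()).sum(dim=-1)  # (batch, seq_len)
    asset_rows = (~is_macro).float()
    # Average over asset-event rows only.
    return (mass_to_macro * asset_rows).sum(dim=-1) / asset_rows.sum(dim=-1).clamp(min=1.0)
```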
3.5 Positional & Temporal Conditioning
In real data, absolute position matters less than the time gaps between events.
Therefore, we use the time-gap embedding from Section 2.
This gives event representations richer meaning:
- Not merely "an event occurred"
- But "an event occurred X time units after the previous event"
This enables the model to learn distinctions such as:
- Same pattern, different pace (fast vs. slow) → potentially different meaning
- Macro shock + short-interval events → possibly more significant than macro + slow pattern
In other words: the model learns the "geometry of time," not just flat sequential order.
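One plausible realization of this time-gap conditioning, assuming (as an illustration, not a commitment of Section 2) that the gap embedding is added to the token embedding after log-scale bucketing of $\Delta t_i = t_i - t_{i-1}$:

```python
import torch
import torch.nn as nn

class TimeGapEmbedding(nn.Module):
    """Add an embedding of the elapsed time since the previous event to each token embedding."""

    def __init__(self, d_model: int = 128, n_buckets: int = 32):
        super().__init__()
        self.n_buckets = n_buckets
        self.gap_emb = nn.Embedding(n_buckets, d_model)

    def forward(self, event_emb: torch.Tensor, delta_t: torch.Tensor) -> torch.Tensor:
        # event_emb: (batch, seq_len, d_model); delta_t: (batch, seq_len), non-negative time gaps
        # Log-scale bucketing: short gaps keep fine resolution, long gaps are grouped coarsely.
        bucket = torch.log1p(delta_t).long().clamp(min=0, max=self.n_buckets - 1)
        return event_emb + self.gap_emb(bucket)
```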
3.6 Sequence-Level Representation
To assess future trends, we need to extract a representation for the entire sequence.
Given $h_1, \dots, h_n$, we construct a summary vector
$$u = \mathrm{Pool}(h_1, \dots, h_n).$$
Options include mean pooling or attention pooling (recommended).
Attention pooling:
$$\beta_i = \frac{\exp(w^\top h_i)}{\sum_{j=1}^{n} \exp(w^\top h_j)}, \qquad u = \sum_{i=1}^{n} \beta_i h_i.$$
The weights $\beta_i$ indicate which events are most important to the overall pattern, enabling post-hoc interpretation (explainability).
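A minimal attention-pooling module matching the formula above, using a single learned scoring vector $w$ (a small MLP scorer is an equally valid choice); the returned weights correspond to $\beta_i$.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Summarize per-event states h_1..h_n into one vector u with learned weights beta_i."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.score = nn.Linear(d_model, 1, bias=False)  # computes w^T h_i

    def forward(self, h: torch.Tensor, pad_mask: torch.Tensor):
        # h: (batch, seq_len, d_model); pad_mask: (batch, seq_len) boolean, True at padding
        scores = self.score(h).squeeze(-1)                     # (batch, seq_len)
        scores = scores.masked_fill(pad_mask, float("-inf"))   # padding gets zero weight
        beta = torch.softmax(scores, dim=-1)                   # beta_i, also usable for explanation
        u = torch.einsum("bs,bsd->bd", beta, h)                # u = sum_i beta_i h_i
        return u, beta
```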
3.7 Output Head — Probability of Outcome Event
Given the final representation $u$, we define
$$\hat{p} = \sigma\!\left(w_o^\top u + b_o\right),$$
the predicted probability of the outcome event.
Or, for multi-class classification,
$$\hat{\mathbf{p}} = \mathrm{softmax}(W_o u + \mathbf{b}_o).$$
Loss function (binary cross-entropy; categorical cross-entropy in the multi-class case):
$$\mathcal{L} = -\big[\, y \log \hat{p} + (1 - y) \log (1 - \hat{p}) \,\big].$$
Key implications:
- The model learns how strongly preceding sequences relate to outcome events
- This is not continuous price prediction
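A matching output head and loss for the binary case, sketched on logits; the `pos_weight` value is purely illustrative and anticipates the class-imbalance handling discussed in the next section.

```python
import torch
import torch.nn as nn

class OutcomeHead(nn.Module):
    """Map the sequence summary u to a logit for the outcome event."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        return self.proj(u).squeeze(-1)  # logit; sigmoid gives the probability p_hat

# Binary cross-entropy on logits; pos_weight (illustrative value) can counter rare outcome events.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
# loss = criterion(model_logits, outcome_labels.float())
```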
3.8 Regime-Conditioned Representation (Implicit Conditioning)
Since macro events reside within the same sequence, conditioning occurs implicitly:
We can express this semantically as
$$h_i = f_\theta\big(e_i \,\big|\, \text{asset events and macro events in the same sequence}\big).$$
This constitutes Regime-Aware Representation Learning without requiring separate models for each era.
3.9 Feature-Gated Attention (Optional Enhancement)
Reflecting the hypothesis that only certain features matter, and their importance depends on sequence context:
We modulate the attention weights with a per-event gate:
$$\tilde{\alpha}_{ij} = g_j\, \alpha_{ij}, \qquad g_j = \sigma\!\left(w_g^\top e_j\right),$$
so each event $j$ learns a gate $g_j \in (0, 1)$ controlling how strongly it can be attended to.
With a sparsity penalty on the gates:
$$\mathcal{L}_{\text{sparse}} = \lambda \sum_{j} |g_j|.$$
Semantically:
- The model selects "groups of important events"
- Without requiring manual specification
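A hedged single-head sketch of this gating idea; the re-normalization of the gated weights and the penalty weight `lam` are design assumptions rather than fixed choices.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Single-head self-attention whose weights are modulated by per-event gates g_j."""

    def __init__(self, d_model: int = 128):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, 1)   # w_g^T e_j
        self.scale = d_model ** -0.5

    def forward(self, e: torch.Tensor):
        # e: (batch, seq_len, d_model)
        alpha = torch.softmax(self.q(e) @ self.k(e).transpose(-2, -1) * self.scale, dim=-1)
        g = torch.sigmoid(self.gate(e)).squeeze(-1)        # (batch, seq_len), gates g_j
        gated = alpha * g.unsqueeze(1)                     # alpha~_ij = g_j * alpha_ij
        gated = gated / gated.sum(dim=-1, keepdim=True).clamp(min=1e-8)  # re-normalize rows
        return gated @ self.v(e), g

def gate_sparsity_loss(g: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    """L1 penalty pushing most gates toward zero; added to the main training loss."""
    return lam * g.abs().mean()
```

Re-normalizing the gated rows keeps each event's outgoing attention a proper distribution; dropping that step and letting the gates shrink the total contribution is an equally defensible variant.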
3.10 Implementation View — For Development Teams
Teams can view the structure as a pipeline:
raw events
→ build unified sequence (asset + macro)
→ event embedding
→ transformer sequence encoder
→ sequence summary (attention pooling)
→ output head
→ training objective
In frameworks such as PyTorch / JAX:
- Event = token
- Macro event = a token type
- Time gap = positional input
- Sequence model = standard Transformer layer
- Interpretability = attention map + pooling weights
No component requires hand-crafted rules — everything emerges from statistical learning.
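Composing the sketches from Sections 3.3 to 3.7 (EventSequenceEncoder, TimeGapEmbedding, AttentionPooling, OutcomeHead), the pipeline above reads roughly as follows; `vocab_size` and the `padding_idx` convention are assumptions about the tokenization in Section 2.

```python
import torch.nn as nn

class EventOutcomeModel(nn.Module):
    """raw events -> embedding -> transformer encoder -> attention pooling -> outcome logit."""

    def __init__(self, vocab_size: int, d_model: int = 128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)  # asset + macro token types
        self.time_gap = TimeGapEmbedding(d_model)      # sketched in Section 3.5
        self.encoder = EventSequenceEncoder(d_model)   # sketched in Section 3.3
        self.pool = AttentionPooling(d_model)          # sketched in Section 3.6
        self.head = OutcomeHead(d_model)               # sketched in Section 3.7

    def forward(self, token_ids, delta_t, pad_mask):
        # token_ids: (batch, seq_len) int ids; delta_t: (batch, seq_len) time gaps;
        # pad_mask: (batch, seq_len) boolean, True at padding
        e = self.time_gap(self.token_emb(token_ids), delta_t)
        h = self.encoder(e, pad_mask)
        u, beta = self.pool(h, pad_mask)   # beta doubles as the explanation signal of Section 3.6
        return self.head(u), beta
```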
3.11 Research Rationale for This Architecture
This architecture addresses the core hypotheses:
- Supports long-horizon pattern accumulation
- Bridges micro ↔ macro levels
- Maintains a unified representation across all eras
- Allows some patterns to be era-specific while others persist across eras
- Enables post-hoc analysis: per-era attention maps, per-feature gating, per-context behavior
3.12 Connection to Next Section
This section explained "how the model learns sequences and context."
The next section covers Training Objective, Regularization, and Regime-Aware Evaluation, diving deeper into:
- Time-consistent training methodology
- Leakage prevention
- Strategies for outcome imbalance
- Post-training representation interpretation