VI. Experimental Design, Dataset Construction & Reproducibility Framework
This section establishes the methodological standards for the entire research:
- How data is prepared
- How labels and event sequences are constructed
- How data partitioning and time periods are designed
- How experiments are conducted to ensure unbiased and reproducible results
Formally: this section constitutes the "research protocol" ensuring that results derive from transparent, reproducible processes rather than trial-and-error tuning.
6.1 Dataset Definition — Multi-Asset Event-Time Panel
Let the asset universe be
$$\mathcal{A} = \{a_1, \dots, a_N\}$$
over a data time range $[T_{\min}, T_{\max}]$. For asset $a$, the unified event set (from Section 2) is
$$\mathcal{E}^a = \{(t_i^a, e_i^a)\}_{i=1}^{n_a}, \qquad t_i^a \in [T_{\min}, T_{\max}],$$
with discrete-time prices $P_t^a$. The complete system dataset is
$$\mathcal{D} = \bigl\{\bigl(\mathcal{E}^a,\ \{P_t^a\}\bigr)\bigr\}_{a \in \mathcal{A}}.$$
Structurally:
- This is event-time panel data
- Not based on uniform-interval sampling
- Emphasizing the "event → sequence → outcome" structure
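The event-time panel structure above can be sketched as plain Python containers. The class and field names (`Event`, `AssetPanel`) are illustrative, not part of the protocol; the point is that each asset carries an irregular, time-sorted event list alongside a discrete price series, rather than a uniformly sampled grid.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Event:
    """One timestamped event for a single asset (or a macro event)."""
    t: float                # event timestamp
    kind: str               # event type, e.g. "indicator_cross"
    payload: dict = field(default_factory=dict)

@dataclass
class AssetPanel:
    """Event-time panel for one asset: irregular events plus a price series."""
    asset: str
    events: list            # list[Event], sorted by t (irregular spacing)
    prices: dict            # {t: price}, discrete-time observations

# The full dataset D is then a mapping: asset id -> AssetPanel.
panel = AssetPanel(
    asset="AAA",
    events=[Event(t=1.0, kind="factor_shift"), Event(t=3.0, kind="indicator_cross")],
    prices={0.0: 100.0, 1.0: 101.2, 2.0: 99.8, 3.0: 100.5},
)
```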
6.2 Event Construction Protocol (Asset & Macro Levels)
6.2.1 Asset-Level Event Extraction
Events derive from various feature types:
- Boolean condition triggers
- Factor-state transitions
- Indicator crossing events
- Structural / fundamental signals
Define the event generator
$$g : \mathcal{F}_{\le t}^a \;\mapsto\; (t, e) \in \mathcal{E}^a,$$
where $\mathcal{F}_{\le t}^a$ denotes the features of asset $a$ observable up to time $t$.
Requirements:
- Events must derive from information known at that time
- Back-filled data is prohibited
- Timestamps must strictly precede outcomes
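A minimal sketch of the point-in-time discipline these requirements impose. The trigger rules (`rsi`, moving-average cross) and the function name `generate_events` are hypothetical examples, not the protocol's actual feature set; the essential part is that the generator only consumes features timestamped at or before the observation time, and fails loudly otherwise.

```python
def generate_events(features_by_time, now):
    """Asset-level event generator g: emits (t, event_type) pairs using only
    information known at each timestamp t <= now (no back-filled data)."""
    events = []
    for t in sorted(features_by_time):
        if t > now:
            # Guard against future leakage: refuse features dated after `now`.
            raise ValueError(f"feature timestamped {t} lies after observation time {now}")
        feats = features_by_time[t]
        # Boolean condition trigger (illustrative rule)
        if feats.get("rsi", 50.0) < 30.0:
            events.append((t, "rsi_oversold"))
        # Indicator crossing event (illustrative rule)
        if feats.get("ma_fast", 0.0) > feats.get("ma_slow", 0.0) and not feats.get("was_above", False):
            events.append((t, "ma_cross_up"))
    return events
```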
6.2.2 Macro-Level Event Definition
The macro event set
$$\mathcal{E}^{\text{macro}} = \{(t_j, m_j)\}_{j=1}^{n_m}$$
must be defined according to ex-ante observable rules, such as:
- Official QE start dates from announcements
- Interest rate change dates
- Shock dates recorded in public sources
Hindsight definitions are prohibited, such as "retrospectively considering this period a crisis."
Source documentation must be specified and definitions frozen before training begins.
6.2.3 Unified Event Merge Procedure
The sequence merge process (for asset $a$):
$$\mathcal{S}^a = \operatorname{sort}_t\!\bigl(\mathcal{E}^a \cup \mathcal{E}^{\text{macro}}\bigr).$$
Enforced invariants:
- Non-decreasing time: $t_1^a \le t_2^a \le \dots$
- Identical macro events across all assets
- No retroactive event creation
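The merge and its invariants can be expressed directly, assuming both input streams are already time-sorted lists of `(t, event)` pairs (the function name `merge_event_streams` is illustrative). Because the same macro stream is passed for every asset, the "identical macro events across all assets" invariant holds by construction, and the non-decreasing-time invariant is asserted explicitly.

```python
import heapq

def merge_event_streams(asset_events, macro_events):
    """Merge asset-level and macro-level event streams (each sorted by time)
    into one unified sequence; macro events are shared across all assets."""
    merged = list(heapq.merge(asset_events, macro_events, key=lambda ev: ev[0]))
    # Invariant: non-decreasing time t_1 <= t_2 <= ...
    assert all(merged[i][0] <= merged[i + 1][0] for i in range(len(merged) - 1))
    return merged
```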
6.3 Outcome Label Construction
Let future returns be
$$r_{t \to t+\Delta}^a = \frac{P_{t+\Delta}^a - P_t^a}{P_t^a}.$$
Define the outcome function $y_t^a = f\!\bigl(r_{t \to t+\Delta}^a\bigr)$. Threshold-based example:
$$y_t^a = \begin{cases} +1, & r_{t \to t+\Delta}^a \ge \tau \\ 0, & \lvert r_{t \to t+\Delta}^a \rvert < \tau \\ -1, & r_{t \to t+\Delta}^a \le -\tau \end{cases}$$
Critical requirements:
- $y_t^a$ must use only prices in the interval $[t, t+\Delta]$
- Median / forward fill across future periods is prohibited
- $\tau$ and $\Delta$ must be specified before experiments and locked
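The threshold-based label can be sketched as follows (the name `make_label` is illustrative). Note the two requirements embedded in the code: only the prices at $t$ and $t+\Delta$ are touched, and a missing future price yields `None` rather than any filled-in value.

```python
def make_label(prices, t, delta, tau):
    """Threshold-based outcome y_t^a using only prices in [t, t + delta].
    tau and delta must be fixed and locked before experiments begin."""
    if t not in prices or (t + delta) not in prices:
        return None  # no forward fill / interpolation across future periods
    r = (prices[t + delta] - prices[t]) / prices[t]
    if r >= tau:
        return 1
    if r <= -tau:
        return -1
    return 0
```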
6.4 Causal-Safe Training Window Construction
For each sample at time $t$, the model input may contain only events whose timestamps do not exceed $t$:
$$X_t^a = \{(t_i^a, e_i^a) : t_i^a \le t\}.$$
Rolling retrospective windows that consume future data are prohibited.
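A minimal sketch of this causal-window rule (the helper name `causal_window` is an assumption): the input at time $t$ is a strict filter on event timestamps, and any length cap trims from the past, never by reaching into the future.

```python
def causal_window(events, t, max_len=None):
    """Training input X_t: only events with timestamp <= t.
    Windows that would consume future events are excluded by construction."""
    window = [ev for ev in events if ev[0] <= t]
    if max_len is not None:
        window = window[-max_len:]  # cap length by dropping the oldest events
    return window
```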
6.5 Temporal Splitting & Forward Evaluation Protocol
To maintain time-causal validity, partition the time range into:
- Train: $[T_0, T_1)$
- Validation: $[T_1, T_2)$
- Test: $[T_2, T_3)$
with
$$T_0 < T_1 < T_2 < T_3,$$
and rolling-forward evaluation: all three windows advance together in fixed steps, producing a sequence of out-of-sample test segments.
Benefits:
- Verification of performance stability over time
- Reduced risk of selecting biased periods
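The rolling-forward protocol can be sketched as a generator of interval triples (names and the half-open-interval convention are illustrative assumptions). Each yielded triple satisfies train < validation < test in time, and successive triples advance by a fixed step, so stability can be assessed across multiple test segments.

```python
def rolling_splits(t_min, t_max, train_len, val_len, test_len, step):
    """Yield (train, val, test) half-open time intervals that roll forward,
    preserving train < val < test so no split ever sees future data."""
    start = t_min
    while start + train_len + val_len + test_len <= t_max:
        train = (start, start + train_len)
        val = (train[1], train[1] + val_len)
        test = (val[1], val[1] + test_len)
        yield train, val, test
        start += step
```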
6.6 Experimental Arms (What We Compare Against)
For meaningful research results, comparable baselines are required.
6.6.1 Baseline Models
- Random / Majority baseline
- Logistic regression on aggregated features
- GRU / LSTM (no macro tokens)
- Transformer without macro tokens
- Transformer with macro tokens (proposed)
Objective: not to "beat" baselines, but to demonstrate that incorporating "event sequences + macro context" provides structural informational value.
6.7 Hyperparameter Governance (Pre-Analysis Rule)
To avoid tuning bias, specify pre-registered search ranges before any evaluation, for example learning rate, number of layers, hidden width, and dropout, each restricted to a declared interval.
Final model selection:
- Based on time-split validation
- Retrospective selection from test set is prohibited
6.8 Reproducibility Requirements
Research is considered reproducible when it includes:
- Versioned dataset recipe (raw data need not be released, but construction formula must be)
- Config-locked experiment files (e.g., YAML / JSON)
- Recorded commit hash, random seed, hyperparameters, training logs
- Consistency verification (e.g., a deterministic hash of each constructed event sequence) confirming that sequences remain unchanged between runs
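One simple realization of such a consistency check, assuming event sequences are JSON-serializable lists of `(t, event)` pairs (the function name `sequence_fingerprint` is illustrative): hash a canonical serialization and compare fingerprints across runs or machines.

```python
import hashlib
import json

def sequence_fingerprint(merged_events):
    """Deterministic SHA-256 fingerprint of a constructed event sequence;
    equal fingerprints across runs show the dataset recipe reproduced exactly."""
    canonical = json.dumps(merged_events, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()
```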
6.9 Error & Risk Audit — What Can Go Wrong
For transparency, potential risks must be assessed:
- Regime mis-labeling
- Survivorship bias in stock universe
- Corporate actions causing price jumps
- Missing-event distortion
- Redundancy of correlated events
All items should be documented in a risk appendix.
6.10 Interpretation Scope & Ethical Boundary
This document specifies that:
- Results are for structural research purposes
- Not to be interpreted as profit prediction tools
- No direct causal claims are made
- This is pattern-relation research only
6.11 Connection to Final Sections
With Section 6 establishing the foundations for dataset construction, experimental protocol, and reproducibility, the next section addresses Limitations, Extensions & Future Research Directions.