Probabilistic Time Series Forecasting
July 14, 2025
Related Work on Probabilistic Forecasting
Traditionally, point or single-valued forecasting methods have been among the most common forecasting techniques due to their simplicity (Benidis et al. 2022). However, these methods provide no information about the uncertainty of their predictions, which can be a major disadvantage when the forecasts are used in decision-making. Hence, Gneiting and Katzfuss (2014) state that forecasts should take on a probabilistic form, as this enables the modelling of uncertainties in forecasts. The general goal is to produce a probabilistic estimate for \(p(x_{L+1:L+H}| x_{1:L})\). Moreover, this distribution can be represented equivalently by its probability density function (PDF), the cumulative distribution function (CDF) or its inverse, the quantile function (Benidis et al. 2022). In the following, we present various methods for generating probabilistic forecasts; precise formulations of the adopted approaches are omitted.
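The equivalence of these three representations can be illustrated for a Gaussian: the quantile function is the inverse of the CDF, so composing the two recovers the original value. A minimal Python sketch (the bisection-based inverse is purely illustrative):

```python
import math

def norm_pdf(x, mu=0.0, sigma=1.0):
    # Probability density function (PDF) of a Gaussian
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def norm_cdf(x, mu=0.0, sigma=1.0):
    # Cumulative distribution function (CDF) via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def norm_quantile(alpha, mu=0.0, sigma=1.0, lo=-10.0, hi=10.0, tol=1e-10):
    # Quantile function = inverse CDF, here inverted numerically by bisection
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid, mu, sigma) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Since the three views are connected by differentiation and inversion, a forecast expressed in any one of them carries the same distributional information.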
Parametric Distributional Forecasting.
A common approach to produce an estimate for \(p(x_{L+1:L+H} | x_{1:L})\) is via parametric distributional forecasting, where models typically output location and spread parameters of a pre-chosen probability distribution and are trained by maximizing the log-likelihood with respect to the ground-truth \(x_{L+1:L+H}\) (Bergsma et al. 2022). For instance, an early NN-based example is the work of Nix and Weigend (1994), where neural networks were trained to output the mean \(\hat{\mu}\) and variance \(\hat{\sigma}^2\) of a Gaussian distribution for regression tasks. While Gaussian likelihoods are common, alternative distributions such as Student-t (Alexandrov et al. 2020), negative binomial (Salinas et al. 2020) or Gaussian mixture distributions (Mukherjee et al. 2018) have been used depending on the statistical properties of the data. DeepAR (Salinas et al. 2020), an IMS CI RNN-based approach, adopts parametric distributional forecasting with a negative binomial distribution to model demand data. Building on this, VQ-AR (Rasul et al. 2022) combines the DeepAR backbone with a Vector Quantized-Variational Autoencoder (VQ-VAE) architecture (Van den Oord et al. 2017), introducing a discrete latent bottleneck that captures recurring temporal patterns in probabilistic forecasting. Extending this idea to Transformer-based architectures, VQ-TR (Rasul et al. 2023) integrates VQ-VAE into the attention mechanism of Transformers. Lastly, CNN-based models like BiTCN (Sprangers et al. 2023) and DeepTCN (Chen et al. 2019) also leverage parametric distributional forecasting but with a DMS CD strategy. A disadvantage of these approaches is that they require an a priori choice of the distributional form and are limited to the parametric assumptions, which may not capture complex data dynamics. To address the limitations of fixed-form parametric models, non-parametric approaches have gained interest in the community.
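For illustration, the Gaussian negative log-likelihood minimized by such a distributional head can be sketched in a few lines of numpy (a generic sketch in the spirit of Nix and Weigend (1994), not the loss of any specific model cited above):

```python
import numpy as np

def gaussian_nll(mu, sigma, x):
    """Negative log-likelihood of observations x under N(mu, sigma^2).
    Minimizing this loss trains a network head that outputs (mu, sigma)."""
    var = sigma ** 2
    return np.mean(0.5 * np.log(2.0 * np.pi * var) + (x - mu) ** 2 / (2.0 * var))
```

In practice the head usually outputs \(\log \hat{\sigma}\) (or applies a softplus) so that the predicted scale stays positive during optimization.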
Flexible Density Estimation.
Among non-parametric approaches, normalizing flows (Tabak and Turner
2013; Papamakarios et al.
2021)
have emerged
as a powerful tool for flexible density estimation. Normalizing flows,
such as Real NVP (Dinh et al. 2017) and Masked
Autoregressive Flow (MAF) (Papamakarios
et al. 2017), transform a simple base distribution, e.g. an
isotropic Gaussian, into a complex target distribution through a series
of invertible and differentiable mappings. Invertibility ensures the
preservation of probability mass and enables the evaluation of the
corresponding density function at all points (Benidis et al. 2022). A unified
description of normalizing flows and their core principles is provided
by Papamakarios et al. (2021). For time series forecasting,
Rasul et al.
(2020) combine IMS backbones, e.g. RNN
and Transformer, with conditioned normalizing flows to capture
multivariate temporal dependencies without restrictive parametric
assumptions. Furthermore, MANF (Feng, Miao, Xu, et al. 2024) combines
conditioned normalizing flows with multi-scale attention and relative
positional encoding to model multivariate dependencies efficiently in a
DMS fashion. TACTiS and TACTiS-2 (Drouin et al.
2022; Ashok
et al. 2023) introduce an IMS Transformer-based Copula model
using Deep Sigmoidal Flows (Huang et al. 2018) to estimate marginal
CDFs. Although flow-based models retain tractable likelihood computation
via the change of variables formula, they have problems with discrete
data distributions, often present in TSF applications (e.g., sales data)
(Rasul et al.
2020). Moreover, learning highly flexible continuous
distributions for discrete data may encourage learning distributions
with spiking densities at each possible discrete value (Uria et al.
2013). Altogether, this hinders training the models to
maximize the likelihood (Bergsma et al. 2022). To address this,
dequantization (Rasul et al. 2020), e.g. adding Uniform[0,1] noise, may be
applied to bound the log-likelihood (Theis et al. 2016). However, this
assumes that the discrete nature of the series is known in advance and
that the potential loss in precision is justified relative to the
advantages offered by continuous models (Bergsma et al. 2022). In contrast to
flow-based models, C2FAR (Bergsma et al. 2022) represents time
series variables through a hierarchical sequence of categorical
distributions, built on top of the DeepAR framework. Instead of relying
on fixed parametric forms, C2FAR generates increasingly finer intervals
of support in an autoregressive manner, where each step is conditioned
on coarser previous intervals. This hierarchical discretization allows
it to better capture multi-modal behaviors and extreme values compared
to flat binning or standard parametric approaches. Building on this,
SutraNets (Bergsma et al.
2023) extend the idea to long-term time series problems by
dividing inputs into frequency-based sub-series, which is different from
the regular patching of adjacent time steps as done in models like
PatchTST (Nie
et
al.
2022) or SegRNN (Lin et al. 2023). Each sub-series in
SutraNets is modeled by its own C2FAR-LSTM.
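The change-of-variables principle underlying all of these flows can be illustrated with a single affine transformation; a minimal numpy sketch (real models such as Real NVP or MAF stack many learned invertible layers):

```python
import numpy as np

def base_logpdf(z):
    # Log-density of the standard Gaussian base distribution
    return -0.5 * (z ** 2 + np.log(2.0 * np.pi))

def affine_flow_logpdf(y, scale, shift):
    """Density of y = scale * z + shift via the change-of-variables formula:
    log p_Y(y) = log p_Z(f^{-1}(y)) + log |d f^{-1}/d y|."""
    z = (y - shift) / scale                        # invert the mapping
    return base_logpdf(z) - np.log(np.abs(scale))  # Jacobian correction
```

Because the map is invertible, the density of any point in the target space can be evaluated exactly, which is what makes maximum-likelihood training of flows tractable.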
Generative Diffusion Models.
While normalizing flows offer tractable likelihoods and exact sampling through invertible transformations, their structural constraints, particularly invertibility and the need for tractable Jacobians, can limit their expressiveness in modeling complex, high-dimensional, and multimodal distributions (Benidis et al. 2022). In contrast, energy-based models (EBMs) relax these constraints by modeling unnormalized log-probabilities, especially important in high-dimensional spaces (LeCun et al. 2006). However, EBMs are notoriously difficult to train due to challenges in sampling and normalization (Du and Mordatch 2019). Diffusion probabilistic models (Ho et al. 2020; Sohl-Dickstein et al. 2015; Graikos et al. 2022), such as the well-known denoising diffusion probabilistic model (DDPM) (Ho et al. 2020), can be viewed as a practical compromise: they implicitly learn energy gradients via score-based training (Hyvärinen 2005; Song et al. 2021) and enable stable sampling using Langevin-like denoising processes (Neal 2011; Welling and Teh 2011). On a high level, diffusion probabilistic models operate by first applying a forward process that gradually corrupts data into noise, followed by a reverse process that reconstructs the original data from the noise (Ho et al. 2020). Over the last few years, diffusion-based generative models have emerged as strong generative tools, achieving SOTA performance in text generation (X. Li et al. 2022), audio (Kong et al. 2020) and image synthesis (Dhariwal and Nichol 2021). In the context of TSF, diffusion-based models often guide generation by conditioning on partial observations, reference samples, decompositions or architectural priors such as RNNs, transformers or state space models (Rasul et al. 2021; Tashiro et al. 2021; Alcaraz and Strodthoff 2022; Shen and Kwok 2023; Liu et al. 2024; Shen et al. 2023). For instance, TimeGrad (Rasul et al.
2021) uses RNN hidden states, while TimeDiff (Shen and Kwok 2023) incorporates task-specific conditioning like future mixup and autoregressive initialization. On the other hand, mr-Diff (Shen et al. 2023) leverages the multi-scale structure of time series by conditioning the denoising process on progressively refined trends, starting from coarse to fine levels. Similarly, TMDM (Li et al. 2023) and RATD (Liu et al. 2024) integrate transformer-based and retrieval-augmented conditioning, respectively. Going one step further, D\(^3\)M (Yan et al. 2024) introduces a decomposable denoising diffusion framework that unifies continuous flow models and diffusion models, achieving high-speed generation with fewer diffusion steps. Alternatively, TSDiff (Kollovieh et al. 2023) mitigates explicit conditioning during training by adopting an unconditional framework, using a self-guidance mechanism at inference time to adapt to tasks like forecasting and refinement without auxiliary networks. Contrary to previous approaches, DSPD-GP (Biloš et al. 2023) treats the time series as continuous functions rather than discrete measurements, defining diffusion not over discrete vectors but over functions, enabling direct handling of irregularly-sampled time series. Moreover, their Stochastic Process Diffusion framework applies diffusion in function space, using correlated noise from Gaussian processes to preserve temporal continuity and handle irregular sampling. On the other hand, D\(^3\)VAE (Y. Li et al. 2022) proposes a bidirectional VAE augmented with a coupled diffusion process, which simultaneously diffuses input and target series to reduce uncertainty. It further integrates denoising score matching and disentangled latent variables to improve interpretability and robustness, demonstrating strong performance on short and noisy time series.
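The forward corruption process shared by DDPM-style models admits a closed form, sketched below in numpy under an assumed linear variance schedule (the hyperparameters are illustrative, not those of any cited model):

```python
import numpy as np

def make_schedule(T=100, beta_min=1e-4, beta_max=0.02):
    # Linear variance schedule; alpha_bar[t] = prod_{s<=t} (1 - beta_s)
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bar, rng):
    """Closed-form forward (noising) process of a DDPM:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    The reverse model is trained to predict eps from (x_t, t)."""
    eps = rng.standard_normal(np.shape(x0))
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps
```

As \(t\) grows, \(\bar{\alpha}_t\) decays toward zero and \(x_t\) approaches pure noise; conditional forecasters then steer the learned reverse process with context such as RNN hidden states or retrieved references.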
Latent Generative Methods.
While diffusion-based models have gained traction in generative
modeling for their flexibility and ability to capture complex,
multimodal distributions, they often suffer from high computational cost
and slow sampling, limitations that can be especially problematic in
time-sensitive or resource-constrained forecasting scenarios (Yeğin and
Amasyalı 2024; Yang et al. 2024). In
contrast,
latent-variable approaches such as Variational Autoencoders and
(probabilistic) State Space Models offer a compelling alternative by
trading off expressivity for faster inference and improved
interpretability (Tong et al.
2022; Bézenac et al.
2020).
Variational Autoencoders (VAEs) (Kingma and
Welling 2014; Rezende et al. 2014)
simplify generative
modeling by learning to represent complex data distributions in a
lower-dimensional latent space using variational inference. In
variational inference, the main idea is to approximate the true
distribution with a simpler distribution, e.g. Gaussian, and minimize
the Kullback-Leibler (KL) divergence, shown in the upcoming Section [sec:related_work_Prob_Scores]
in Equation [eq:KL], between the approximate
and true
distribution, which is equivalent to maximizing the evidence lower
bound (ELBO) (Yeğin and
Amasyalı 2024). Broadly speaking, VAEs first encode the data
into a lower-dimensional latent space where a simpler probabilistic
model can be imposed, then forecasts are generated by decoding samples
drawn from this latent distribution back into the observation space.
Compared to other likelihood-based models like normalizing flows or
energy-based models, VAEs offer efficient, tractable sampling and
readily accessible inference via encoder networks (Vahdat and Kautz 2020). Recent advances
in other generative tasks, e.g. image generation (Vahdat and Kautz
2020; Luo et al. 2025), have
motivated an
adoption in time series forecasting tasks. For instance, the Temporal
Latent AutoEncoder (TLAE) (Nguyen and Quanz 2021) introduces a
nonlinear factorization framework for multivariate time series, enabling
end-to-end learning of complex latent dynamics while preserving
scalability. Similarly, VSMHN (Li et al. 2021), based on the
conditional VAE (Sohn et al. 2015), handles asynchronous
event-driven data with aligned time encodings to jointly forecast across
heterogeneous temporal sources. To better address long-range
dependencies and structural interpretability, PDTrans (Tong et al.
2022) fuses Transformer-based temporal modeling with a
VAE-based latent decomposition, providing interpretable forecasts
through trend and seasonality disentanglement. The Latent Diffusion
Transformer (LDT) (Feng, Miao, Zhang, et al. 2024)
compresses high-dimensional multivariate series into latent
representations using a statistics-aware autoencoder and generates
forecasts via a diffusion-based generator with self-conditioning.
Likewise, the D\(^3\)VAE (Y. Li et al.
2022) model fuses the Nouveau VAE (Vahdat and Kautz 2020) with a coupled
diffusion process and multiscale denoising to enhance robustness under
limited or noisy data.
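The two terms of the VAE objective can be sketched for diagonal-Gaussian encoders and decoders, where the KL regularizer has a closed form; a minimal numpy illustration (unit-variance decoder assumed):

```python
import numpy as np

def kl_gauss_standard(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    the regularizer in the VAE evidence lower bound."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def elbo(x, x_recon, mu, logvar):
    # ELBO = Gaussian reconstruction log-likelihood (unit variance) - KL term
    rec = -0.5 * np.sum((x - x_recon) ** 2 + np.log(2.0 * np.pi))
    return rec - kl_gauss_standard(mu, logvar)
```

Maximizing this lower bound pulls the approximate posterior toward the prior while keeping reconstructions accurate, which is the trade-off all the VAE-based forecasters above inherit.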
State Space Models (SSMs) (Hyndman et al.
2002; Seeger et al. 2016; Durbin and Koopman
2012) provide a framework for modeling sequential data via
latent variables that evolve over time according to structured
transition dynamics and generate observations through stochastic
emission processes. Commonly implemented deterministic time series
forecasting methods such as Exponential Smoothing (Hyndman et al.
2002; Hyndman et al. 2008), ARIMA
(Box and Pierce
1970), and LSTMs (Hochreiter and Schmidhuber 1997) can all
be interpreted as special cases or deterministic approximations of SSMs
(Durbin
and Koopman
2012). A classic probabilistic instance is the
linear-Gaussian SSM (Roweis and Ghahramani 1999), which
offers closed-form solutions for filtering, smoothing, and likelihood
computation, enabling efficient handling of missing data, and tractable
multi-step forecasting with full uncertainty quantification (avoiding
error accumulation) (Bézenac et
al. 2020). Furthermore, in their probabilistic form, SSMs
define joint distributions over latent states and observations, such as
in Gaussian Process SSMs (Ko and Fox 2011;
Deisenroth et
al. 2012). This enables uncertainty quantification through
inference techniques like Kalman filtering (Kalman 1960), treating the hidden state
as a distribution that is recursively updated and propagated to generate
predictive distributions over future observations. In probabilistic TSF,
recent works have explored hybrid models that combine the structural
benefits of SSMs with the expressivity of deep learning, where one
prominent direction is the use of deep neural networks to parameterize
SSM components. For example, DeepState (Rangapuram et al. 2018) employs an RNN
to learn a shared global mapping from covariates to the parameters of a
linear-Gaussian SSM, in which each individual time series has an
associated linear Gaussian SSM whose parameters are dynamically
generated by the RNN. Similarly, DNLSSM (Du et al. 2023) integrates LSTM networks
with the unscented Kalman filter (Julier and Uhlmann 2004) to better model
non-linear dynamics. Extending these ideas to attention-based temporal
dependencies, the Probabilistic Transformer (ProTran) (Tang and
Matteson 2021) replaces recurrent components with a
Transformer architecture, enabling the model to learn non-Markovian
dependencies in the latent space via self-attention. From a different
standpoint, PR-SSM (Doerr et al.
2018) substitutes the parametric transition dynamics with
Gaussian Processes, where training relies on doubly stochastic
variational inference. To address the restrictive assumption of
Gaussianity, flow-based SSMs have been proposed. For instance, the
Normalizing Kalman Filter (NKF) (Bézenac et
al. 2020) augments a linear-Gaussian SSM by integrating
normalizing flows into the observation model. This allows the latent
state dynamics to remain analytically tractable via Kalman filtering,
while enabling the observation distribution to flexibly model complex,
non-Gaussian behavior. Similarly, EMSSM (Sun et al. 2022) implements an external
memory mechanism and conditional normalizing flows, enhancing the
model’s capacity to capture long-range dependencies and adapt to
distributional shifts.
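The filtering recursion at the core of linear-Gaussian SSMs can be sketched in a few lines of numpy (a textbook Kalman step, not the implementation of any cited model):

```python
import numpy as np

def kalman_step(m, P, y, A, Q, H, R):
    """One predict/update cycle of the Kalman filter for the
    linear-Gaussian SSM  z_t = A z_{t-1} + q,  y_t = H z_t + r."""
    # Predict: propagate the latent state distribution one step forward
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    # Update: condition the prediction on the new observation y
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    m_new = m_pred + K @ (y - H @ m_pred)
    P_new = P_pred - K @ S @ K.T
    return m_new, P_new
```

Hybrid models such as DeepState keep exactly this recursion and let a neural network emit the matrices \(A, Q, H, R\) per series and time step.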
Generative Adversarial Paradigms.
Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), on the other hand, bypass likelihood estimation altogether, directly learning to generate realistic samples through adversarial training. Furthermore, GANs typically consist of two NNs, a generator and a discriminator. In their min-max adversarial training game, the generator creates synthetic samples while the discriminator is tasked with distinguishing between artificial samples from the generator and real samples (S. Wu et al. 2020). GAN-based models have demonstrated impressive results in various generative tasks, such as image generation (Jiang et al. 2021) and audio synthesis (Donahue et al. 2018). Similarly, GANs' primary focus in time series applications also lies in time series synthesis and generation, e.g. see TimeGAN (Yoon et al. 2019), GT-GAN (Jeon et al. 2022) or GAN-based approaches in the Time Series Generative Modeling benchmark (Nikitin et al. 2024). Nevertheless, GAN-based probabilistic time series forecasting approaches have been developed as well. For instance, the Adversarial Sparse Transformer (AST) (S. Wu et al. 2020) combines a sparse Transformer with adversarial training to improve time series forecasting. AST employs a generator-discriminator framework, where the generator learns sparse attention patterns for forecasting, while the discriminator ensures sequence-level fidelity. Furthermore, Koochali et al. (2021) introduce ProbCast, a probabilistic forecasting model based on a conditional GAN (Mirza and Osindero 2014). On a final note, despite their potential, GANs remain notoriously difficult to train, often facing challenges such as hyperparameter sensitivity, training instability, and mode collapse (Yeğin and Amasyalı 2024).
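The adversarial objectives of this min-max game can be sketched as plain loss functions; a minimal numpy illustration using the common non-saturating generator loss (architecture-specific details such as the sparse attention of AST are omitted):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy objective the discriminator minimizes:
    push scores for real samples toward 1 and for generated ones toward 0."""
    return -np.mean(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating generator objective: maximize the discriminator's
    # score on generated samples, i.e. learn to fool it
    return -np.mean(np.log(d_fake))
```

Note that neither loss references a data likelihood; training signal comes entirely from the discriminator's scores, which is both the appeal of GANs and the source of their instability.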
Quantile-based Forecasting.
In practice, specifying an exact (parametric) distribution is often
unnecessary. Instead, estimating a few key quantiles is sufficient for
making optimal decisions, e.g. in epidemiologic forecasting (Bracher et al.
2021; Ray et
al. 2020), wind power forecasting (Wan et al. 2017) or the classical
newsvendor problem (Gneiting et al.
2023; Tarima and
Zenkova 2020; Harsha et al. 2021),
by helping quantify
uncertainty and minimize losses (Bracher et al.
2021; Wen
et al. 2018; Gneiting et al. 2023). As
a
result, a
common non-parametric approach to modeling \(p(x_{L+1:L+H} | x_{1:L})\) is via
the
quantile function, where the models are typically trained using the
quantile loss (see Equation [eq:QS] of
Section
[sec:related_work_Prob_Scores]).
Quantile regression methods (Koenker and
Bassett 1978; Koenker 2005) are
frequently applied in
forecasting, either by predicting a single quantile (for point
forecasts) or multiple quantiles at once (Wen et al.
2018). Effectively, this method approximates the quantile
function by targeting specific quantile levels (Benidis et al. 2022). Two notable
approaches of this kind are the Multi-horizon Quantile Recurrent
Forecaster (MQ-RNN) (Wen et al.
2018) and the Temporal Fusion Transformer (TFT) (Lim et al.
2021). MQ-RNN combines sequence-to-sequence NNs with quantile
regression and a DMS strategy. On the other hand, TFT leverages
attention mechanisms for interpretable DMS forecasting, using
specialized components like gating mechanisms and variable selection
networks to provide insights into temporal dynamics. Nevertheless, these
approaches require the specification of a set of quantile levels a
priori, necessitating retraining when querying new quantile levels
(Park
et al.
2022). As another option, Implicit Quantile Networks (IQNs)
(Dabney et al.
2018) learn to map samples from a uniform distribution
(Uniform[0,1]) to corresponding quantile values of the
target distribution, removing the need for predefined quantile levels.
This allows for continuous and arbitrarily fine-grained quantile
estimation. Building on this, several models adapt IQNs for time series
forecasting. For example, IQN-RNN (Gouttes et
al. 2021) combines the IQN framework with RNNs, enabling
autoregressive probabilistic forecasts without assuming a specific
distributional form. GQFormer (Jawed and Schmidt-Thieme 2022)
introduces a novel multi-task loss that encourages both sharpness and
diversity in quantile estimates, allowing the model to capture multiple
modes in complex multivariate joint distributions. Finally,
GMQ-forecaster (Wen and Torkkola
2019) integrates IQNs with multivariate copula modeling,
using IQNs for marginals and a learned copula to represent their
dependencies, resulting in a fully generative joint distribution.
Despite achieving SOTA performance in probabilistic TSF, the
aforementioned quantile-based models remain susceptible to the quantile
crossing problem, a violation of the inherent monotonicity of quantile
functions (Park et al.
2022). Specifically, when different quantile levels (e.g.,
\(\alpha_1< \alpha_2\)) are modeled
independently, for instance via separate functions (\(r_{\alpha_1}\) and \(r_{\alpha_2}\)), it is possible to
encounter inputs \(x\) for which \(r_{\alpha_1}(x)>r_{\alpha_2}(x)\),
thereby breaking the non-decreasing property of quantile functions (Gasthaus
et
al. 2019). Therefore, as an alternative to traditional
quantile regression, one can directly model the entire quantile function
by imposing a parametric structure instead, which requires the function
to be defined over the unit interval [0,1] and to be monotonically
increasing (Benidis et al.
2022). A practical way to satisfy these constraints is
through the use of linear splines, as demonstrated in the Spline
Quantile Function (SQF)-RNN framework (Gasthaus et
al. 2019), where an RNN is trained to produce the parameters
of a monotonic spline. By construction, this avoids quantile crossing
and allows the full quantile function to be recovered from a compact
parameterization. However, the SQF-RNN model has reduced flexibility in
tail regions (Park et al.
2022). Hence, Park et al. (2022)
propose Incremental (Spline) Quantile Functions, which extend SQF by
adding tail extrapolation strategies, such as exponential Pareto
distributions. Finally, the Multivariate Quantile Function Forecaster
(MQF\(^2\)) (Kan et al. 2022) models the multivariate
quantile function as the gradient of a convex function, parametrized via
partially input convex neural networks (Amos et al. 2017;
Huang et al.
2020).
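The quantile (pinball) loss used to train these models, and the crossing problem itself, can be illustrated in a few lines of numpy; the sorting repair shown is a naive heuristic, not the spline-based construction of SQF-RNN or the incremental quantile functions of Park et al. (2022):

```python
import numpy as np

def pinball_loss(y, y_hat, alpha):
    """Quantile (pinball) loss at level alpha; over samples of y its
    minimizer in y_hat is the empirical alpha-quantile."""
    diff = y - y_hat
    return np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff))

def fix_crossing(quantiles):
    # Naive post-hoc repair of quantile crossing: sort the predicted
    # values so they are non-decreasing in the quantile level
    return np.sort(quantiles)
```

Models like MQ-RNN and TFT minimize the pinball loss at several levels jointly; monotonic parameterizations avoid the need for any post-hoc repair by making crossing impossible by construction.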
Multivariate Dependencies.
The probabilistic models considered so far involve an additional layer of complexity: the presence of multivariate dependencies in modeling the conditional distribution \(p(\mathbf{x}_{L+1:L+H} \mid \mathbf{x}_{1:L})\). These dependencies can be broadly categorized along two primary axes.
-
Cross-series (multi-channel) dependencies, for instance involving the modeling of the multivariate distribution \(p(\mathbf{x}_t)\) across all \(N\) series at time step \(t\), with \(\mathbf{x}_t \in \mathbb{R}^N\). Methods that model these dependencies are often simply classified as multivariate probabilistic forecasting methods (Benidis et al. 2022).
-
Temporal dependencies, for example referring to the joint modeling of the future horizon for a single series, \(p(\mathbf{x}^{(i)}_{L+1:L+H})\), where \(\mathbf{x}^{(i)}_{L+1:L+H} \in \mathbb{R}^H\).
Models may capture neither, one, or both forms of dependencies. Nevertheless, in the following, our analysis focuses primarily on temporal dependencies. This emphasis is motivated by the observation that DMS models, albeit dominant in point forecasting (see Table [tab:ltsf]), generate all future steps in a single forward pass conditioned only on the input context. Moreover, this means that DMS models cannot naturally model how future points depend on each other, instead assuming independence between future steps (Taieb et al. 2012), unless such dependencies are explicitly modeled. In contrast, probabilistic IMS methods typically factorize the joint temporal distribution \(p(\mathbf{x}_{L+1:L+H} \mid \mathbf{x}_{1:L})\) into a product of conditional one-step distributions, as shown in Equation [eq:IMS] of Chapter [ch:preliminaries]. Therefore, they naturally model temporal dependencies as sequential step-by-step predictions. In addition to this, many point LTSF models adopt a CI design, explicitly excluding cross-series dependencies. As such, cross-series interactions are often not represented in the underlying model architecture, making it difficult to include the probabilistic multi-channel dependencies during the transition from point LTSF to probabilistic LTSF. Irrespective of the type of multivariate dependencies being modeled (cross-series, temporal, or both) various modeling strategies can be employed to approximate multivariate distributions. Two key challenges in modeling multivariate distributions are the positive-definiteness constraint and the quadratic complexity of estimating full covariance matrices, which requires \(O(N^2)\) parameters, where \(N=H\) in the case of temporal dependencies (Pourahmadi 2011). To address this, one approach is to impose structural constraints on the covariance matrix by using a low-rank plus diagonal decomposition, which ensures positive-definiteness and reduces the number of parameters (Y. Wu et al. 
2020; Horn and Johnson 2012). An alternative strategy is to bypass direct estimation of the covariance structure and instead use copulas, which provide a way to decouple the modeling of marginal distributions from their joint dependence structure (Wilson and Ghahramani 2010; Größer and Okhrin 2022). According to Sklar’s theorem (Sklar 1959), any multivariate distribution can be expressed in terms of its marginals and a copula function that captures the dependencies among variables. This decomposition is particularly useful in time series settings, where marginals can exhibit diverse characteristics (e.g., skewness, seasonality) and dependencies may be nonlinear and time-varying (e.g., temporal autocorrelation or inter-series interactions) (Salinas et al. 2019). GPVar (Salinas et al. 2019) is an example of a parametric copula approach, in which the marginals are estimated independently and then transformed into a latent Gaussian space, where dependencies are modeled using a Gaussian copula with a low-rank plus diagonal covariance structure. Although this reduces the parameter complexity from \(O(N^2)\) to \(O(N)\), parametric approaches make strong assumptions about the underlying data distribution (Ashok et al. 2023). To address this, nonparametric copula estimators such as TACTiS and TACTiS-2 (Drouin et al. 2022; Ashok et al. 2023) offer greater flexibility by learning the copula function directly from the data without assuming a fixed parametric form. TACTiS (Drouin et al. 2022) models the copula density autoregressively in the copula-transformed space using a transformer-based architecture, where each variable is conditioned on previously modeled ones. To ensure that the learned copula is valid i.e., invariant to the ordering of variables, TACTiS averages over multiple variable permutations during training. 
However, achieving full permutation invariance would require averaging over all possible orderings, which results in factorial complexity \(O(N!)\) in the number of variables (with \(N = H\) for temporal dependencies) (Ashok et al. 2023). TACTiS-2 (Ashok et al. 2023) mitigates this by implementing a two-stage training protocol to learn marginals and the copula. It first fits normalizing flows to estimate marginal CDFs, then uses a fixed-order attention decoder with causal masking to model the joint copula density. This reduces complexity from factorial \(O(N!)\) to linear \(O(N)\). Beyond the copula framework, a different direction involves directly learning the multivariate quantile function, the inverse CDF, as done by MQF\(^2\) (Kan et al. 2022). Furthermore, MQF\(^2\) represents the joint quantile function as the gradient of a convex function parameterized by an input convex neural network (ICNN) (Amos et al. 2017), a NN architecture designed to be convex with respect to a subset of its inputs through specific structural constraints. MQF\(^2\) avoids quantile crossing but introduces additional complexity during both training and inference, as evaluating the quantile function involves solving an optimization problem. Lastly, several generative approaches naturally extend to jointly model either or both of cross-series and temporal dependencies by learning full multivariate distributions. For instance, this includes diffusion (TMDM (Li et al. 2023)), VAE (D\(^3\)VAE (Y. Li et al. 2022)) or GAN (AST (S. Wu et al. 2020)) models.
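The low-rank plus diagonal parameterization discussed above can be sketched in numpy; positive-definiteness is automatic for a positive diagonal, and sampling never needs the full \(N \times N\) matrix (a generic sketch, not the GPVar implementation):

```python
import numpy as np

def lowrank_cov(d, V):
    """Covariance Sigma = diag(d) + V V^T with a rank-r factor V (N x r):
    O(N r) parameters instead of the O(N^2) of a full covariance,
    positive-definite whenever d > 0."""
    return np.diag(d) + V @ V.T

def sample_lowrank(mu, d, V, rng):
    # Draw from N(mu, diag(d) + V V^T) without forming Sigma explicitly:
    # independent noise per dimension plus r shared factors
    eps_diag = rng.standard_normal(d.shape)
    eps_rank = rng.standard_normal(V.shape[1])
    return mu + np.sqrt(d) * eps_diag + V @ eps_rank
```

The shared factors \(V\) capture correlated movement across the \(N\) variables (series or horizon steps), while the diagonal term absorbs idiosyncratic variance.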
LTSF and Probabilistic Forecasting.
So far, we have seen two primary branches of time series forecasting:
long-term point forecasting methods and probabilistic forecasting
approaches. However, integrating both approaches into long-term
distributional forecasting is still a significant open challenge (Zhang et al.
2024). Furthermore, in their analysis Zhang
et
al. (2024) find
that point LTSF methods tend to focus on a DMS strategy, due to error
accumulation effects of IMS methods on longer forecasting horizons.
Additionally, their analysis shows that probabilistic TSF methods adopt
either IMS or DMS strategies without a clear preference, which they
attribute to shorter forecasting horizons. Looking at Table [tab:ltsf] and Table [tab:prob_tsf], we come to the same
conclusion. Furthermore, although probabilistic methods consider mostly
the same datasets as point LTSF methods (Kollovieh et al.
2023; Rasul et
al. 2023; Drouin et al. 2022; Tong et al.
2022), there are only a few probabilistic methods listed in
Table [tab:prob_tsf], where
the
forecasting
horizon \(H\) is larger than 300. In
contrast, the longest forecasting horizon in the default LTSF setup is
720 (Zhou et al.
2021). Nevertheless, recent studies have begun to explore
modeling longer-term dependencies in probabilistic TSF. For instance,
SSSD (Alcaraz
and Strodthoff 2022) includes a long-term forecasting
experiment inspired by the point-based LTSF framework of Zhou
et al. (2021).
However, their evaluation in this setting focuses solely on point
predictions, neglecting the probabilistic forecasting performance. More
recently, SutraNets (Bergsma et al.
2023) and RATD (Liu et al.
2024) have investigated probabilistic forecasting over
extended horizons. While SutraNets consider long horizons (e.g., \(H > 300\)),
those are limited to the
MNIST dataset (LeCun et al.
1998), which is not a standard benchmark in time series
forecasting. In contrast, RATD (Liu et al.
2024) is evaluated under a more classical LTSF setting,
comparing performance against established point forecasting models, such
as iTransformer, Informer, PatchTST, TimesNet, and DLinear (Liu et al.
2023; Zhou et
al. 2021; Nie et
al. 2022; Wu et
al. 2022; Zeng et
al. 2023), as well as probabilistic models like CSDI and
D\(^3\)VAE (Tashiro
et al.
2021; Y. Li
et al. 2022). However, point LTSF models in this evaluation
are not adapted nor interpreted within a probabilistic framework. From a
different standpoint, several model-agnostic approaches have been
proposed to derive probabilistic forecasts from point forecasting
methods. VQ-TR (Rasul et al.
2023), for example, is evaluated in a probabilistic setting
alongside transformer-based methods that were originally proposed for
point LTSF, e.g. Informer, Autoformer and PatchTST (Zhou et al.
2021; Wu et
al. 2021; Nie et
al. 2022). To enable probabilistic outputs, Rasul et al. (2023)
modify deterministic TSF architectures to either implement a parametric
distributional forecasting approach with a negative binomial or
Student-t distribution or implement an IQN head. However, the
forecasting horizons considered are relatively short and not directly
comparable to common LTSF settings. TMDM (Li et al.
2023) presents a plug-and-play framework compatible with
arbitrary point forecasting models. Its core idea is to leverage the
strength of LTSF models in estimating the conditional mean, using this
to guide the generation of full predictive distributions via diffusion.
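In spirit, this decomposition can be caricatured without the diffusion machinery: the point model supplies the conditional mean, and a separate residual model supplies the spread around it. A minimal sketch, with a single Gaussian residual scale standing in for TMDM's learned diffusion component (an illustrative simplification, not the actual method):

```python
import random
import statistics

def fit_residual_scale(point_preds, targets):
    """Estimate a residual scale on held-out data.

    Stand-in for the learned conditional distribution that TMDM obtains
    via diffusion; here it is just the standard deviation of the errors
    made by the point forecaster.
    """
    residuals = [t - p for p, t in zip(point_preds, targets)]
    return statistics.pstdev(residuals)

def sample_forecasts(point_pred, scale, n_samples=100, seed=0):
    """Predictive samples = deterministic mean + stochastic residuals."""
    rng = random.Random(seed)
    return [point_pred + rng.gauss(0.0, scale) for _ in range(n_samples)]
```

Any point LTSF backbone can play the role of `point_preds` here, which is what makes such mean-guided schemes plug-and-play.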
TMDM integrates architectures such as Informer, Autoformer or NSformer
(Zhou et al.
2021; Wu et
al. 2021; Liu et al. 2022) as
backbones, but also
limits forecasting to horizons of at most 192 steps. Finally,
ProbCast (Koochali et al.
2021) introduces a GAN-based framework for transforming point
forecasting models into probabilistic ones. However, their evaluation
also does not address extended forecasting horizons and is restricted to
a simple RNN-based backbone.
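Of these conversions, the parametric-head route is the lightest-weight: the backbone's final projection is replaced by one that emits distribution parameters, trained by negative log-likelihood. A minimal Gaussian sketch (the Gaussian is chosen for brevity; the works cited above use negative binomial or Student-t heads):

```python
import math

def gaussian_nll(mu, log_sigma, y):
    """Negative log-likelihood of y under N(mu, exp(log_sigma)^2).

    Predicting log_sigma instead of sigma keeps the scale positive
    without any explicit constraint.
    """
    sigma = math.exp(log_sigma)
    return 0.5 * math.log(2.0 * math.pi) + log_sigma + 0.5 * ((y - mu) / sigma) ** 2

def distributional_head(hidden, w_mu, w_log_sigma):
    """Toy linear head mapping a backbone's hidden features to (mu, log_sigma).

    In practice this is one extra linear layer per horizon step on top of
    an otherwise unchanged point forecasting architecture.
    """
    mu = sum(h * w for h, w in zip(hidden, w_mu))
    log_sigma = sum(h * w for h, w in zip(hidden, w_log_sigma))
    return mu, log_sigma
```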
To summarize, in the domain of probabilistic TSF, the diversity of forecasting setups, e.g. with respect to prediction horizons (see Table [tab:prob_tsf]), makes it challenging to establish consistent benchmarks or to perform definitive model comparisons across studies. Consequently, we do not identify a clear state of the art for probabilistic forecasting, as we do for point LTSF, since the studies tend to be more isolated and less directly comparable. Moreover, as discussed, only a limited number of studies have explored probabilistic methods within the LTSF setting (Alcaraz and Strodthoff 2022; Bergsma et al. 2023; Liu et al. 2024). Likewise, only a few works have attempted to extend point LTSF architectures to support probabilistic outputs (Rasul et al. 2023; Li et al. 2023; Koochali et al. 2021). Hence, in this work we aim to bridge this gap by evaluating point LTSF methods in a probabilistic setting, focusing on longer forecasting horizons than prior studies and assessing them with probabilistic metrics. In doing so, we also analyze the behavior of IMS and DMS strategies in more depth, identifying cases where DMS methods fail to yield reliable probabilistic forecasts. Properly training and evaluating probabilistic methods requires suitable evaluation metrics, which are the subject of the following and final section of the related work.
| Model | Venue | IMS/DMS | Backbone | CI/CD | H | Prob. Method |
|---|---|---|---|---|---|---|
| MQ-RNN/MQ-CNN (Wen et al. 2018) | NeurIPS’17 | DMS | CNN/ RNN | CI | 52 | Quantiles |
| DeepState (Rangapuram et al. 2018) | NeurIPS’18 | DMS | SSM/ RNN | CI | 168 | Prob. Latent Variable |
| GP-Copula (Salinas et al. 2019) | NeurIPS’19 | IMS | RNN | CD | 30 | Distributional & Copula |
| GMQ (Wen and Torkkola 2019) | ICML’19 | IMS | CNN/ RNN | CD | 30 | IQN & Copula |
| SQF-RNN (Gasthaus et al. 2019) | AISTATS’19 | IMS | RNN | CI | 60 | Quantile Function |
| DeepTCN (Chen et al. 2019) | MileTS’19 | DMS | CNN | CD | 31 | Distributional/ Quantiles |
| AST (S. Wu et al. 2020) | NeurIPS’20 | IMS/DMS | Transformer(E-D) | CI | 168 | GAN |
| DeepAR (Salinas et al. 2020) | IJF’20 | IMS | RNN | CI | 52 | Distributional |
| ProTran (Tang and Matteson 2021) | NeurIPS’21 | DMS | SSM/Transformer(E) | CD | 30 | Prob. Latent Variable |
| CSDI (Tashiro et al. 2021) | NeurIPS’21 | DMS | Transformer(E) | CD | 24 | Diffusion |
| TimeGrad (Rasul et al. 2021) | ICML’21 | IMS | RNN | CD | 30 | Diffusion |
| IQN-RNN (Gouttes et al. 2021) | ICML’21 | IMS | RNN | CI | 30 | IQN |
| LSTM-Real-NVP/ Transformer-MAF (Rasul et al. 2020) | ICLR’21 | IMS | RNN/ Transformer(E-D) | CD | 30 | Flow |
| VSMHN (Li et al. 2021) | AAAI’21 | IMS | RNN | CD | 24 | VAE |
| TLAE (Nguyen and Quanz 2021) | AAAI’21 | DMS | AE/RNN | CD | 24 | VAE |
| ProbCast (Koochali et al. 2021) | Eng. Proc.’21 | DMS | RNN | CD | 24 | GAN |
| TFT (Lim et al. 2021) | IJF’21 | DMS | RNN | CI | 30 | Quantiles |
| D\(^3\)VAE (Y. Li et al. 2022) | NeurIPS’22 | DMS | MLP | CI | 64 | Diffusion & VAE |
| C2FAR (Bergsma et al. 2022) | NeurIPS’22 | IMS | RNN | CD | 48 | Distributional |
| TACTiS (Drouin et al. 2022) | ICML’22 | IMS | Transformer(E) | CI | 72 | Flow & Copula |
| EMSSM (Sun et al. 2022) | IJCAI’22 | IMS | SSM & RNN | CI | 30 | Flow & Prob. Latent Variables |
| MQF\(^2\) (Kan et al. 2022) | AISTATS’22 | DMS | RNN & PICNN (Amos et al. 2017) | CD | 24 | Quantile Function |
| ISQF (Park et al. 2022) | AISTATS’22 | DMS | CNN | CI | 30 | Quantile Function |
| DNLSSM (Du et al. 2023) | CAAI’22 | IMS | RNN & SSM | CI | 196 | Prob. Latent Variable |
| GQFormer (Jawed and Schmidt-Thieme 2022) | IEEE Big Data’22 | DMS | Transformer(E) | CD | 168 | IQN |
| VQ-AR (Rasul et al. 2022) | - | IMS | RNN | CI | 30 | Distributional |
| TSDiff (Kollovieh et al. 2023) | NeurIPS’23 | DMS | SSM | CI | 48 | Diffusion |
| SutraNets (Bergsma et al. 2023) | NeurIPS’23 | IMS | RNN | CI | 392 | Distributional |
| DSDP-GP (Biloš et al. 2023) | ICML’23 | DMS | RNN | CI | 30 | Diffusion |
| mr-diff (Shen et al. 2023) | ICLR’24 | DMS | MLP | CD | 24 | Diffusion |
| SSSD (Alcaraz and Strodthoff 2022) | TMLR’23 | DMS | SSM | CD | 672 | Diffusion |
| pTSE (Zhou et al. 2023) | IJCAI’23 | DMS | Ensemble | CD | 24 | HMM |
| BiTCN (Sprangers et al. 2023) | IJF’23 | DMS | CNN | CD | 30 | Distributional |
| PDTrans (Tong et al. 2022) | SDM’23 | IMS & DMS | Transformer(E-D) | CI | 168 | Prob. Latent Variable |
| RATD (Liu et al. 2024) | NeurIPS’24 | DMS | Transformer(E)/RAG/ DiffWave | CD | 336 | Diffusion |
| D\(^3\)M (Yan et al. 2024) | ICML’24 | DMS | SSM & WaveNet (CNN) | CD | 30 | Diffusion & Flow |
| TACTiS-2 (Ashok et al. 2023) | ICLR’24 | IMS | Transformer(E) | CI | 72 | Flow & Copula |
| TMDM (Li et al. 2023) | ICLR’24 | DMS | Transformer(E) | CD | 192 | Diffusion |
| VQ-TR (Rasul et al. 2023) | ICLR’24 | IMS | Transformer(E-D) | CI | 48 | Distributional |
| LDT (Feng, Miao, Zhang, et al. 2024) | AAAI’24 | DMS | Transformer(E-D)-AE | CD | 48 | Diffusion |
| MANF (Feng, Miao, Xu, et al. 2024) | IEEE TKDE’24 | DMS | Transformer(E) | CD | 30 | Flow |
| GPF-WI (Wang et al. 2024) | CISS’24 | DMS | AE | CD | 24 | VAE |
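Many of the methods in the table produce forecasts only in the form of samples (diffusion, GAN, VAE and flow models), so the metrics discussed in the next section must be estimable from samples. As one concrete example, the CRPS admits the well-known energy-form estimator \(\mathrm{CRPS}(F, y) \approx \frac{1}{n}\sum_i |x_i - y| - \frac{1}{2n^2}\sum_{i,j} |x_i - x_j|\); a direct sketch:

```python
def crps_empirical(samples, y):
    """Sample-based CRPS estimate via the energy form:
    E|X - y| - 0.5 * E|X - X'|, with X, X' drawn from the forecast.
    """
    n = len(samples)
    term1 = sum(abs(x - y) for x in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2
```

For a degenerate (point) forecast the estimator reduces to the absolute error, and a perfectly sharp, correct forecast attains CRPS 0; the \(O(n^2)\) pairwise term is usually unproblematic for the sample counts used in practice.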