Probabilistic Time Series Forecasting

Kai Reffert

July 14, 2025

Related Work on Probabilistic Forecasting

Traditionally, point or single-valued forecasting methods have been among the most common forecasting techniques due to their simplicity (Benidis et al. 2022). However, these methods lack information about the uncertainty of their predictions, which can be a major disadvantage when the forecasts are used in decision-making. Hence, Gneiting and Katzfuss (2014) state that forecasts should take on a probabilistic form, as this enables the modelling of uncertainty in forecasts. The general goal is to produce a probabilistic estimate for \(p(x_{L+1:L+H}| x_{1:L})\). Moreover, this distribution can be represented equivalently by its probability density function (PDF), its cumulative distribution function (CDF) or the latter's inverse, the quantile function (Benidis et al. 2022). In the following, we present various methods for generating probabilistic forecasts; precise formulations of the adopted approaches are, however, not presented.

Parametric Distributional Forecasting.

A common approach to produce an estimate for \(p(x_{L+1:L+H} | x_{1:L})\) is parametric distributional forecasting, where models typically output the location and spread parameters of a pre-chosen probability distribution, and the parameters are fitted by maximizing the log-likelihood with respect to the ground truth \(x_{L+1:L+H}\) (Bergsma et al. 2022). For instance, an early NN-based example is the work of Nix and Weigend (1994), where neural networks were trained to output the mean \(\hat{\mu}\) and variance \(\hat{\sigma}^2\) of a Gaussian distribution for regression tasks. While Gaussian likelihoods are common, alternative distributions such as Student-t (Alexandrov et al. 2020), negative binomial (Salinas et al. 2020) or Gaussian mixture distributions (Mukherjee et al. 2018) have been used depending on the statistical properties of the data. DeepAR (Salinas et al. 2020), an IMS CI RNN-based approach, adopts parametric distributional forecasting with a negative binomial distribution to model demand data. Building on this, VQ-AR (Rasul et al. 2022) combines the DeepAR backbone with a Vector Quantized-Variational Autoencoder (VQ-VAE) architecture (Van den Oord et al. 2017), introducing a discrete latent bottleneck that captures recurring temporal patterns in probabilistic forecasting. Extending this idea to Transformer-based architectures, VQ-TR (Rasul et al. 2023) integrates VQ-VAE into the attention mechanism of Transformers. Lastly, CNN-based models like BiTCN (Sprangers et al. 2023) and DeepTCN (Chen et al. 2019) also leverage parametric distributional forecasting but with a DMS CD strategy. A disadvantage of these approaches is that they require an a priori choice of the distributional form and are limited to the parametric assumptions, which may not capture complex data dynamics. To address the limitations of fixed-form parametric models, non-parametric approaches have gained interest in the community.
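To make the training objective concrete, the following minimal Python sketch shows the per-step Gaussian negative log-likelihood that a Nix-and-Weigend-style network minimizes when it outputs \(\hat{\mu}\) and \(\hat{\sigma}\). All numeric values are illustrative, not taken from any cited model.

```python
import math

def gaussian_nll(mu, sigma, y):
    """Negative log-likelihood of y under N(mu, sigma^2) -- the per-step
    loss minimized in parametric distributional forecasting."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

# The observed error is 2, so a spread of sigma = 2 should score better
# than an overconfident sigma = 0.5 for the same location estimate.
overconfident = gaussian_nll(mu=10.0, sigma=0.5, y=12.0)
calibrated = gaussian_nll(mu=10.0, sigma=2.0, y=12.0)
assert calibrated < overconfident
```

Minimizing this loss rewards not only an accurate location estimate but also a well-calibrated spread, which is precisely the uncertainty information that point forecasts lack.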

Flexible Density Estimation.

Among non-parametric approaches, normalizing flows (Tabak and Turner 2013; Papamakarios et al. 2021) have emerged as a powerful tool for flexible density estimation. Normalizing flows, such as Real NVP (Dinh et al. 2017) and Masked Autoregressive Flow (MAF) (Papamakarios et al. 2017), transform a simple base distribution, e.g. an isotropic Gaussian, into a complex target distribution through a series of invertible and differentiable mappings. Invertibility ensures the preservation of probability mass and enables the evaluation of the corresponding density function at all points (Benidis et al. 2022). A unified description of normalizing flows and their core principles is provided by Papamakarios et al. (2021). For time series forecasting, Rasul et al. (2020) combine IMS backbones, e.g. RNN and Transformer, with conditioned normalizing flows to capture multivariate temporal dependencies without restrictive parametric assumptions. Furthermore, MANF (Feng, Miao, Xu, et al. 2024) combines conditioned normalizing flows with multi-scale attention and relative positional encoding to model multivariate dependencies efficiently in a DMS fashion. TACTiS and TACTiS-2 (Drouin et al. 2022; Ashok et al. 2023) introduce an IMS Transformer-based Copula model using Deep Sigmoidal Flows (Huang et al. 2018) to estimate marginal CDFs. Although flow-based models retain tractable likelihood computation via the change of variables formula, they struggle with discrete data distributions, which are often present in TSF applications (e.g., sales data) (Rasul et al. 2020). Moreover, learning highly flexible continuous distributions for discrete data may encourage learning distributions with spiking densities at each possible discrete value (Uria et al. 2013). Altogether, this hinders training the models to maximize the likelihood (Bergsma et al. 2022). To address this, dequantization (Rasul et al. 2020), e.g. adding Uniform[0,1] noise, may be applied to bound the log-likelihood (Theis et al. 2016).
However, this assumes that the discrete nature of the series is known in advance and that the potential loss in precision is justified relative to the advantages offered by continuous models (Bergsma et al. 2022). In contrast to flow-based models, C2FAR (Bergsma et al. 2022) represents time series variables through a hierarchical sequence of categorical distributions, built on top of the DeepAR framework. Instead of relying on fixed parametric forms, C2FAR generates increasingly finer intervals of support in an autoregressive manner, where each step is conditioned on coarser previous intervals. This hierarchical discretization allows it to better capture multi-modal behaviors and extreme values compared to flat binning or standard parametric approaches. Building on this, SutraNets (Bergsma et al. 2023) extend the idea to long-term time series problems by dividing inputs into frequency-based sub-series, which is different from the regular patching of adjacent time steps as done in models like PatchTST (Nie et al. 2022) or SegRNN (Lin et al. 2023). Each sub-series in SutraNets is modeled by its own C2FAR-LSTM.
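As a one-dimensional illustration of the change-of-variables principle underlying flow-based models, consider a single affine flow mapping a standard normal base distribution to a shifted, scaled target. This is a toy sketch, far simpler than any of the cited architectures.

```python
import math

def std_normal_logpdf(z):
    return -0.5 * (z * z + math.log(2 * math.pi))

def affine_flow_logpdf(x, a, b):
    """log-density of x under the flow x = a*z + b with base z ~ N(0, 1):
    log p(x) = log p_base(z) + log|dz/dx| (change-of-variables formula)."""
    z = (x - b) / a                                 # invert the flow
    return std_normal_logpdf(z) - math.log(abs(a))  # Jacobian term log|1/a|

# With a = 2 and b = 5 the flow reproduces the N(5, 4) density exactly.
x = 6.0
direct = -0.5 * ((x - 5.0) ** 2 / 4.0 + math.log(2 * math.pi * 4.0))
assert abs(affine_flow_logpdf(x, a=2.0, b=5.0) - direct) < 1e-12
```

Stacking many such invertible maps, each with a tractable Jacobian, is what lets Real NVP- or MAF-style models represent complex densities while retaining exact likelihoods.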

Generative Diffusion Models.

While normalizing flows offer tractable likelihoods and exact sampling through invertible transformations, their structural constraints, particularly invertibility and the need for tractable Jacobians, can limit their expressiveness in modeling complex, high-dimensional, and multimodal distributions (Benidis et al. 2022). In contrast, energy-based models (EBMs) relax these constraints by modeling unnormalized log-probabilities, which is especially important in high-dimensional spaces (LeCun et al. 2006). However, EBMs are notoriously difficult to train due to challenges in sampling and normalization (Du and Mordatch 2019). Diffusion probabilistic models (Ho et al. 2020; Sohl-Dickstein et al. 2015; Graikos et al. 2022), such as the well-known denoising diffusion probabilistic model (DDPM) (Ho et al. 2020), can be viewed as a practical compromise: they implicitly learn energy gradients via score-based training (Hyvärinen 2005; Song et al. 2021) and enable stable sampling using Langevin-like denoising processes (Neal 2011; Welling and Teh 2011). On a high level, diffusion probabilistic models operate by first applying a forward process that gradually corrupts data into noise, followed by a reverse process that reconstructs the original data from the noise (Ho et al. 2020). Over the last few years, diffusion-based generative models have emerged as strong generative tools, achieving SOTA performances in text generation (X. Li et al. 2022), audio (Kong et al. 2020) and image synthesis (Dhariwal and Nichol 2021). In the context of TSF, diffusion-based models often guide generation by conditioning on partial observations, reference samples, decompositions or architectural priors such as RNNs, transformers or state space models (Rasul et al. 2021; Tashiro et al. 2021; Alcaraz and Strodthoff 2022; Shen and Kwok 2023; Liu et al. 2024; Shen et al. 2023). For instance, TimeGrad (Rasul et al.
2021) uses RNN hidden states, while TimeDiff (Shen and Kwok 2023) incorporates task-specific conditioning like future mixup and autoregressive initialization. On the other hand, mr-Diff (Shen et al. 2023) leverages the multi-scale structure of time series by conditioning the denoising process on progressively refined trends, starting from coarse to fine levels. Similarly, TMDM (Li et al. 2023) and RATD (Liu et al. 2024) integrate transformer-based and retrieval-augmented conditioning, respectively. Going one step further, D\(^3\)M (Yan et al. 2024) introduces a decomposable denoising diffusion framework that unifies continuous flow models and diffusion models, achieving high-speed generation with fewer diffusion steps. Alternatively, TSDiff (Kollovieh et al. 2023) mitigates explicit conditioning during training by adopting an unconditional framework, using a self-guidance mechanism at inference time to adapt to tasks like forecasting and refinement without auxiliary networks. Contrary to previous approaches, DSPD-GP (Biloš et al. 2023) treats the time series as continuous functions rather than discrete measurements, defining diffusion not over discrete vectors but over functions, enabling direct handling of irregularly-sampled time series. Moreover, their Stochastic Process Diffusion framework applies diffusion in function space, using correlated noise from Gaussian processes to preserve temporal continuity and handle irregular sampling. On the other hand, D\(^3\)VAE (Y. Li et al. 2022) proposes a bidirectional VAE augmented with a coupled diffusion process, which simultaneously diffuses input and target series to reduce uncertainty. It further integrates denoising score matching and disentangled latent variables to improve interpretability and robustness, demonstrating strong performance on short and noisy time series.
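On a mechanical level, the forward corruption process shared by DDPM-style models admits a closed form: \(x_t \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t))\). The sketch below illustrates this for a scalar series value; the constant variance schedule is illustrative, not taken from any cited model.

```python
import math
import random

def forward_diffuse(x0, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar) * x0, 1 - alpha_bar),
    the closed-form DDPM forward (noising) process at noise level alpha_bar."""
    eps = rng.gauss(0.0, 1.0)
    return math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps

# A constant beta schedule: alpha_bar (the surviving signal fraction)
# shrinks toward 0, so x_T approaches pure N(0, 1) noise.
betas = [0.02] * 100
alpha_bars = []
prod = 1.0
for beta in betas:
    prod *= 1.0 - beta
    alpha_bars.append(prod)

rng = random.Random(0)
x_early = forward_diffuse(3.0, alpha_bars[0], rng)   # still close to x0 = 3
x_late = forward_diffuse(3.0, alpha_bars[-1], rng)   # mostly noise
assert alpha_bars[0] > 0.95 and alpha_bars[-1] < 0.2
```

The reverse process then trains a network to undo these corruptions step by step, which is where the models above insert their respective conditioning signals.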

Latent Generative Methods.

While diffusion-based models have gained traction in generative modeling for their flexibility and ability to capture complex, multimodal distributions, they often suffer from high computational cost and slow sampling, limitations that can be especially problematic in time-sensitive or resource-constrained forecasting scenarios (Yeğin and Amasyalı 2024; Yang et al. 2024). In contrast, latent-variable approaches such as Variational Autoencoders and (probabilistic) State Space Models offer a compelling alternative by trading off expressivity for faster inference and improved interpretability (Tong et al. 2022; Bézenac et al. 2020).
Variational Autoencoders (VAEs) (Kingma and Welling 2014; Rezende et al. 2014) simplify generative modeling by learning to represent complex data distributions in a lower-dimensional latent space using variational inference. In variational inference, the main idea is to approximate the true distribution with a simpler distribution, e.g. Gaussian, and minimize the Kullback-Leibler (KL) divergence, shown in the upcoming Section [sec:related_work_Prob_Scores] in Equation [eq:KL], between the approximate and true distribution, also known as evidence lower bound optimization (Yeğin and Amasyalı 2024). Broadly speaking, VAEs first encode the data into a lower-dimensional latent space where a simpler probabilistic model can be imposed, then forecasts are generated by decoding samples drawn from this latent distribution back into the observation space. Compared to other likelihood-based models like normalizing flows or energy-based models, VAEs offer efficient, tractable sampling and readily accessible inference via encoder networks (Vahdat and Kautz 2020). Recent advances in other generative tasks, e.g. image generation (Vahdat and Kautz 2020; Luo et al. 2025), have motivated their adoption in time series forecasting tasks. For instance, the Temporal Latent AutoEncoder (TLAE) (Nguyen and Quanz 2021) introduces a nonlinear factorization framework for multivariate time series, enabling end-to-end learning of complex latent dynamics while preserving scalability. Similarly, VSMHN (Li et al. 2021), based on the conditional VAE (Sohn et al. 2015), handles asynchronous event-driven data with aligned time encodings to jointly forecast across heterogeneous temporal sources. To better address long-range dependencies and structural interpretability, PDTrans (Tong et al. 2022) fuses Transformer-based temporal modeling with a VAE-based latent decomposition, providing interpretable forecasts through trend and seasonality disentanglement.
The Latent Diffusion Transformer (LDT) (Feng, Miao, Zhang, et al. 2024) compresses high-dimensional multivariate series into latent representations using a statistics-aware autoencoder and generates forecasts via a diffusion-based generator with self-conditioning. Likewise, the D\(^3\)VAE (Y. Li et al. 2022) model fuses the Nouveau VAE (Vahdat and Kautz 2020) with a coupled diffusion process and multiscale denoising to enhance robustness under limited or noisy data.
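The two VAE ingredients referenced above, the KL regularizer of the ELBO and reparameterized sampling, can be sketched in a few lines. This is a generic scalar illustration under a standard-normal prior, not the implementation of any cited model.

```python
import math
import random

def kl_gauss_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), the regularizer that
    appears in the VAE evidence lower bound (ELBO)."""
    return 0.5 * (math.exp(log_var) + mu * mu - 1.0 - log_var)

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps: sampling stays differentiable w.r.t. (mu, log_var)."""
    return mu + math.exp(0.5 * log_var) * rng.gauss(0.0, 1.0)

# KL vanishes exactly when the approximate posterior equals the prior ...
assert kl_gauss_to_std_normal(0.0, 0.0) == 0.0
# ... and grows as the encoder's output drifts away from N(0, 1).
assert kl_gauss_to_std_normal(2.0, 0.0) > kl_gauss_to_std_normal(1.0, 0.0)
z = reparameterize(0.0, 0.0, random.Random(1))
```

Forecasting with a trained VAE then amounts to drawing latent samples z and decoding them into the observation space, yielding a predictive distribution by repetition.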
State Space Models (SSMs) (Hyndman et al. 2002; Seeger et al. 2016; Durbin and Koopman 2012) provide a framework for modeling sequential data via latent variables that evolve over time according to structured transition dynamics and generate observations through stochastic emission processes. Commonly implemented deterministic time series forecasting methods such as Exponential Smoothing (Hyndman et al. 2002; Hyndman et al. 2008), ARIMA (Box and Pierce 1970), and LSTMs (Hochreiter and Schmidhuber 1997) can all be interpreted as special cases or deterministic approximations of SSMs (Durbin and Koopman 2012). A classic probabilistic instance is the linear-Gaussian SSM (Roweis and Ghahramani 1999), which offers closed-form solutions for filtering, smoothing, and likelihood computation, enabling efficient handling of missing data, and tractable multi-step forecasting with full uncertainty quantification (avoiding error accumulation) (Bézenac et al. 2020). Furthermore, in their probabilistic form, SSMs define joint distributions over latent states and observations, such as in Gaussian Process SSMs (Ko and Fox 2011; Deisenroth et al. 2012). This enables uncertainty quantification through inference techniques like Kalman filtering (Kalman 1960), treating the hidden state as a distribution that is recursively updated and propagated to generate predictive distributions over future observations. In probabilistic TSF, recent works have explored hybrid models that combine the structural benefits of SSMs with the expressivity of deep learning, where one prominent direction is the use of deep neural networks to parameterize SSM components. For example, DeepState (Rangapuram et al. 2018) employs an RNN to learn a shared global mapping from covariates to the parameters of a linear-Gaussian SSM, in which each individual time series has an associated linear Gaussian SSM whose parameters are dynamically generated by the RNN. Similarly, DNLSSM (Du et al.
2023) integrates LSTM networks with the unscented Kalman filter (Julier and Uhlmann 2004) to better model non-linear dynamics. Extending these ideas to attention-based temporal dependencies, the Probabilistic Transformer (ProTran) (Tang and Matteson 2021) replaces recurrent components with a Transformer architecture, enabling the model to learn non-Markovian dependencies in the latent space via self-attention. From a different standpoint, PR-SSM (Doerr et al. 2018) substitutes the parametric transition dynamics with Gaussian Processes, where training relies on doubly stochastic variational inference. To address the restrictive assumption of Gaussianity, flow-based SSMs have been proposed. For instance, the Normalizing Kalman Filter (NKF) (Bézenac et al. 2020) augments a linear-Gaussian SSM by integrating normalizing flows into the observation model. This allows the latent state dynamics to remain analytically tractable via Kalman filtering, while enabling the observation distribution to flexibly model complex, non-Gaussian behavior. Similarly, EMSSM (Sun et al. 2022) implements an external memory mechanism and conditional normalizing flows, enhancing the model’s capacity to capture long-range dependencies and adapt to distributional shifts.
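For intuition, the closed-form filtering recursion of a scalar linear-Gaussian SSM can be sketched as follows. The parameters \(A, Q, H, R\) and observations are illustrative; real models learn or estimate them.

```python
def kalman_step(m, P, y, A=1.0, Q=0.1, H=1.0, R=0.5):
    """One predict/update cycle of a scalar linear-Gaussian SSM:
        state:       x_t = A * x_{t-1} + N(0, Q)
        observation: y_t = H * x_t     + N(0, R)
    Returns the filtered mean and variance of x_t given y_1..y_t."""
    # Predict: propagate the previous posterior through the dynamics.
    m_pred = A * m
    P_pred = A * P * A + Q
    # Update: fold in the new observation via the Kalman gain.
    S = H * P_pred * H + R          # innovation variance
    K = P_pred * H / S              # Kalman gain
    m_new = m_pred + K * (y - H * m_pred)
    P_new = (1.0 - K * H) * P_pred
    return m_new, P_new

m, P = 0.0, 1.0  # prior over the initial latent state
for y in [1.2, 0.9, 1.1, 1.0]:
    m, P = kalman_step(m, P, y)
# Each observation contracts the posterior variance, and the filtered
# mean moves toward the ~1.0 observations.
assert 0.0 < P < 1.0
assert abs(m - 1.0) < 0.3
```

Hybrid models such as DeepState keep exactly this recursion but let a neural network emit the parameters per series and time step.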

Generative Adversarial Paradigms.

Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), on the other hand, bypass likelihood estimation altogether, directly learning to generate realistic samples through adversarial training. A GAN typically consists of two NNs, a generator and a discriminator. In their min-max adversarial training game, the generator creates synthetic samples while the discriminator is tasked with distinguishing between artificial samples from the generator and real samples (S. Wu et al. 2020). GAN-based models have demonstrated impressive results in various generative tasks, such as image generation (Jiang et al. 2021) and audio synthesis (Donahue et al. 2018). Similarly, the primary focus of GANs in time series applications lies in time series synthesis and generation, e.g. see TimeGAN (Yoon et al. 2019), GT-GAN (Jeon et al. 2022) or GAN-based approaches in the Time Series Generative Modeling benchmark (Nikitin et al. 2024). Nevertheless, GAN-based probabilistic time series forecasting approaches have been proposed as well. For instance, the Adversarial Sparse Transformer (AST) (S. Wu et al. 2020) combines a sparse Transformer with adversarial training to improve time series forecasting. AST employs a generator-discriminator framework, where the generator learns sparse attention patterns for forecasting, while the discriminator ensures sequence-level fidelity. Furthermore, Koochali et al. (2021) introduce ProbCast, a probabilistic forecasting model based on a conditional GAN (Mirza and Osindero 2014). On a final note, despite their potential, GANs remain notoriously difficult to train, often facing challenges such as hyperparameter sensitivity, training instability, and mode collapse (Yeğin and Amasyalı 2024).
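The min-max game can be made concrete by evaluating the original GAN value function for a single real/fake pair; the discriminator outputs below are illustrative numbers, not outputs of any cited model.

```python
import math

def gan_value(d_real, d_fake):
    """Value of the GAN min-max game for one real/fake pair:
    V = log D(x_real) + log(1 - D(G(z))).
    The discriminator maximizes V; the generator minimizes it."""
    return math.log(d_real) + math.log(1.0 - d_fake)

# A sharp discriminator (confident on both inputs) attains a high value ...
sharp = gan_value(d_real=0.99, d_fake=0.01)
# ... while at the equilibrium a fully fooled discriminator outputs 0.5
# everywhere and the value collapses to log(1/4).
equilibrium = gan_value(d_real=0.5, d_fake=0.5)
assert sharp > equilibrium
```

Training alternates gradient steps on this objective for the two networks, which is exactly the source of the instability and mode-collapse issues noted above.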

Quantile-based Forecasting.

In practice, specifying an exact (parametric) distribution is often unnecessary. Instead, estimating a few key quantiles is sufficient for making optimal decisions, e.g. in epidemiologic forecasting (Bracher et al. 2021; Ray et al. 2020), wind power forecasting (Wan et al. 2017) or the classical newsvendor problem (Gneiting et al. 2023; Tarima and Zenkova 2020; Harsha et al. 2021), by helping quantify uncertainty and minimize losses (Bracher et al. 2021; Wen et al. 2018; Gneiting et al. 2023). As a result, a common non-parametric approach to modeling \(p(x_{L+1:L+H} | x_{1:L})\) is via the quantile function, where the models are typically trained using the quantile loss (see Equation [eq:QS] of Section [sec:related_work_Prob_Scores]). Quantile regression methods (Koenker and Bassett 1978; Koenker 2005) are frequently applied in forecasting, either by predicting a single quantile (for point forecasts) or multiple quantiles at once (Wen et al. 2018). Effectively, this method approximates the quantile function by targeting specific quantile levels (Benidis et al. 2022). Two notable approaches of this kind are the Multi-horizon Quantile Recurrent Forecaster (MQ-RNN) (Wen et al. 2018) and the Temporal Fusion Transformer (TFT) (Lim et al. 2021). MQ-RNN combines sequence-to-sequence NNs with quantile regression and a DMS strategy. On the other hand, TFT leverages attention mechanisms for interpretable DMS forecasting, using specialized components like gating mechanisms and variable selection networks to provide insights into temporal dynamics. Nevertheless, these approaches require the specification of a set of quantile levels a priori, necessitating retraining when querying new quantile levels (Park et al. 2022). As another option, Implicit Quantile Networks (IQNs) (Dabney et al. 2018) learn to map samples from a uniform distribution (Uniform[0,1]) to corresponding quantile values of the target distribution, removing the need for predefined quantile levels. 
This allows for continuous and arbitrarily fine-grained quantile estimation. Building on this, several models adapt IQNs for time series forecasting. For example, IQN-RNN (Gouttes et al. 2021) combines the IQN framework with RNNs, enabling autoregressive probabilistic forecasts without assuming a specific distributional form. GQFormer (Jawed and Schmidt-Thieme 2022) introduces a novel multi-task loss that encourages both sharpness and diversity in quantile estimates, allowing the model to capture multiple modes in complex multivariate joint distributions. Finally, GMQ-forecaster (Wen and Torkkola 2019) integrates IQNs with multivariate copula modeling, using IQNs for marginals and a learned copula to represent their dependencies, resulting in a fully generative joint distribution. Despite achieving SOTA performance in probabilistic TSF, the aforementioned quantile-based models remain susceptible to the quantile crossing problem, a violation of the inherent monotonicity of quantile functions (Park et al. 2022). Specifically, when different quantile levels (e.g., \(\alpha_1< \alpha_2\)) are modeled independently, for instance via separate functions (\(r_{\alpha_1}\) and \(r_{\alpha_2}\)), it is possible to encounter inputs \(x\) for which \(r_{\alpha_1}(x)>r_{\alpha_2}(x)\), thereby breaking the non-decreasing property of quantile functions (Gasthaus et al. 2019). Therefore, as an alternative to traditional quantile regression, one can directly model the entire quantile function by imposing a parametric structure instead, which requires the function to be defined over the unit interval [0,1] and to be monotonically increasing (Benidis et al. 2022). A practical way to satisfy these constraints is through the use of linear splines, as demonstrated in the Spline Quantile Function (SQF)-RNN framework (Gasthaus et al. 2019), where an RNN is trained to produce the parameters of a monotonic spline. 
By construction, this avoids quantile crossing and allows the full quantile function to be recovered from a compact parameterization. However, the SQF-RNN model has reduced flexibility in tail regions (Park et al. 2022). Hence, Park et al. (2022) propose Incremental (Spline) Quantile Functions, which extend SQF by adding tail extrapolation strategies, such as exponential Pareto distributions. Finally, the Multivariate Quantile Function Forecaster (MQF\(^2\)) (Kan et al. 2022) models the multivariate quantile function as the gradient of a convex function, parametrized via partially input convex neural networks (Amos et al. 2017; Huang et al. 2020).
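As an illustration of the quantile (pinball) loss that underlies all of these methods, consider a single prediction at level \(\alpha = 0.9\); the numeric values are hypothetical.

```python
def quantile_loss(alpha, y, y_hat):
    """Pinball loss at quantile level alpha: an asymmetric penalty whose
    expectation is minimized by the true alpha-quantile of y."""
    diff = y - y_hat
    return max(alpha * diff, (alpha - 1.0) * diff)

# For alpha = 0.9, under-prediction costs nine times as much as
# over-prediction, pulling the forecast toward the upper tail.
assert abs(quantile_loss(0.9, y=10.0, y_hat=9.0) - 0.9) < 1e-9   # under by 1
assert abs(quantile_loss(0.9, y=10.0, y_hat=11.0) - 0.1) < 1e-9  # over by 1
```

Training one such loss per level recovers quantile regression; IQNs instead sample \(\alpha\) from Uniform[0,1] during training, and spline-based methods parameterize the whole monotone function at once.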

Multivariate Dependencies.

The probabilistic models considered so far involve an additional layer of complexity: the presence of multivariate dependencies in modeling the conditional distribution \(p(\mathbf{x}_{L+1:L+H} \mid \mathbf{x}_{1:L})\). These dependencies can be broadly categorized along two primary axes: temporal dependencies between the future time steps, and cross-series dependencies between the individual channels of a multivariate series.

Models may capture neither, one, or both forms of dependencies. Nevertheless, in the following, our analysis focuses primarily on temporal dependencies. This emphasis is motivated by the observation that DMS models, although dominant in point forecasting (see Table [tab:ltsf]), generate all future steps in a single forward pass conditioned only on the input context. As a consequence, DMS models cannot naturally model how future points depend on each other and instead assume independence between future steps (Taieb et al. 2012), unless such dependencies are explicitly modeled. In contrast, probabilistic IMS methods typically factorize the joint temporal distribution \(p(\mathbf{x}_{L+1:L+H} \mid \mathbf{x}_{1:L})\) into a product of conditional one-step distributions, as shown in Equation [eq:IMS] of Chapter [ch:preliminaries], and therefore model temporal dependencies naturally through sequential step-by-step prediction. In addition, many point LTSF models adopt a CI design, explicitly excluding cross-series dependencies. As such, cross-series interactions are often not represented in the underlying model architecture, making it difficult to incorporate probabilistic cross-channel dependencies when transitioning from point LTSF to probabilistic LTSF. Irrespective of the type of multivariate dependencies being modeled (cross-series, temporal, or both), various modeling strategies can be employed to approximate multivariate distributions. Two key challenges in modeling multivariate distributions are the positive-definiteness constraint and the quadratic complexity of estimating full covariance matrices, which requires \(O(N^2)\) parameters, where \(N=H\) in the case of temporal dependencies (Pourahmadi 2011). To address this, one approach is to impose structural constraints on the covariance matrix by using a low-rank plus diagonal decomposition, which ensures positive-definiteness and reduces the number of parameters (Y. Wu et al.
2020; Horn and Johnson 2012). An alternative strategy is to bypass direct estimation of the covariance structure and instead use copulas, which provide a way to decouple the modeling of marginal distributions from their joint dependence structure (Wilson and Ghahramani 2010; Größer and Okhrin 2022). According to Sklar’s theorem (Sklar 1959), any multivariate distribution can be expressed in terms of its marginals and a copula function that captures the dependencies among variables. This decomposition is particularly useful in time series settings, where marginals can exhibit diverse characteristics (e.g., skewness, seasonality) and dependencies may be nonlinear and time-varying (e.g., temporal autocorrelation or inter-series interactions) (Salinas et al. 2019). GPVar (Salinas et al. 2019) is an example of a parametric copula approach, in which the marginals are estimated independently and then transformed into a latent Gaussian space, where dependencies are modeled using a Gaussian copula with a low-rank plus diagonal covariance structure. Although this reduces the parameter complexity from \(O(N^2)\) to \(O(N)\), parametric approaches make strong assumptions about the underlying data distribution (Ashok et al. 2023). To address this, nonparametric copula estimators such as TACTiS and TACTiS-2 (Drouin et al. 2022; Ashok et al. 2023) offer greater flexibility by learning the copula function directly from the data without assuming a fixed parametric form. TACTiS (Drouin et al. 2022) models the copula density autoregressively in the copula-transformed space using a transformer-based architecture, where each variable is conditioned on previously modeled ones. To ensure that the learned copula is valid, i.e. invariant to the ordering of variables, TACTiS averages over multiple variable permutations during training.
However, achieving full permutation invariance would require averaging over all possible orderings, which results in factorial complexity \(O(N!)\) in the number of variables (with \(N = H\) for temporal dependencies) (Ashok et al. 2023). TACTiS-2 (Ashok et al. 2023) mitigates this by implementing a two-stage training protocol to learn marginals and the copula. It first fits normalizing flows to estimate marginal CDFs, then uses a fixed-order attention decoder with causal masking to model the joint copula density. This reduces complexity from factorial \(O(N!)\) to linear \(O(N)\). Beyond the copula framework, a different direction involves directly learning the multivariate quantile function, the inverse CDF, as done by MQF\(^2\) (Kan et al. 2022). Furthermore, MQF\(^2\) represents the joint quantile function as the gradient of a convex function parameterized by an input convex neural network (ICNN) (Amos et al. 2017), a NN architecture designed to be convex with respect to a subset of its inputs through specific structural constraints. MQF\(^2\) avoids quantile crossing but introduces additional complexity during both training and inference, as evaluating the quantile function involves solving an optimization problem. Lastly, several generative approaches naturally extend to jointly model either or both of cross-series and temporal dependencies by learning full multivariate distributions. For instance, this includes diffusion (TMDM (Li et al. 2023)), VAE (D\(^3\)VAE (Y. Li et al. 2022)) or GAN (AST (S. Wu et al. 2020)) models.
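Sklar's decomposition can be illustrated in miniature with a bivariate Gaussian copula pushed through exponential marginals. All distributions and parameters here are illustrative, not those of GPVar or TACTiS.

```python
import math
import random

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def exp_inverse_cdf(u, rate):
    return -math.log(1.0 - u) / rate

def sample_joint(rho, rng):
    """One draw from a bivariate Gaussian copula (correlation rho) combined
    with exponential marginals: the copula carries the dependence structure,
    the inverse marginal CDFs carry the per-variable distributions."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    u1, u2 = std_normal_cdf(z1), std_normal_cdf(z2)  # the copula sample
    return exp_inverse_cdf(u1, rate=1.0), exp_inverse_cdf(u2, rate=0.5)

rng = random.Random(42)
samples = [sample_joint(0.9, rng) for _ in range(5000)]
# The strong positive dependence survives the marginal transforms: draws
# tend to fall on the same side of their respective medians.
med1, med2 = math.log(2.0), 2.0 * math.log(2.0)
concordant = sum(1 for x1, x2 in samples if (x1 > med1) == (x2 > med2))
assert concordant / len(samples) > 0.7
```

Replacing the fixed correlation with a learned low-rank plus diagonal covariance, or the Gaussian copula with a learned nonparametric one, recovers the design space of the models discussed above.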

LTSF and probabilistic forecasting.

So far, we have seen two primary branches of time series forecasting: long-term point forecasting methods and probabilistic forecasting approaches. However, integrating both approaches into long-term distributional forecasting is still a significant open challenge (Zhang et al. 2024). Furthermore, in their analysis Zhang et al. (2024) find that point LTSF methods tend to focus on a DMS strategy, due to error accumulation effects of IMS methods on longer forecasting horizons. Additionally, their analysis shows that probabilistic TSF methods adopt either IMS or DMS strategies without a clear preference, which they attribute to the shorter forecasting horizons considered. Looking at Table [tab:ltsf] and Table [tab:prob_tsf], we come to the same conclusion. Furthermore, although probabilistic methods consider mostly the same datasets as point LTSF methods (Kollovieh et al. 2023; Rasul et al. 2023; Drouin et al. 2022; Tong et al. 2022), there are only a few probabilistic methods listed in Table [tab:prob_tsf] for which the forecasting horizon \(H\) is larger than 300. In contrast, the longest forecasting horizon in the default LTSF setup is 720 (Zhou et al. 2021). Nevertheless, recent studies have begun to explore modeling longer-term dependencies in probabilistic TSF. For instance, SSSD (Alcaraz and Strodthoff 2022) includes a long-term forecasting experiment inspired by the point-based LTSF framework of Zhou et al. (2021). However, their evaluation in this setting focuses solely on point predictions, neglecting the probabilistic forecasting performance. More recently, SutraNets (Bergsma et al. 2023) and RATD (Liu et al. 2024) have investigated probabilistic forecasting over extended horizons. While SutraNets consider long horizons (e.g., \(H > 300\)), these experiments are limited to the MNIST dataset (LeCun et al. 1998), which is not a standard benchmark in time series forecasting. In contrast, RATD (Liu et al.
2024) is evaluated under a more classical LTSF setting, comparing performance against established point forecasting models, such as iTransformer, Informer, PatchTST, TimesNet, and DLinear (Liu et al. 2023; Zhou et al. 2021; Nie et al. 2022; Wu et al. 2022; Zeng et al. 2023), as well as probabilistic models like CSDI and D\(^3\)VAE (Tashiro et al. 2021; Y. Li et al. 2022). However, point LTSF models in this evaluation are neither adapted nor interpreted within a probabilistic framework. From a different standpoint, several model-agnostic approaches have been proposed to derive probabilistic forecasts from point forecasting methods. VQ-TR (Rasul et al. 2023), for example, is evaluated in a probabilistic setting alongside transformer-based methods that were originally proposed for point LTSF, e.g. Informer, Autoformer and PatchTST (Zhou et al. 2021; Wu et al. 2021; Nie et al. 2022). To enable probabilistic outputs, Rasul et al. (2023) modify deterministic TSF architectures to either implement a parametric distributional forecasting approach with a negative binomial or Student-t distribution or implement an IQN head. However, the forecasting horizons considered are relatively short and not directly comparable to common LTSF settings. TMDM (Li et al. 2023) presents a plug-and-play framework compatible with arbitrary point forecasting models. Its core idea is to leverage the strength of LTSF models in estimating the conditional mean, using this to guide the generation of full predictive distributions via diffusion. TMDM integrates architectures such as Informer, Autoformer or NSformer (Zhou et al. 2021; Wu et al. 2021; Liu et al. 2022) as backbones, but also limits forecasting to horizons no greater than 196 steps. Finally, ProbCast (Koochali et al. 2021) introduces a GAN-based framework for transforming point forecasting models into probabilistic ones.
However, their evaluation also does not address extended forecasting horizons and is restricted to a simple RNN-based backbone.
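To make the idea of such model-agnostic conversions concrete, the following sketch attaches a simple parametric distributional head to an arbitrary point forecaster: the backbone yields a point estimate, a per-step Gaussian spread is fitted from point-forecast residuals on held-out windows, and predictive quantiles follow from the fitted distribution. This is a minimal illustration under a Gaussian residual assumption, not the procedure of any of the works cited above; `naive_point_forecast` and the toy series are purely hypothetical stand-ins.

```python
from statistics import NormalDist, stdev

def naive_point_forecast(history, horizon):
    """Toy point forecaster: repeat the last observed value.
    A hypothetical stand-in for any point LTSF backbone."""
    return [history[-1]] * horizon

def fit_residual_scale(series, context, horizon):
    """Estimate a per-step Gaussian scale from point-forecast
    residuals collected over rolling held-out windows."""
    residuals = [[] for _ in range(horizon)]
    for start in range(len(series) - context - horizon + 1):
        hist = series[start:start + context]
        future = series[start + context:start + context + horizon]
        pred = naive_point_forecast(hist, horizon)
        for h in range(horizon):
            residuals[h].append(future[h] - pred[h])
    return [stdev(r) if len(r) > 1 else 0.0 for r in residuals]

def probabilistic_forecast(history, scales, horizon, quantiles=(0.1, 0.5, 0.9)):
    """Turn the point forecast into per-step Gaussian predictive quantiles."""
    point = naive_point_forecast(history, horizon)
    return [
        {q: point[h] + scales[h] * NormalDist().inv_cdf(q) for q in quantiles}
        for h in range(horizon)
    ]

series = [10.0, 11.0, 9.5, 10.5, 10.0, 11.5, 9.0, 10.0, 10.8, 9.7]
scales = fit_residual_scale(series, context=4, horizon=2)
fc = probabilistic_forecast(series[-4:], scales, horizon=2)
```

By construction, the median quantile of each step coincides with the point forecast; richer heads (e.g., negative binomial, Student-t, or IQN, as in the works above) replace the Gaussian assumption with a learned output distribution trained end-to-end.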

To summarize, in the domain of probabilistic TSF, the diversity of forecasting setups, e.g. with respect to prediction horizons (see Table [tab:prob_tsf]), makes it challenging to establish consistent benchmarks or perform definitive model comparisons across studies. Consequently, we do not identify a clear state-of-the-art for probabilistic forecasting as we have for point LTSF, since the studies tend to be more isolated and less directly comparable. In addition, as discussed, only a limited number of studies have explored probabilistic methods within the LTSF setting (Alcaraz and Strodthoff 2022; Bergsma et al. 2023; Liu et al. 2024). Likewise, only a few works have attempted to extend point LTSF architectures to support probabilistic outputs (Rasul et al. 2023; Li et al. 2023; Koochali et al. 2021). Hence, in this work we aim to bridge this gap by evaluating point LTSF methods in a probabilistic setting, focusing on longer forecasting horizons than prior studies and assessing them with probabilistic metrics. In doing so, we also analyze the behavior of IMS and DMS strategies in more depth, identifying cases where DMS methods fail to yield reliable probabilistic forecasts. Properly training and evaluating probabilistic methods requires suitable evaluation metrics, which are the subject of the following and final section of the related work.
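The IMS/DMS distinction used throughout this section can be illustrated with a toy linear autoregressive model; all names and coefficients below are invented for illustration. An IMS forecaster rolls a one-step model forward and conditions on its own previous outputs, whereas a DMS forecaster maps the context directly to every horizon step:

```python
def ims_forecast(one_step_model, context, horizon):
    """Iterated multi-step: apply the one-step model repeatedly,
    feeding each prediction back into the input window."""
    window = list(context)
    preds = []
    for _ in range(horizon):
        y = one_step_model(window)
        preds.append(y)
        window = window[1:] + [y]  # condition on the model's own output
    return preds

def dms_forecast(multi_step_model, context, horizon):
    """Direct multi-step: a single map from the context to all horizon steps."""
    return multi_step_model(context, horizon)

# Toy AR(1)-style models; the 0.9 coefficient is made up for illustration.
one_step = lambda w: 0.9 * w[-1]
multi_step = lambda w, h: [0.9 ** (k + 1) * w[-1] for k in range(h)]

context = [1.0, 2.0, 4.0]
p_ims = ims_forecast(one_step, context, horizon=3)
p_dms = dms_forecast(multi_step, context, horizon=3)
```

With a correctly specified, noise-free model the two strategies coincide, as here. The error-accumulation issue raised by Zhang et al. (2024) appears once the one-step model is misspecified or noisy, because IMS conditions every later step on its own earlier mistakes, while DMS pays instead with a harder multi-output learning problem.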

| Model | Venue | IMS/DMS | Backbone | CI/CD | H | Prob. Method |
|---|---|---|---|---|---|---|
| MQ-RNN/MQ-CNN (Wen et al. 2018) | NeurIPS’17 | DMS | CNN/RNN | CI | 52 | Quantiles |
| DeepState (Rangapuram et al. 2018) | NeurIPS’18 | DMS | SSM/RNN | CI | 168 | Prob. Latent Variable |
| GP-Copula (Salinas et al. 2019) | NeurIPS’19 | IMS | RNN | CD | 30 | Distributional & Copula |
| GMQ (Wen and Torkkola 2019) | ICML’19 | IMS | CNN/RNN | CD | 30 | IQN & Copula |
| SQF-RNN (Gasthaus et al. 2019) | AISTATS’19 | IMS | RNN | CI | 60 | Quantile Function |
| DeepTCN (Chen et al. 2019) | MileTS’19 | DMS | CNN | CD | 31 | Distributional/Quantiles |
| AST (S. Wu et al. 2020) | NeurIPS’20 | IMS/DMS | Transformer(E-D) | CI | 168 | GAN |
| DeepAR (Salinas et al. 2020) | IJF’20 | IMS | RNN | CI | 52 | Distributional |
| ProTran (Tang and Matteson 2021) | NeurIPS’21 | DMS | SSM/Transformer(E) | CD | 30 | Prob. Latent Variable |
| CSDI (Tashiro et al. 2021) | NeurIPS’21 | DMS | Transformer(E) | CD | 24 | Diffusion |
| TimeGrad (Rasul et al. 2021) | ICML’21 | IMS | RNN | CD | 30 | Diffusion |
| IQN-RNN (Gouttes et al. 2021) | ICML’21 | IMS | RNN | CI | 30 | IQN |
| LSTM-Real-NVP/Transformer-MAF (Rasul et al. 2020) | ICLR’21 | IMS | RNN/Transformer(E-D) | CD | 30 | Flow |
| VSMHN (Li et al. 2021) | AAAI’21 | IMS | RNN | CD | 24 | VAE |
| TLAE (Nguyen and Quanz 2021) | AAAI’21 | DMS | AE/RNN | CD | 24 | VAE |
| ProbCast (Koochali et al. 2021) | Eng. Proc.’21 | DMS | RNN | CD | 24 | GAN |
| TFT (Lim et al. 2021) | IJF’21 | DMS | RNN | CI | 30 | Quantiles |
| D\(^3\)VAE (Y. Li et al. 2022) | NeurIPS’22 | DMS | MLP | CI | 64 | Diffusion & VAE |
| C2FAR (Bergsma et al. 2022) | NeurIPS’22 | IMS | RNN | CD | 48 | Distributional |
| TACTiS (Drouin et al. 2022) | ICML’22 | IMS | Transformer(E) | CI | 72 | Flow & Copula |
| EMSSM (Sun et al. 2022) | IJCAI’22 | IMS | SSM & RNN | CI | 30 | Flow & Prob. Latent Variables |
| MQF\(^2\) (Kan et al. 2022) | AISTATS’22 | DMS | RNN & PICNN (Amos et al. 2017) | CD | 24 | Quantile Function |
| ISQF (Park et al. 2022) | AISTATS’22 | DMS | CNN | CI | 30 | Quantile Function |
| DNLSSM (Du et al. 2023) | CAAI’22 | IMS | RNN & SSM | CI | 196 | Prob. Latent Variable |
| GQFormer (Jawed and Schmidt-Thieme 2022) | IEEE Big Data’22 | DMS | Transformer(E) | CD | 168 | IQN |
| VQ-AR (Rasul et al. 2022) | - | IMS | RNN | CI | 30 | Distributional |
| TSDiff (Kollovieh et al. 2023) | NeurIPS’23 | DMS | SSM | CI | 48 | Diffusion |
| SutraNets (Bergsma et al. 2023) | NeurIPS’23 | IMS | RNN | CI | 392 | Distributional |
| DSDP-GP (Biloš et al. 2023) | ICML’23 | DMS | RNN | CI | 30 | Diffusion |
| mr-diff (Shen et al. 2023) | ICLR’24 | DMS | MLP | CD | 24 | Diffusion |
| SSSD (Alcaraz and Strodthoff 2022) | TMLR’23 | DMS | SSM | CD | 672 | Diffusion |
| pTSE (Zhou et al. 2023) | IJCAI’23 | DMS | Ensemble | CD | 24 | HMM |
| BiTCN (Sprangers et al. 2023) | IJF’23 | DMS | CNN | CD | 30 | Distributional |
| PDTrans (Tong et al. 2022) | SDM’23 | IMS & DMS | Transformer(E-D) | CI | 168 | Prob. Latent Variable |
| RATD (Liu et al. 2024) | NeurIPS’24 | DMS | Transformer(E)/RAG/DiffWave | CD | 336 | Diffusion |
| D\(^3\)M (Yan et al. 2024) | ICML’24 | DMS | SSM & WaveNet (CNN) | CD | 30 | Diffusion & Flow |
| TACTiS-2 (Ashok et al. 2023) | ICLR’24 | IMS | Transformer(E) | CI | 72 | Flow & Copula |
| TMDM (Li et al. 2023) | ICLR’24 | DMS | Transformer(E) | CD | 192 | Diffusion |
| VQ-TR (Rasul et al. 2023) | ICLR’24 | IMS | Transformer(E-D) | CI | 48 | Distributional |
| LDT (Feng, Miao, Zhang, et al. 2024) | AAAI’24 | DMS | Transformer(E-D)-AE | CD | 48 | Diffusion |
| MANF (Feng, Miao, Xu, et al. 2024) | IEEE TKDE’24 | DMS | Transformer(E) | CD | 30 | Flow |
| GPF-WI (Wang et al. 2024) | CISS’24 | DMS | AE | CD | 24 | VAE |
Alcaraz, Juan Lopez, and Nils Strodthoff. 2022. “Diffusion-Based Time Series Imputation and Forecasting with Structured State Space Models.” Transactions on Machine Learning Research, December. https://openreview.net/forum?id=hHiIbk7ApW.
Alexandrov, Alexander, Konstantinos Benidis, Michael Bohlke-Schneider, et al. 2020. “GluonTS: Probabilistic and Neural Time Series Modeling in Python.” Journal of Machine Learning Research 21 (116): 1–6. http://jmlr.org/papers/v21/19-820.html.
Amos, Brandon, Lei Xu, and J. Zico Kolter. 2017. “Input Convex Neural Networks.” Proceedings of the 34th International Conference on Machine Learning, July, 146–55. https://proceedings.mlr.press/v70/amos17b.html.
Ashok, Arjun, Étienne Marcotte, Valentina Zantedeschi, Nicolas Chapados, and Alexandre Drouin. 2023. “TACTiS-2: Better, Faster, Simpler Attentional Copulas for Multivariate Time Series.” Proceedings of the Twelfth International Conference on Learning Representations, October. https://openreview.net/forum?id=xtOydkE1Ku.
Benidis, Konstantinos, Syama Sundar Rangapuram, Valentin Flunkert, et al. 2022. “Deep Learning for Time Series Forecasting: Tutorial and Literature Survey.” ACM Computing Surveys 55 (6). https://doi.org/10.1145/3533382.
Bergsma, Shane, Tim Zeyl, and Lei Guo. 2023. “SutraNets: Sub-Series Autoregressive Networks for Long-Sequence, Probabilistic Forecasting.” In Advances in Neural Information Processing Systems, edited by A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, vol. 36. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/6171c9e600432a42688ad61a525951bf-Paper-Conference.pdf.
Bergsma, Shane, Tim Zeyl, Javad Rahimipour Anaraki, and Lei Guo. 2022. “C2FAR: Coarse-to-Fine Autoregressive Networks for Precise Probabilistic Forecasting.” In Advances in Neural Information Processing Systems, edited by S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, vol. 35. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2022/file/899511e37a8e01e1bd6f6f1d377cc250-Paper-Conference.pdf.
Bézenac, Emmanuel de, Syama Sundar Rangapuram, Konstantinos Benidis, et al. 2020. “Normalizing Kalman Filters for Multivariate Time Series Analysis.” Advances in Neural Information Processing Systems 33: 2995–3007. https://proceedings.neurips.cc/paper_files/paper/2020/hash/1f47cef5e38c952f94c5d61726027439-Abstract.html.
Biloš, Marin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, and Stephan Günnemann. 2023. “Modeling Temporal Data as Continuous Functions with Stochastic Process Diffusion.” Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA), ICML’23, vol. 202 (July): 2452–70.
Box, G. E. P., and David A. Pierce. 1970. “Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models.” Journal of the American Statistical Association 65 (332): 1509–26. https://doi.org/10.1080/01621459.1970.10481180.
Bracher, Johannes, Evan L. Ray, Tilmann Gneiting, and Nicholas G. Reich. 2021. “Evaluating Epidemic Forecasts in an Interval Format.” PLOS Computational Biology 17 (2): e1008618. https://doi.org/10.1371/journal.pcbi.1008618.
Chen, Yitian, Yanfei Kang, Yixiong Chen, and Zizhuo Wang. 2019. “Probabilistic Forecasting with Temporal Convolutional Neural Network.” MileTS ’19: 5th KDD Workshop on Mining and Learning from Time Series (Anchorage, Alaska, USA), August, 11. https://doi.org/10.1145/1122445.1122456.
Dabney, Will, Georg Ostrovski, David Silver, and Remi Munos. 2018. “Implicit Quantile Networks for Distributional Reinforcement Learning.” Proceedings of the 35th International Conference on Machine Learning, July, 1096–105. https://proceedings.mlr.press/v80/dabney18a.html.
Deisenroth, Marc Peter, Ryan Darby Turner, Marco F. Huber, Uwe D. Hanebeck, and Carl Edward Rasmussen. 2012. “Robust Filtering and Smoothing with Gaussian Processes.” IEEE Transactions on Automatic Control 57 (7): 1865–71. https://doi.org/10.1109/TAC.2011.2179426.
Dhariwal, Prafulla, and Alexander Nichol. 2021. “Diffusion Models Beat GANs on Image Synthesis.” Advances in Neural Information Processing Systems 34: 8780–94. https://proceedings.neurips.cc/paper/2021/hash/49ad23d1ec9fa4bd8d77d02681df5cfa-Abstract.html.
Dinh, Laurent, Jascha Sohl-Dickstein, and Samy Bengio. 2017. “Density Estimation Using Real NVP.” Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), February. https://openreview.net/forum?id=HkpbnH9lx.
Doerr, Andreas, Christian Daniel, Martin Schiegg, et al. 2018. “Probabilistic Recurrent State-Space Models.” Proceedings of the 35th International Conference on Machine Learning, July, 1280–89. https://proceedings.mlr.press/v80/doerr18a.html.
Donahue, Chris, Julian McAuley, and Miller Puckette. 2018. “Adversarial Audio Synthesis.” Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), September. https://openreview.net/forum?id=ByMVTsR5KQ.
Drouin, Alexandre, Étienne Marcotte, and Nicolas Chapados. 2022. “TACTiS: Transformer-Attentional Copulas for Time Series.” Proceedings of the 39th International Conference on Machine Learning, June, 5447–93. https://proceedings.mlr.press/v162/drouin22a.html.
Du, Heming, Shouguo Du, and Wen Li. 2023. “Probabilistic Time Series Forecasting with Deep Non-Linear State Space Models.” CAAI Transactions on Intelligence Technology 8 (1): 3–13. https://doi.org/10.1049/cit2.12085.
Du, Yilun, and Igor Mordatch. 2019. “Implicit Generation and Modeling with Energy Based Models.” Advances in Neural Information Processing Systems 32. https://proceedings.neurips.cc/paper/2019/hash/378a063b8fdb1db941e34f4bde584c7d-Abstract.html.
Durbin, James, and Siem Jan Koopman. 2012. Time Series Analysis by State Space Methods. 2nd ed. Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199641178.001.0001.
Feng, Shibo, Chunyan Miao, Ke Xu, et al. 2024. “Multi-Scale Attention Flow for Probabilistic Time Series Forecasting.” IEEE Transactions on Knowledge and Data Engineering 36 (5): 2056–68. https://doi.org/10.1109/TKDE.2023.3319672.
Feng, Shibo, Chunyan Miao, Zhong Zhang, and Peilin Zhao. 2024. “Latent Diffusion Transformer for Probabilistic Time Series Forecasting.” Proceedings of the AAAI Conference on Artificial Intelligence 38 (March): 11979–87. https://doi.org/10.1609/aaai.v38i11.29085.
Gasthaus, Jan, Konstantinos Benidis, Yuyang Wang, et al. 2019. “Probabilistic Forecasting with Spline Quantile Function RNNs.” Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, April, 1901–10. https://proceedings.mlr.press/v89/gasthaus19a.html.
Gneiting, Tilmann, and Matthias Katzfuss. 2014. “Probabilistic Forecasting.” Annual Review of Statistics and Its Application 1 (1): 125–51. https://doi.org/10.1146/annurev-statistics-062713-085831.
Gneiting, Tilmann, Daniel Wolffram, Johannes Resin, et al. 2023. “Model Diagnostics and Forecast Evaluation for Quantiles.” Annual Review of Statistics and Its Application 10 (1): 597–621. https://doi.org/10.1146/annurev-statistics-032921-020240.
Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, et al. 2014. “Generative Adversarial Nets.” Advances in Neural Information Processing Systems 27. https://proceedings.neurips.cc/paper_files/paper/2014/hash/f033ed80deb0234979a61f95710dbe25-Abstract.html.
Gouttes, Adèle, Kashif Rasul, Mateusz Koren, Johannes Stephan, and Tofigh Naghibi. 2021. “Probabilistic Time Series Forecasting with Implicit Quantile Networks.” Proceedings of the Time Series Workshop at 38th International Conference on Machine Learning.
Graikos, Alexandros, Nikolay Malkin, Nebojsa Jojic, and Dimitris Samaras. 2022. “Diffusion Models as Plug-and-Play Priors.” Advances in Neural Information Processing Systems 35 (December): 14715–28. https://proceedings.neurips.cc/paper_files/paper/2022/hash/5e6cec2a9520708381fe520246018e8b-Abstract-Conference.html.
Größer, Joshua, and Ostap Okhrin. 2022. “Copulae: An Overview and Recent Developments.” WIREs Computational Statistics 14 (3): e1557. https://doi.org/10.1002/wics.1557.
Harsha, Pavithra, Ramesh Natarajan, and Dharmashankar Subramanian. 2021. “A Prescriptive Machine-Learning Framework to the Price-Setting Newsvendor Problem.” INFORMS Journal on Optimization 3 (3): 227–53. https://doi.org/10.1287/ijoo.2019.0046.
Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” Advances in Neural Information Processing Systems 33: 6840–51. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html.
Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computing 9 (8): 1735–80. https://doi.org/10.1162/neco.1997.9.8.1735.
Horn, Roger A., and Charles R. Johnson. 2012. “Chapter 7: Positive Definite and Semidefinite Matrices.” In Matrix Analysis, 2nd ed. Cambridge University Press.
Huang, Chin-Wei, Ricky T. Q. Chen, Christos Tsirigotis, and Aaron Courville. 2020. “Convex Potential Flows: Universal Probability Distributions with Optimal Transport and Convex Optimization.” Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), October. https://openreview.net/forum?id=te7PVH1sPxJ.
Huang, Chin-Wei, David Krueger, Alexandre Lacoste, and Aaron Courville. 2018. “Neural Autoregressive Flows.” Proceedings of the 35th International Conference on Machine Learning, July, 2078–87. https://proceedings.mlr.press/v80/huang18d.html.
Hyndman, Rob J, Anne B Koehler, Ralph D Snyder, and Simone Grose. 2002. “A State Space Framework for Automatic Forecasting Using Exponential Smoothing Methods.” International Journal of Forecasting 18 (3): 439–54. https://doi.org/10.1016/S0169-2070(01)00110-8.
Hyndman, Robin John, Anne B Koehler, J Keith Ord, and Ralph David Snyder. 2008. Forecasting with Exponential Smoothing: The State Space Approach. Springer.
Hyvärinen, Aapo. 2005. “Estimation of Non-Normalized Statistical Models by Score Matching.” Journal of Machine Learning Research 6 (24): 695–709. http://jmlr.org/papers/v6/hyvarinen05a.html.
Jawed, Shayan, and Lars Schmidt-Thieme. 2022. “GQFormer: A Multi-Quantile Generative Transformer for Time Series Forecasting.” 2022 IEEE International Conference on Big Data (Big Data), December, 992–1001. https://doi.org/10.1109/BigData55660.2022.10020927.
Jeon, Jinsung, Jeonghak Kim, Haryong Song, Seunghyeon Cho, and Noseong Park. 2022. “GT-GAN: General Purpose Time Series Synthesis with Generative Adversarial Networks.” Advances in Neural Information Processing Systems 35 (December): 36999–7010. https://proceedings.neurips.cc/paper_files/paper/2022/hash/f03ce573aa8bce26f77b76f1cb9ee979-Abstract-Conference.html.
Jiang, Yifan, Shiyu Chang, and Zhangyang Wang. 2021. “TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up.” Advances in Neural Information Processing Systems 34: 14745–58. https://proceedings.neurips.cc/paper_files/paper/2021/hash/7c220a2091c26a7f5e9f1cfb099511e3-Abstract.html.
Julier, S. J., and J. K. Uhlmann. 2004. “Unscented Filtering and Nonlinear Estimation.” Proceedings of the IEEE 92 (3): 401–22. https://doi.org/10.1109/JPROC.2003.823141.
Kalman, R. E. 1960. “A New Approach to Linear Filtering and Prediction Problems.” Journal of Basic Engineering 82 (1): 35–45. https://doi.org/10.1115/1.3662552.
Kan, Kelvin, François-Xavier Aubet, Tim Januschowski, et al. 2022. “Multivariate Quantile Function Forecaster.” Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, May, 10603–21. https://proceedings.mlr.press/v151/kan22a.html.
Kingma, Diederik P., and Max Welling. 2014. “Auto-Encoding Variational Bayes.” In Proceedings of the Second International Conference on Learning Representations, edited by Yoshua Bengio and Yann LeCun. Banff, AB, Canada. http://arxiv.org/abs/1312.6114.
Ko, Jonathan, and Dieter Fox. 2011. “Learning GP-BayesFilters via Gaussian Process Latent Variable Models.” Autonomous Robots 30 (1): 3–23. https://doi.org/10.1007/s10514-010-9213-0.
Koenker, Roger. 2005. Quantile Regression. Cambridge University Press.
Koenker, Roger, and Gilbert Bassett. 1978. “Regression Quantiles.” Econometrica 46 (1): 33–50. https://doi.org/10.2307/1913643.
Kollovieh, Marcel, Abdul Fatir Ansari, Michael Bohlke-Schneider, Jasper Zschiegner, Hao Wang, and Yuyang (Bernie) Wang. 2023. “Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting.” Advances in Neural Information Processing Systems 36 (December): 28341–64. https://proceedings.neurips.cc/paper_files/paper/2023/hash/5a1a10c2c2c9b9af1514687bc24b8f3d-Abstract-Conference.html.
Kong, Zhifeng, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. 2020. “DiffWave: A Versatile Diffusion Model for Audio Synthesis.” Proceedings of the Ninth International Conference on Learning Representations, October. https://openreview.net/forum?id=a-xFK8Ymz5J.
Koochali, Alireza, Andreas Dengel, and Sheraz Ahmed. 2021. “If You Like It, GAN It – Probabilistic Multivariate Times Series Forecast with GAN.” Engineering Proceedings 5 (1). https://doi.org/10.3390/engproc2021005040.
LeCun, Yann, Sumit Chopra, Raia Hadsell, Marc’Aurelio Ranzato, and Fu Jie Huang. 2006. “A Tutorial on Energy-Based Learning.” Predicting Structured Data, August.
LeCun, Yann, Corinna Cortes, and Christopher JC Burges. 1998. The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/.
Li, Longyuan, Jihai Zhang, Junchi Yan, et al. 2021. “Synergetic Learning of Heterogeneous Temporal Sequences for Multi-Horizon Probabilistic Forecasting.” Proceedings of the AAAI Conference on Artificial Intelligence 35 (May): 8420–28. https://doi.org/10.1609/aaai.v35i10.17023.
Li, Xiang, John Thickstun, Ishaan Gulrajani, Percy S. Liang, and Tatsunori B. Hashimoto. 2022. “Diffusion-LM Improves Controllable Text Generation.” Advances in Neural Information Processing Systems 35 (December): 4328–43. https://proceedings.neurips.cc/paper_files/paper/2022/hash/1be5bc25d50895ee656b8c2d9eb89d6a-Abstract-Conference.html.
Li, Yan, Xinjiang Lu, Yaqing Wang, and Dejing Dou. 2022. “Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement.” Advances in Neural Information Processing Systems 35 (December): 23009–22. https://proceedings.neurips.cc/paper_files/paper/2022/hash/91a85f3fb8f570e6be52b333b5ab017a-Abstract-Conference.html.
Li, Yuxin, Wenchao Chen, Xinyue Hu, Bo Chen, Baolin Sun, and Mingyuan Zhou. 2023. “Transformer-Modulated Diffusion Models for Probabilistic Multivariate Time Series Forecasting.” Proceedings of the Twelfth International Conference on Learning Representations, October. https://openreview.net/forum?id=qae04YACHs.
Lim, Bryan, Sercan Ö. Arık, Nicolas Loeff, and Tomas Pfister. 2021. “Temporal Fusion Transformers for Interpretable Multi-Horizon Time Series Forecasting.” International Journal of Forecasting 37 (4): 1748–64. https://doi.org/10.1016/j.ijforecast.2021.03.012.
Lin, Shengsheng, Weiwei Lin, Wentai Wu, Feiyu Zhao, Ruichao Mo, and Haotong Zhang. 2023. SegRNN: Segment Recurrent Neural Network for Long-Term Time Series Forecasting. arXiv. https://doi.org/10.48550/arXiv.2308.11200.
Liu, Jingwei, Ling Yang, Hongyan Li, and Shenda Hong. 2024. “Retrieval-Augmented Diffusion Models for Time Series Forecasting.” Advances in Neural Information Processing Systems 37 (December): 2766–86. https://proceedings.neurips.cc/paper_files/paper/2024/hash/053ee34c0971568bfa5c773015c10502-Abstract-Conference.html.
Liu, Yong, Tengge Hu, Haoran Zhang, et al. 2023. “iTransformer: Inverted Transformers Are Effective for Time Series Forecasting.” Proceedings of the Twelfth International Conference on Learning Representations, October. https://openreview.net/forum?id=JePfAI8fah.
Liu, Yong, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2022. “Non-Stationary Transformers: Exploring the Stationarity in Time Series Forecasting.” Advances in Neural Information Processing Systems 35 (December): 9881–93. https://proceedings.neurips.cc/paper_files/paper/2022/hash/4054556fcaa934b0bf76da52cf4f92cb-Abstract-Conference.html.
Luo, Yihong, Siya Qiu, Xingjian Tao, Yujun Cai, and Jing Tang. 2025. “Energy-Calibrated VAE with Test Time Free Lunch.” In Computer Vision – ECCV 2024, edited by Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol. Springer Nature Switzerland. https://doi.org/10.1007/978-3-031-73013-9_19.
Mirza, Mehdi, and Simon Osindero. 2014. Conditional Generative Adversarial Nets. arXiv. https://doi.org/10.48550/arXiv.1411.1784.
Mukherjee, Srayanta, Devashish Shankar, Atin Ghosh, et al. 2018. “AR-MDN: Associative and Recurrent Mixture Density Networks for eRetail Demand Forecasting.” Proceedings of the VLDB Endowment 11 (5).
Neal, Radford M. 2011. “MCMC Using Hamiltonian Dynamics.” In Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC.
Nguyen, Nam, and Brian Quanz. 2021. “Temporal Latent Auto-Encoder: A Method for Probabilistic Multivariate Time Series Forecasting.” Proceedings of the AAAI Conference on Artificial Intelligence 35 (May): 9117–25. https://doi.org/10.1609/aaai.v35i10.17101.
Nie, Yuqi, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2022. “A Time Series Is Worth 64 Words: Long-Term Forecasting with Transformers.” Proceedings of the Eleventh International Conference on Learning Representations, September. https://openreview.net/forum?id=Jbdc0vTOcol.
Nikitin, Alexander, Letizia Iannucci, and Samuel Kaski. 2024. “TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series.” Advances in Neural Information Processing Systems 37 (December): 129042–61. https://proceedings.neurips.cc/paper_files/paper/2024/hash/e92bb8c76d1aa420171f51495ce56eaf-Abstract-Datasets_and_Benchmarks_Track.html.
Nix, D. A., and A. S. Weigend. 1994. “Estimating the Mean and Variance of the Target Probability Distribution.” Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94) 1 (June): 55–60 vol.1. https://doi.org/10.1109/ICNN.1994.374138.
Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2021. “Normalizing Flows for Probabilistic Modeling and Inference.” Journal of Machine Learning Research 22 (57): 1–64. http://jmlr.org/papers/v22/19-1028.html.
Papamakarios, George, Theo Pavlakou, and Iain Murray. 2017. “Masked Autoregressive Flow for Density Estimation.” Advances in Neural Information Processing Systems 30. https://proceedings.neurips.cc/paper_files/paper/2017/hash/6c1da886822c67822bcf3679d04369fa-Abstract.html.
Park, Youngsuk, Danielle Maddix, François-Xavier Aubet, Kelvin Kan, Jan Gasthaus, and Yuyang Wang. 2022. “Learning Quantile Functions Without Quantile Crossing for Distribution-Free Time Series Forecasting.” Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, May, 8127–50. https://proceedings.mlr.press/v151/park22a.html.
Pourahmadi, Mohsen. 2011. “Covariance Estimation: The GLM and Regularization Perspectives.” Statistical Science 26 (3): 369–87. https://doi.org/10.1214/11-STS358.
Rangapuram, Syama Sundar, Matthias W Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. 2018. “Deep State Space Models for Time Series Forecasting.” Advances in Neural Information Processing Systems 31. https://proceedings.neurips.cc/paper_files/paper/2018/hash/5cf68969fb67aa6082363a6d4e6468e2-Abstract.html.
Rasul, Kashif, Andrew Bennett, Pablo Vicente, et al. 2023. “VQ-TR: Vector Quantized Attention for Time Series Forecasting.” Proceedings of the Twelfth International Conference on Learning Representations, October. https://openreview.net/forum?id=IxpTsFS7mh.
Rasul, Kashif, Young-Jin Park, Max Nihlén Ramström, and Kyung-Min Kim. 2022. VQ-AR: Vector Quantized Autoregressive Probabilistic Time Series Forecasting. arXiv. https://doi.org/10.48550/arXiv.2205.15894.
Rasul, Kashif, Calvin Seward, Ingmar Schuster, and Roland Vollgraf. 2021. “Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting.” Proceedings of the 38th International Conference on Machine Learning, July, 8857–68. https://proceedings.mlr.press/v139/rasul21a.html.
Rasul, Kashif, Abdul-Saboor Sheikh, Ingmar Schuster, Urs M. Bergmann, and Roland Vollgraf. 2020. “Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows.” Proceedings of the Ninth International Conference on Learning Representations, October. https://openreview.net/forum?id=WiGQBFuVRv.
Ray, Evan L, Nutcha Wattanachit, Jarad Niemi, et al. 2020. “Ensemble Forecasts of Coronavirus Disease 2019 (COVID-19) in the US.” MedRXiv, 2020–08.
Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. 2014. “Stochastic Backpropagation and Approximate Inference in Deep Generative Models.” Proceedings of the 31st International Conference on Machine Learning, June, 1278–86. https://proceedings.mlr.press/v32/rezende14.html.
Roweis, Sam, and Zoubin Ghahramani. 1999. “A Unifying Review of Linear Gaussian Models.” Neural Computation 11 (2): 305–45. https://doi.org/10.1162/089976699300016674.
Salinas, David, Michael Bohlke-Schneider, Laurent Callot, Roberto Medico, and Jan Gasthaus. 2019. “High-Dimensional Multivariate Forecasting with Low-Rank Gaussian Copula Processes.” Advances in Neural Information Processing Systems 32. https://proceedings.neurips.cc/paper/2019/hash/0b105cf1504c4e241fcc6d519ea962fb-Abstract.html.
Salinas, David, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. “DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks.” International Journal of Forecasting 36 (3): 1181–91. https://doi.org/10.1016/j.ijforecast.2019.07.001.
Seeger, Matthias W, David Salinas, and Valentin Flunkert. 2016. “Bayesian Intermittent Demand Forecasting for Large Inventories.” Advances in Neural Information Processing Systems 29. https://papers.nips.cc/paper/2016/hash/03255088ed63354a54e0e5ed957e9008-Abstract.html.
Shen, Lifeng, Weiyu Chen, and James Kwok. 2023. “Multi-Resolution Diffusion Models for Time Series Forecasting.” Proceedings of the Twelfth International Conference on Learning Representations, October. https://openreview.net/forum?id=mmjnr0G8ZY.
Shen, Lifeng, and James Kwok. 2023. “Non-Autoregressive Conditional Diffusion Models for Time Series Prediction.” Proceedings of the 40th International Conference on Machine Learning, July, 31016–29. https://proceedings.mlr.press/v202/shen23d.html.
Sklar, M. 1959. “Fonctions de Répartition à n Dimensions Et Leurs Marges.” Annales de l’ISUP 8: 229–31.
Sohl-Dickstein, Jascha, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.” Proceedings of the 32nd International Conference on Machine Learning, June, 2256–65. https://proceedings.mlr.press/v37/sohl-dickstein15.html.
Sohn, Kihyuk, Honglak Lee, and Xinchen Yan. 2015. “Learning Structured Output Representation Using Deep Conditional Generative Models.” Advances in Neural Information Processing Systems 28. https://proceedings.neurips.cc/paper/2015/hash/8d55a249e6baa5c06772297520da2051-Abstract.html.
Song, Yang, Conor Durkan, Iain Murray, and Stefano Ermon. 2021. “Maximum Likelihood Training of Score-Based Diffusion Models.” Advances in Neural Information Processing Systems 34: 1415–28. https://proceedings.neurips.cc/paper/2021/hash/0a9fdbb17feb6ccb7ec405cfb85222c4-Abstract.html.
Sprangers, Olivier, Sebastian Schelter, and Maarten de Rijke. 2023. “Parameter-Efficient Deep Probabilistic Forecasting.” International Journal of Forecasting 39 (1): 332–45. https://doi.org/10.1016/j.ijforecast.2021.11.011.
Sun, Yinbo, Lintao Ma, Yu Liu, et al. 2022. “Memory Augmented State Space Model for Time Series Forecasting.” Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence 4 (July): 3451–57. https://doi.org/10.24963/ijcai.2022/479.
Tabak, E. G., and Cristina V. Turner. 2013. “A Family of Nonparametric Density Estimation Algorithms.” Communications on Pure and Applied Mathematics 66 (2): 145–64. https://doi.org/10.1002/cpa.21423.
Taieb, Souhaib Ben, Gianluca Bontempi, Amir F. Atiya, and Antti Sorjamaa. 2012. “A Review and Comparison of Strategies for Multi-Step Ahead Time Series Forecasting Based on the NN5 Forecasting Competition.” Expert Systems with Applications 39 (8): 7067–83. https://doi.org/10.1016/j.eswa.2012.01.039.
Tang, Binh, and David S Matteson. 2021. “Probabilistic Transformer For Time Series Analysis.” Advances in Neural Information Processing Systems 34: 23592–608. https://proceedings.neurips.cc/paper/2021/hash/c68bd9055776bf38d8fc43c0ed283678-Abstract.html.
Tarima, Sergey, and Zhanna Zenkova. 2020. “Use of Uncertain Additional Information in Newsvendor Models.” 2020 5th International Conference on Logistics Operations Management (GOL), October, 1–6. https://doi.org/10.1109/GOL49479.2020.9314751.
Tashiro, Yusuke, Jiaming Song, Yang Song, and Stefano Ermon. 2021. CSDI: Conditional Score-Based Diffusion Models for Probabilistic Time Series Imputation.” Advances in Neural Information Processing Systems 34: 24804–16. https://proceedings.neurips.cc/paper_files/paper/2021/hash/cfe8504bda37b575c70ee1a8276f3486-Abstract.html.
Theis, L., A. van den Oord, and M. Bethge. 2016. “A Note on the Evaluation of Generative Models.” Proceedings of the Fourth International Conference on Learning Representations, May, 1–10. http://www.iclr.cc/doku.php?id=iclr2016:main.
Tong, Junlong, Liping Xie, and Kanjian Zhang. 2022. “Probabilistic Decomposition Transformer for Time Series Forecasting.” Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), 478–86. https://epubs.siam.org/doi/abs/10.1137/1.9781611977653.ch54.
Uria, Benigno, Iain Murray, and Hugo Larochelle. 2013. “RNADE: The Real-Valued Neural Autoregressive Density-Estimator.” Advances in Neural Information Processing Systems 26. https://proceedings.neurips.cc/paper/2013/hash/53adaf494dc89ef7196d73636eb2451b-Abstract.html.
Vahdat, Arash, and Jan Kautz. 2020. “NVAE: A Deep Hierarchical Variational Autoencoder.” Advances in Neural Information Processing Systems 33: 19667–79. https://proceedings.neurips.cc/paper/2020/hash/e3b21256183cf7c2c7a66be163579d37-Abstract.html.
Van den Oord, Aaron, Oriol Vinyals, and Koray Kavukcuoglu. 2017. “Neural Discrete Representation Learning.” Advances in Neural Information Processing Systems 30. https://proceedings.neurips.cc/paper/2017/hash/7a98af17e63a0ac09ce2e96d03992fbc-Abstract.html.
Wan, Can, Jin Lin, Jianhui Wang, Yonghua Song, and Zhao Yang Dong. 2017. “Direct Quantile Regression for Nonparametric Probabilistic Forecasting of Wind Power Generation.” IEEE Transactions on Power Systems 32 (4): 2767–78. https://doi.org/10.1109/TPWRS.2016.2625101.
Wang, Xinyi, Lang Tong, and Qing Zhao. 2024. “Generative Probabilistic Time Series Forecasting and Applications in Grid Operations.” 2024 58th Annual Conference on Information Sciences and Systems (CISS), 1–6. https://doi.org/10.1109/CISS59072.2024.10480214.
Welling, Max, and Yee Whye Teh. 2011. “Bayesian Learning via Stochastic Gradient Langevin Dynamics.” Proceedings of the 28th International Conference on Machine Learning (Madison, WI, USA), ICML’11, June, 681–88.
Wen, Ruofeng, and Kari Torkkola. 2019. “Deep Generative Quantile-Copula Models for Probabilistic Forecasting.” Proceedings of the Time Series Workshop at 36th International Conference on Machine Learning (Long Beach, California), July.
Wen, Ruofeng, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. 2018. “A Multi-Horizon Quantile Recurrent Forecaster.” Time Series Workshop at 31st Conference on Neural Information Processing Systems (Long Beach, California).
Wilson, Andrew G, and Zoubin Ghahramani. 2010. “Copula Processes.” Advances in Neural Information Processing Systems 23. https://papers.nips.cc/paper_files/paper/2010/hash/fc8001f834f6a5f0561080d134d53d29-Abstract.html.
Wu, Haixu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. 2022. “TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis.” Proceedings of the Eleventh International Conference on Learning Representations, September. https://openreview.net/forum?id=ju_Uqw384Oq.
Wu, Haixu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. 2021. “Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting.” In Advances in Neural Information Processing Systems, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, vol. 34. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2021/file/bcc0d400288793e8bdcd7c19a8ac0c2b-Paper.pdf.
Wu, Sifan, Xi Xiao, Qianggang Ding, Peilin Zhao, Ying Wei, and Junzhou Huang. 2020. “Adversarial Sparse Transformer for Time Series Forecasting.” Advances in Neural Information Processing Systems 33: 17105–15. https://proceedings.neurips.cc/paper_files/paper/2020/hash/c6b8c8d762da15fa8dbbdfb6baf9e260-Abstract.html.
Wu, Yilei, Yingli Qin, and Mu Zhu. 2020. “High-Dimensional Covariance Matrix Estimation Using a Low-Rank and Diagonal Decomposition.” Canadian Journal of Statistics 48 (2): 308–37. https://doi.org/10.1002/cjs.11532.
Yan, Tijin, Hengheng Gong, Yongping He, Yufeng Zhan, and Yuanqing Xia. 2024. “Probabilistic Time Series Modeling with Decomposable Denoising Diffusion Model.” Proceedings of the 41st International Conference on Machine Learning, July, 55759–77. https://proceedings.mlr.press/v235/yan24b.html.
Yang, Yiyuan, Ming Jin, Haomin Wen, et al. 2024. “A Survey on Diffusion Models for Time Series and Spatio-Temporal Data.” arXiv. https://doi.org/10.48550/arXiv.2404.18886.
Yeğin, Melike Nur, and Mehmet Fatih Amasyalı. 2024. “Generative Diffusion Models: A Survey of Current Theoretical Developments.” Neurocomputing 608 (December): 128373. https://doi.org/10.1016/j.neucom.2024.128373.
Yoon, Jinsung, Daniel Jarrett, and Mihaela van der Schaar. 2019. “Time-Series Generative Adversarial Networks.” Advances in Neural Information Processing Systems 32. https://proceedings.neurips.cc/paper_files/paper/2019/hash/c9efe5f26cd17ba6216bbe2a7d26d490-Abstract.html.
Zeng, Ailing, Muxi Chen, Lei Zhang, and Qiang Xu. 2023. “Are Transformers Effective for Time Series Forecasting?” Proceedings of the AAAI Conference on Artificial Intelligence 37 (June): 11121–28. https://doi.org/10.1609/aaai.v37i9.26317.
Zhang, Jiawen, Xumeng Wen, Zhenwei Zhang, Shun Zheng, Jia Li, and Jiang Bian. 2024. “ProbTS: Benchmarking Point and Distributional Forecasting Across Diverse Prediction Horizons.” In Advances in Neural Information Processing Systems, edited by A. Globerson, L. Mackey, D. Belgrave, et al., vol. 37. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2024/file/55f2a27b1ac39dbfdd0fc83742dc87d7-Paper-Datasets_and_Benchmarks_Track.pdf.
Zhou, Haoyi, Shanghang Zhang, Jieqi Peng, et al. 2021. “Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting.” Proceedings of the AAAI Conference on Artificial Intelligence 35 (May): 11106–15. https://doi.org/10.1609/aaai.v35i12.17325.
Zhou, Yunyi, Zhixuan Chu, Yijia Ruan, Ge Jin, Yuchen Huang, and Sheng Li. 2023. “pTSE: A Multi-Model Ensemble Method for Probabilistic Time Series Forecasting.” Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence 4 (August): 4684–92. https://doi.org/10.24963/ijcai.2023/521.