Long-Term Time Series Forecasting
July 14, 2025
Introduction to LTSF
TSF has a long and extensive literature history. However, this work will primarily focus on recent developments in (long-term) TSF. For a comprehensive overview of earlier research and traditional TSF methods, readers are referred to existing surveys (Box 2013; Box et al. 2015; De Gooijer and Hyndman 2006; Mahalakshmi et al. 2016; Hamilton 1994). Some traditional statistical time series forecasting models, such as ARIMA (Box and Pierce 1970) or Prophet (Taylor and Letham 2018), remain popular to this day (Long et al. 2023; Ning et al. 2022; Albahli 2025). However, they are often fit separately to each time series, come with many prior assumptions, and their performance may deteriorate for long-range forecasting, making them unsuitable for large-scale TSF tasks (Qin et al. 2017; Li et al. 2019). Therefore, similar to other domains, TSF research has shown increasing interest in deep learning based approaches (Benidis et al. 2022; Hewamalage et al. 2021; Lara-Benítez et al. 2021).
At first, primarily recurrent neural networks (RNNs), which are specifically designed for sequential data, were adopted in the form of sequence-to-sequence architectures (Sutskever et al. 2014). Many SOTA results on TSF tasks with shorter forecasting horizons stem from models of this architecture, e.g. TimeGrad (Rasul et al. 2021), DA-RNN (Qin et al. 2017) or DeepAR (Salinas et al. 2020). In contrast, convolutional neural networks (CNNs), which are designed for tasks where the input data has a known sequential or spatial structure, such as images or audio signals (Dosovitskiy et al. 2021; Van den Oord et al. 2016) but also time series (Benidis et al. 2022; Goodfellow et al. 2016), began to demonstrate superior performance over RNNs in various sequence modeling tasks, e.g. audio generation or machine translation (Van den Oord et al. 2016; Kalchbrenner et al. 2017). Motivated by these results, Bai et al. (2018) conducted a comprehensive comparison between CNNs and RNNs across a diverse set of sequential learning benchmarks. Their findings showed that a simple convolutional architecture, the Temporal Convolutional Network (TCN), consistently outperformed RNN-based models while also benefiting from longer effective memory. These promising results spurred increased interest in applying CNNs to time series forecasting as well. For instance, Borovykh et al. (2018) adapted the autoregressive WaveNet CNN architecture (Van den Oord et al. 2016), originally developed for raw audio synthesis, to the TSF domain and demonstrated superior performance over LSTM-based models. DeepGLO (Sen et al. 2019) combines a matrix factorization model with a TCN and outperforms traditional and RNN-based methods. However, despite their promising empirical performance, CNNs have not emerged as a definitive replacement for RNNs. Instead, the two architectures were generally viewed as complementary, with approximation theory (Jiang et al. 
2021) supporting the idea that each brings distinct strengths to time series modeling. Therefore, hybrid models like LSTNet (Lai et al. 2018) or DCRNN (Li et al. 2018) gained popularity by combining CNNs and RNNs, effectively capturing both short-term dependencies and inter-series correlations through CNNs, while leveraging RNNs for modeling longer-term temporal trends. Nevertheless, both RNNs and CNNs exhibit inherent limitations when it comes to longer forecasting horizons. The main limitation of RNNs is their long information propagation path, which directly leads to numerous issues. In particular, RNNs struggle to capture long-term dependencies and suffer from inefficient sequential computation (Jia et al. 2024). Furthermore, although RNN cells such as LSTM (Hochreiter and Schmidhuber 1997) or GRU (Cho et al. 2014) were designed to tackle vanishing and exploding gradients (Bengio et al. 1994), these problems often could not be mitigated sufficiently for longer input sequences, leading to an unstable training process (Zhou et al. 2021). On the other hand, CNNs are limited by their local receptive fields; while some argue that they offer better long-term memory than RNNs (Bai et al. 2018), their 1D convolutions can only model variations in adjacent time steps (Wu et al. 2022). Therefore, compared to models with global receptive fields, e.g. Transformers (Vaswani et al. 2017) or MLP-based architectures (Zeng et al. 2023), CNNs often fall short in handling the complexity of long-term temporal dependencies (Donghao and Xue 2023). Altogether, these limitations are critical in TSF tasks, which often require models to capture both short- and long-term repeating patterns (Lai et al. 2018). In the context of long-term TSF, the importance of modeling long-range dependencies becomes even more pronounced, as they tend to be more dispersed and harder to learn (Li et al. 2019).
In response to these challenges, Transformer-based models (Vaswani et al. 2017) were proposed as a promising alternative (Zhou et al. 2021; Li et al. 2019), offering a self-attention mechanism that allows the model to access the entire input sequence at once, facilitating parallel processing and enabling global context understanding. Furthermore, Transformers have displayed state-of-the-art performance in capturing long-range dependency structures (Wen et al. 2023) and are SOTA across various domains, e.g. natural language processing (Brown et al. 2020), speech (Kim et al. 2022) and computer vision (Dosovitskiy et al. 2021). LogSparse, proposed by Li et al. (2019), was among the first Transformer-based methods applied to TSF. It demonstrated superior performance in modeling long-term dependencies compared to DeepAR and statistical models. Although Li et al. (2019) extended the forecasting horizon relative to earlier work, the input and output sequences they considered were still short compared to modern LTSF tasks. A breakthrough came when Zhou et al. (2021) introduced Informer and formalized the modern LTSF problem setting by substantially extending input and prediction horizons. Informer managed to outperform prior SOTA models including LogSparse, DeepAR, other RNN-based and statistical baselines in LTSF. A key innovation of Informer came with its switch to a direct multi-step (DMS) strategy (Zeng et al. 2023), which contrasts with the iterated multi-step (IMS) approach used in earlier methods. Models that follow an IMS strategy are prone to slow inference and error accumulation, issues that become particularly problematic with longer forecast lengths (Zhou et al. 2021).
In succession, the DMS strategy was successfully adopted by most SOTA LTSF models, see Table [tab:ltsf]. However, DMS forecasting is not novel. In fact, the first occurrence of a DMS prediction model can be dated back to Cox (1961). Over the years, several theoretical and empirical studies have shown that the direct strategy performs better when models are misspecified, i.e. the model class does not contain the true model, while the recursive approach tends to be superior for well-specified models (Weiss 1991; Tiao and Tsay 1994; Ing 2007; Chevillon and Hendry 2005). In summary, Chevillon (2007) showed that DMS is less biased, more stable, more efficient and more robust to model misspecification. Later on, Taieb and Atiya (2016) investigated different multi-step strategies with NNs in TSF and concluded that IMS is preferable for short-term forecasts when the model is likely well-specified, whereas DMS is better suited for long time series or situations where minimizing bias is crucial. Despite these findings, the IMS strategy remained more popular at the time, partly because it closely mirrors well-studied autoregressive and Markovian modeling assumptions while also benefiting from shorter forecasting horizons (Wen et al. 2018). Moreover, DMS was regarded as costly, since, without cross-learning, it required training separate models for each horizon step (Bontempi et al. 2013). However, this drawback became negligible with newer architectures efficiently sharing parameters across time steps, for example only requiring small changes in the prediction head while enabling faster prediction speeds (Zhou et al. 2021). Prior to Informer, other deep learning models also adopted DMS strategies. For instance, MQ-RNN and MQ-CNN (Wen et al. 2018) use shared-parameter decoders at each time step to produce forecasts. Building on MQ-CNN, Wen and Torkkola (2019) added a generative quantile copula, improving forecast quality. NBeats (Oreshkin et al. 
2019) is built on a deep residual stack of MLPs, whereas DeepTCN (Chen et al. 2019) is a CNN-based DMS approach. Nonetheless, the DMS strategy has important drawbacks: it treats the forecasted points as independent, overlooking their mutual dependencies (Kline 2004; Bontempi et al. 2013), and it must be retrained whenever the forecast horizon is extended.
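To make the IMS/DMS contrast concrete, the following minimal sketch contrasts recursive one-step forecasting against a single direct map to all horizon steps. The toy series and the plain least-squares linear models are illustrative assumptions, not any specific published method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy series: a noisy daily cycle.
t = np.arange(400)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

L, H = 48, 12  # look-back window and forecast horizon

def make_windows(x, lookback, horizon):
    """Slice a 1-D series into (input window, target window) pairs."""
    X, Y = [], []
    for i in range(len(x) - lookback - horizon + 1):
        X.append(x[i:i + lookback])
        Y.append(x[i + lookback:i + lookback + horizon])
    return np.array(X), np.array(Y)

X, Y = make_windows(series, L, H)

# IMS: fit a one-step-ahead model, then roll it forward recursively,
# feeding each prediction back into the input window.
w_ims, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)

def forecast_ims(window):
    window = window.copy()
    preds = []
    for _ in range(H):
        y_hat = window @ w_ims
        preds.append(y_hat)
        window = np.roll(window, -1)
        window[-1] = y_hat  # errors from here on can accumulate
    return np.array(preds)

# DMS: one direct linear map from the window to all H future steps at once.
W_dms, *_ = np.linalg.lstsq(X, Y, rcond=None)

def forecast_dms(window):
    return window @ W_dms

last_window = series[-L:]
print(forecast_ims(last_window).shape, forecast_dms(last_window).shape)
```

Note how the IMS forecaster feeds its own predictions back into the window, which is exactly where errors accumulate over long horizons, while the DMS map produces all \(H\) steps in one shot at the cost of treating them as independent outputs.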
The breakthrough of Informer led to a rising adoption of LTSF models, specifically Transformer models. However, despite their advantages, the memory and time complexity of self-attention in Transformers grows quadratically \(O(L^2)\) with the input length \(L\), a major bottleneck for the long input sequences present in LTSF (Zhou et al. 2021). Hence, many of the first Transformer-based models for LTSF focused on improving the efficiency of the attention module; Wen et al. (2023) classify these approaches into two branches. On the one hand, models such as LogSparse (Li et al. 2019) or Pyraformer (S. Liu et al. 2021) enforce a sparsity bias in the attention module. On the other hand, Informer (Zhou et al. 2021) or FEDformer (Zhou et al. 2022) exploit low-rank properties of the self-attention matrix. In their respective LTSF studies, each model manages to outperform previous traditional and RNN-based SOTA methods, such as ARIMA, Prophet or DeepAR, on a variety of LTSF data sets (Zhou et al. 2021; Wu et al. 2021; S. Liu et al. 2021; Li et al. 2019; Zhou et al. 2022). Despite these performances, Zeng et al. (2023) point out that these models were evaluated solely against IMS approaches and suggest that the observed improvement is primarily due to the adoption of the DMS strategy rather than the Transformer architecture itself. To investigate this, Zeng et al. (2023) introduce DLinear and NLinear, two simple linear DMS models, which were able to outperform the Transformer-based methods on multiple benchmarks, thus challenging the effectiveness of Transformers on LTSF tasks. An important aspect of DLinear and NLinear is that they are channel-independent (CI) methods; they therefore avoid modeling potentially misleading cross-channel dependencies (Nie et al. 2022). In contrast, many previous methods (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 
2022) tried to incorporate information from all channels via a channel-dependent (CD) strategy, but this approach appeared to be ineffective in comparison. Building on the success of Zeng et al. (2023) with the CI strategy, many LTSF models adopted it successfully, see Table [tab:ltsf]. Furthermore, L. Han, Ye, et al. (2024) investigate the relation between CI and CD methods in more depth. By comparing a linear CI model to its CD counterpart, they propose that the CI approach exhibits less distribution shift, because the sum of correlation differences between train and test data has lower variation than the correlation differences of individual channels. Subsequently, L. Han, Ye, et al. (2024) propose that CD methods have high capacity and low robustness, whereas CI approaches have low capacity and high robustness. They conclude that robustness is often more important for real-world non-stationary time series with distribution shifts; therefore, CI methods often perform better.
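The capacity/robustness trade-off between CD and CI can be illustrated with two linear forecasters on synthetic data. The shapes and data below are hypothetical; the sketch only mirrors the linear CI-versus-CD comparison in spirit:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, H, C = 256, 36, 12, 3  # samples, look-back, horizon, channels

# Hypothetical multivariate windows and targets.
X = rng.standard_normal((N, L, C))
Y = rng.standard_normal((N, H, C))

# CD: one joint linear map over the flattened (L * C) input predicts all
# channels together, so cross-channel weights are modeled explicitly.
W_cd, *_ = np.linalg.lstsq(X.reshape(N, L * C), Y.reshape(N, H * C),
                           rcond=None)

# CI: a single per-channel map shared across all channels; each channel
# is forecast from its own history only.
X_ci = X.transpose(0, 2, 1).reshape(N * C, L)  # channels become samples
Y_ci = Y.transpose(0, 2, 1).reshape(N * C, H)
W_ci, *_ = np.linalg.lstsq(X_ci, Y_ci, rcond=None)

# The CD map has C^2 times more parameters: high capacity, low robustness.
print(W_cd.shape, W_ci.shape)  # (108, 36) vs (36, 12)
```

The CD weight matrix grows quadratically with the channel count, which is precisely the extra capacity that can overfit spurious cross-channel correlations under distribution shift.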
Since the work of Zeng et al. (2023) challenged the effectiveness of Transformers in LTSF, the door was opened for other architectures to regain ground. In what follows, some important recent LTSF models are briefly described; for a detailed categorization of these models, see Table [tab:ltsf].
MLP architectures.
The success of DLinear (Zeng et al. 2023) revived interest in pure MLP architectures for LTSF. At the same time, the computer vision community saw the rise of MLP-Mixer models (Tolstikhin et al. 2021; H. Liu et al. 2021; Touvron et al. 2023), which use simple MLPs to mix information within and across image input patches, achieving competitive results without relying on convolutions or self-attention. Building on this, TSMixer (Ekambaram et al. 2023) adapts the Mixer architecture for LTSF, leveraging its natural fit for sequential data, as input order is preserved. TSMixer uses a patch-based MLP backbone enhanced with online reconciliation heads that capture hierarchical structure and cross-channel dependencies. Following TSMixer, several studies extended the idea to address specific challenges in time series modeling. TimeMixer (S. Wang et al. 2023) leverages multi-scale mixing, separating finer seasonal patterns from coarser trends through novel mixing blocks. U-Mixer (Ma et al. 2024) tackles the issue of non-stationarity by arranging MLP encoder-decoder blocks in a U-Net structure (Ronneberger et al. 2015) while also introducing a stationarity correction mechanism. Furthermore, HDMixer (Huang, Shen, et al. 2024) improves on fixed-size patching via length-extendable patching while also modeling hierarchical short- and long-range dynamics. Beyond Mixer-based architectures, a range of MLP-centric models have emerged that take alternative approaches to enhancing time series forecasting performance. NHITS (Challu et al. 2023) extends NBEATS (Oreshkin et al. 2019) by introducing hierarchical interpolation and multi-rate sampling to sequentially assemble forecasts across multiple temporal resolutions. FreTS (Yi et al. 2023) operates entirely in the frequency domain, using MLPs to learn the real and imaginary components of transformed series. CycleNet (Lin, Lin, Hu, et al. 
2024) leverages residual cycle forecasting to explicitly model periodic components. SOFTS (L. Han, Chen, et al. 2024) proposes a centralized STAR module to model inter-channel relationships more efficiently than attention mechanisms. Finally, TiDE (Das et al. 2023) employs a simple MLP-based encoder-decoder framework that combines the speed of linear models with the ability to capture nonlinear dependencies.
Transformers.
Despite the success of MLP-based approaches, Transformers remained the popular choice for LTSF tasks, see Table [tab:ltsf]. One reason for this was the introduction of PatchTST (Nie et al. 2022), which marked a turning point for Transformer-based models in time series forecasting. It adopts the CI strategy of DLinear (Zeng et al. 2023) while also introducing patching to TSF. Patching, inspired by Vision Transformers (Dosovitskiy et al. 2021), segments a time series into subseries-level patches. It allows the model to capture local semantic patterns, reduce attention complexity, and extend its receptive field, significantly boosting long-term forecasting accuracy (Nie et al. 2022). As a result, patching has since become a standard practice in time series Transformers, widely adopted in models like Crossformer (Zhang and Yan 2022), MCFormer (W. Han et al. 2024) and Pathformer (P. Chen et al. 2023). In addition to its success in Transformer-based models, patching has been adopted across other architectural families, including MLPs (S.-A. Chen et al. 2023), CNNs (Gong et al. 2024), and RNNs (Lin et al. 2023). However, the dominance of classic fixed-length patching has recently been challenged. The MLP-based HDMixer (Huang, Shen, et al. 2024) critiques the inflexibility of fixed-length patches, which can lead to information loss at the patch boundaries. It proposes length-extendable patches to better preserve local structure. In addition, DeformableTST (Luo and Wang 2024) highlights that modern Transformers have become overly reliant on patching to achieve strong performance, which limits their applicability in scenarios with short input sequences or tasks unsuited to patching. To address this, DeformableTST introduces deformable attention, a data-driven sparse attention mechanism capable of focusing on important time points without explicit patching, allowing the model to generalize across a broader range of forecasting tasks. 
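As a minimal illustration of patching (with hypothetical window and patch sizes, not the exact PatchTST configuration):

```python
import numpy as np

def patchify(x, patch_len, stride):
    """Segment a 1-D series into overlapping subseries-level patches."""
    n_patches = (len(x) - patch_len) // stride + 1
    return np.stack([x[i * stride:i * stride + patch_len]
                     for i in range(n_patches)])

x = np.arange(336, dtype=float)  # hypothetical look-back window
patches = patchify(x, patch_len=16, stride=8)
print(patches.shape)  # (41, 16): 41 patch tokens instead of 336 time steps
```

With these sizes, attention would operate over 41 patch tokens rather than 336 individual time steps, shrinking the quadratic attention cost accordingly while each token carries local subseries semantics.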
Lastly, several works have sought alternatives to patching through other input transformations. Fredformer (Piao et al. 2024) applies a Discrete Fourier Transform to overcome frequency bias in attention, enabling more balanced learning across frequency bands. iTransformer (Liu, Hu, et al. 2023) takes a different route by inverting the input dimensions, treating time points as tokens and leveraging attention to capture multivariate correlations, improving scalability and performance without altering the Transformer’s core components.
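The inversion idea itself amounts to swapping which input dimension is tokenized; a minimal sketch with hypothetical shapes:

```python
import numpy as np

batch, L, C = 2, 96, 7
x = np.random.randn(batch, L, C)  # hypothetical multivariate input

# Standard embedding: L temporal tokens, each containing the C variates
# observed at one time step.
temporal_tokens = x                    # (batch, 96, 7)

# Inverted embedding: each variate's entire history becomes one token,
# so attention captures multivariate correlations instead of time mixing.
variate_tokens = x.transpose(0, 2, 1)  # (batch, 7, 96)
print(variate_tokens.shape)
```

Since attention then runs over \(C\) variate tokens rather than \(L\) time steps, its cost scales with the channel count instead of the sequence length, which is what improves scalability for long look-back windows.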
Similar to patching, the standard Transformer encoder (Vaswani et al. 2017) has become the default modeling choice for Transformer-based time series models. In many cases, the decoder is simply replaced with a basic flatten-and-linear head, e.g. in MCFormer (W. Han et al. 2024), PatchTST (Nie et al. 2022), iTransformer (Liu, Hu, et al. 2023) and Fredformer (Piao et al. 2024). On top of that, many models make targeted replacements in the vanilla Transformer encoder, where it is common to modify the attention mechanism: Triformer (Cirstea et al. 2022) reduces complexity via triangular patch attention, SDformer (Z. Zhou et al. 2024) enhances expressiveness with spectral filtering and dynamic directional attention, SCAT (C. Zhou et al. 2024) introduces alternating attention using spectral clustering centers and CARD (X. Wang et al. 2023) aligns attention across channels to better model inter-channel dependencies. Similarly, CATS (Lu et al. 2024) removes self-attention altogether, opting for a cross-attention-only framework. To better capture long-range dependencies, Kang et al. (2024) introduce spectral attention, a frequency-based mechanism that preserves temporal patterns and improves gradient flow. Outside of encoder-only Transformer models, several alternative architectures have been explored as well. SMARTformer (Yiduo Li et al. 2023) adopts a full encoder-decoder Transformer architecture, but deviates from the standard non-autoregressive decoder commonly used in time series models. Crossformer (Zhang and Yan 2022) also uses an encoder-decoder architecture, but places special emphasis on modeling cross-dimension dependencies. To this end, it proposes a Two-Stage Attention mechanism within a hierarchical encoder-decoder structure that separately captures temporal and inter-variable correlations. FPPformer (Shen et al. 2024) likewise retains the encoder-decoder setup but focuses on redesigning the decoder. 
It introduces a top-down decoder architecture, inspired by feature pyramid networks in computer vision (Lin et al. 2017), and enhances it with a combination of elementwise and patchwise attention to improve multiscale sequence reconstruction.
CNNs.
While Transformer- and MLP-based models have rapidly gained traction and become dominant in time series analysis, convolutional approaches have been falling out of favor (Donghao and Xue 2023). Nevertheless, several recent studies have achieved SOTA performance in LTSF using CNN-based models, renewing interest in convolutional methods. MICN (Wang et al. 2022) introduces a multi-scale convolutional architecture that captures both local features and global correlations, enabling separate modeling of trend and seasonality in time series forecasting. TimesNet (Wu et al. 2022) leverages the Fast Fourier Transform (FFT) to identify periodic patterns in time series data, which it then restructures into 2D tensors. Its core component, the TimesBlock, is built on a convolutional inception block (Szegedy et al. 2015), enabling it to effectively model both inter-period and intra-period variations. PatchMixer (Gong et al. 2024) and ModernTCN (Donghao and Xue 2023) process time series in patches (Nie et al. 2022) and then utilize depthwise separable convolutions to achieve SOTA performance with faster training and inference speeds. Moreover, ModernTCN redesigns the convolution block to better suit time series, resulting in larger effective receptive fields.
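The period-discovery and folding step used by TimesNet can be sketched as follows (on a synthetic series; the actual model additionally embeds the folded tensor and processes it with inception-style convolutions):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(480)
# Synthetic series with a dominant period of 24 steps plus noise.
x = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

# Pick the frequency with the largest FFT amplitude and fold the series
# into a (n_periods, period) 2D tensor, mirroring TimesNet's reshaping.
amps = np.abs(np.fft.rfft(x))
amps[0] = 0.0                  # ignore the zero-frequency (mean) component
k = int(np.argmax(amps))       # dominant frequency index
period = len(x) // k
folded = x[:(len(x) // period) * period].reshape(-1, period)
print(period, folded.shape)    # 24 (20, 24)
```

In the 2D layout, each row is one full cycle, so a 2D convolution over `folded` can mix neighboring time steps within a period (intra-period) and the same phase across periods (inter-period) simultaneously.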
RNNs.
Despite their limitations and generally subpar performance in LTSF, RNNs occasionally resurfaced in LTSF research. Lin et al. (2023) identify the large number of recurrent iterations as a primary drawback of traditional RNNs. To address this, they propose SegRNN, which adopts a patching mechanism to reduce the number of recurrent steps when processing input time series. In addition, they employ a DMS strategy for prediction, incorporating positional embeddings, as in Vaswani et al. (2017), which are combined with the last hidden state and then passed into a GRU cell with shared parameters. Jia et al. (2023) introduced WITRAN, which operates on a rearranged 2D time series, i.e. a matrix of patches inspired by Wu et al. (2022). They then propose a novel RNN cell alongside a recurrent acceleration network, which processes the data points of the matrix vertically and horizontally, enabling parallel computation. Lastly, they decode the processed information with an MLP in a DMS fashion. Similarly, Jia et al. (2024) introduce TPGN, a dual-branch model that also uses a 2D representation to capture long- and short-term patterns. At its core is the Parallel Gated Network, which replaces the sequential structure of RNNs with a layer that aggregates information from previous time steps in parallel, reducing the propagation path to \(O(1)\).
Other model types.
Beyond common model archetypes, LTSF has recently seen novel architectures inspired by other domains. LLM-based models like LeRet (Huang, Zhou, et al. 2024), AutoTimes (Liu et al. 2024) or Time-LLM (Jin et al. 2023) leverage pre-trained language models by aligning time series with token-based representations, enabling few-shot and in-context forecasting. Graph-based models such as Ada-MSHyper (Shang et al. 2024), CrossGNN (Huang et al. 2023) and MSGNet (Cai et al. 2024) introduce graph structures to better capture multi-scale or inter-series correlations. Lastly, dynamical system-based approaches like Koopa (Liu, Li, et al. 2023) and Attraos (Hu et al. 2024) leverage Koopman embeddings to linearize complex dynamics or draw on chaos theory, respectively.
In addition to exploring different backbone NN architectures, studies have also examined the impact of other design choices.
Sparse models.
Despite the success of Zeng et al. (2023) with simple linear models, many of the previously discussed methods rely on significantly larger architectures with a large number of parameters. To counter the trend toward increasingly large models, some methods focus on more efficient, sparser models that often implement only one or a few linear layers. For instance, FITS (Xu et al. 2023), LightTS (Zhang et al. 2022), SSCNN (Deng et al. 2024), Attraos (Hu et al. 2024) and SparseTSF (Lin, Lin, Wu, et al. 2024) achieve performances comparable to SOTA methods while being several orders of magnitude smaller, resulting in faster training and inference speeds as well as a smaller memory footprint. These models first simplify the forecasting task by downsampling (Lin, Lin, Wu, et al. 2024; Zhang et al. 2022), by decomposition (Deng et al. 2024) or by operating in the frequency domain via FFT (Xu et al. 2023) or via phase space reconstruction (Hu et al. 2024). Then, they process the condensed representation with a smaller model, often containing only a single (non-)linear layer.
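A minimal sketch of this downsample-then-share idea, loosely in the spirit of SparseTSF's cross-period forecasting (the shapes and the zero-initialized weight are illustrative assumptions; in practice the layer would be trained):

```python
import numpy as np

L, H, period = 96, 24, 4  # look-back, horizon, assumed downsampling period

# One tiny linear layer shared by all interleaved subsequences; the zero
# initialization is purely illustrative.
W = np.zeros((L // period, H // period))

def forecast_sparse(x):
    sub = x.reshape(-1, period).T  # (period, L // period) subsequences
    out = sub @ W                  # forecast each subsequence with shared W
    return out.T.reshape(-1)       # re-interleave to a length-H forecast

x = np.random.randn(L)
print(forecast_sparse(x).shape, W.size)  # (24,) 144
```

A dense linear map from 96 inputs to 24 outputs would need 2,304 weights; sharing one small map across the interleaved subsequences uses only 144, which is where the order-of-magnitude parameter savings come from.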
Channel dependence.
The success of DLinear (Zeng et al. 2023) and PatchTST (Nie et al. 2022) with the CI strategy led to many subsequent CI models such as Pathformer (P. Chen et al. 2023), CATS (Kim et al. 2024) and DeformableTST (Luo and Wang 2024). However, growing interest in leveraging inter-series correlations led to a resurgence of CD methods, which can be broadly categorized by their mechanism for capturing cross-channel interactions. A large subset utilizes cross-channel attention, with models like Crossformer (Zhang and Yan 2022), CARD (X. Wang et al. 2023), Client (Gao et al. 2023) and MCformer (W. Han et al. 2024) incorporating attention modules to jointly model temporal and inter-channel dependencies. Another line of work applies spectral or frequency-based modeling, such as SDformer (Z. Zhou et al. 2024), Fredformer (Piao et al. 2024) and FreTS (Yi et al. 2023), which leverage frequency-domain representations to capture global dependencies and improve channel interaction modeling. Meanwhile, MLP-Mixer-based architectures offer an alternative to attention-heavy designs. For instance, TSMixer (Ekambaram et al. 2023) introduces hybrid channel modeling, while SOFTS (L. Han, Chen, et al. 2024) similarly proposes a centralized STAR module to fuse global and intra-channel representations. On another note, models like TimeMixer (S. Wang et al. 2023), ModernTCN (Donghao and Xue 2023) and MICN (Wang et al. 2022) explore multi-scale decomposition and convolutional modeling to disentangle and aggregate information across variables and temporal resolutions. Lastly, CrossGNN (Huang et al. 2023) applies graph-based modules to model cross-variable structure.
DMS dominance.
Although Li et al. (2019) were among the first to apply Transformers to LTSF in an IMS setting, nearly all major recent LTSF models adopt a DMS forecasting strategy, see Table [tab:ltsf]. This trend can be traced back to Informer (Zhou et al. 2021), which popularized non-autoregressive decoding to mitigate the error accumulation that IMS methods suffer in long-range prediction, as shown mathematically by Sun and Boning (2022). Even recurrent architectures, which are closely related to IMS forecasting, have adopted a DMS strategy for LTSF (Lin et al. 2023; Jia et al. 2023). Two recent works stand out as rare exceptions that reintroduce autoregressive principles into LTSF. SMARTformer (Yiduo Li et al. 2023) proposes a semi-autoregressive (SAR) decoding approach consisting of two key components: a segment autoregressive layer that generates the forecast iteratively in segments, and a non-autoregressive refining layer that globally refines the output in a DMS manner. This hybrid structure captures both local and global temporal patterns. Empirical results show that SMARTformer achieves consistent improvements in both univariate and multivariate forecasting tasks, while an ablation study highlights that other SOTA LTSF methods also benefit from a SAR decoder. On the other hand, AutoTimes (Liu et al. 2024) leverages the autoregressive nature of LLMs to forecast time series through token-wise next-step prediction. However, its main novelty lies in repurposing decoder-only LLMs for time series.
[tab:ltsf]
| Model | Venue | IMS/DMS | Backbone | CI/CD |
|---|---|---|---|---|
| LogSparse (Li et al. 2019) | NeurIPS’19 | IMS | Transformer (D) | CD |
| Autoformer (Wu et al. 2021) | NeurIPS’21 | DMS | Transformer (E-D) | CD |
| Informer (Zhou et al. 2021) | AAAI’21 | DMS | Transformer (E-D) | CD |
| Triformer (Cirstea et al. 2022) | IJCAI’22 | DMS | Transformer (E) | CD |
| LightTS (Zhang et al. 2022) | - | DMS | MLP | CD |
| Koopa (Liu, Li, et al. 2023) | NeurIPS’23 | DMS | Koopman Theory (Koopman 1931) | CD |
| CrossGNN (Huang et al. 2023) | NeurIPS’23 | DMS | GNN | CD |
| WITRAN (Jia et al. 2023) | NeurIPS’23 | DMS | RNN | CI |
| FreTS (Yi et al. 2023) | NeurIPS’23 | DMS | MLP | CD |
| MICN (Wang et al. 2022) | ICLR’23 | DMS | CNN | CD |
| TimesNet (Wu et al. 2022) | ICLR’23 | DMS | CNN | CD |
| Crossformer (Zhang and Yan 2022) | ICLR’23 | DMS | Transformer (E-D) | CD |
| PatchTST (Nie et al. 2022) | ICLR’23 | DMS | Transformer (E) | CI |
| DLinear (Zeng et al. 2023) | AAAI’23 | DMS | MLP | CI |
| NHITS (Challu et al. 2023) | AAAI’23 | DMS | MLP | CD |
| SMARTformer (Yiduo Li et al. 2023) | IJCAI’23 | IMS & DMS | Transformer (E-D) | CD |
| TSMixer (Ekambaram et al. 2023) | KDD’23 | DMS | MLP | CI/CD |
| TiDE (Das et al. 2023) | TMLR’23 | DMS | MLP | CI |
| SegRNN (Lin et al. 2023) | - | DMS | RNN | CI |
| Client (Gao et al. 2023) | - | DMS | Transformer (E) | CD |
| Attraos (Hu et al. 2024) | NeurIPS’24 | DMS | Chaos Theory (Devaney 2018) | CI |
| Ada-MSHyper (Shang et al. 2024) | NeurIPS’24 | DMS | HGNN (Feng et al. 2019) | CI |
| SSCNN (Deng et al. 2024) | NeurIPS’24 | DMS | CNN & Decomposition | CI |
| SOFTS (L. Han, Chen, et al. 2024) | NeurIPS’24 | DMS | MLP | CD |
| CycleNet (Lin, Lin, Hu, et al. 2024) | NeurIPS’24 | DMS | MLP | CI |
| CATS (Kim et al. 2024) | NeurIPS’24 | DMS | Transformer (E) | CI |
| DeformableTST (Luo and Wang 2024) | NeurIPS’24 | DMS | Transformer (E) | CI |
| TPGN (Jia et al. 2024) | NeurIPS’24 | DMS | RNN | CI |
| AutoTimes (Liu et al. 2024) | NeurIPS’24 | IMS | LLM (D) | CI |
| SparseTSF (Lin, Lin, Wu, et al. 2024) | ICML’24 | DMS | MLP | CI |
| SAMformer (Ilbert et al. 2024) | ICML’24 | DMS | Transformer (E) | CD |
| TimeMixer (S. Wang et al. 2023) | ICLR’24 | DMS | MLP | CD |
| Pathformer (P. Chen et al. 2023) | ICLR’24 | DMS | Transformer (E) | CI |
| Time-LLM (Jin et al. 2023) | ICLR’24 | DMS | LLM | CI |
| iTransformer (Liu, Hu, et al. 2023) | ICLR’24 | DMS | Transformer (E) | CD |
| FITS (Xu et al. 2023) | ICLR’24 | DMS | MLP | CI |
| CARD (X. Wang et al. 2023) | ICLR’24 | DMS | Transformer (E) | CD |
| ModernTCN (Donghao and Xue 2023) | ICLR’24 | DMS | CNN | CD |
| MSGNet (Cai et al. 2024) | AAAI’24 | DMS | GNN | CD |
| U-Mixer (Ma et al. 2024) | AAAI’24 | DMS | MLP | CD |
| HDMixer (Huang, Shen, et al. 2024) | AAAI’24 | DMS | MLP | CD |
| LeRet (Huang, Zhou, et al. 2024) | IJCAI’24 | DMS | LLM + Retentive Net | CI |
| PatchMixer (Gong et al. 2024) | IJCAI’24 | DMS | CNN | CI |
| SDformer (Z. Zhou et al. 2024) | IJCAI’24 | DMS | Transformer (E) | CD |
| SCAT (C. Zhou et al. 2024) | IJCAI’24 | DMS | Transformer (E) | CI |
| Fredformer (Piao et al. 2024) | KDD’24 | DMS | Transformer (E) | CD |
| MCformer (W. Han et al. 2024) | IoT-J’24 | DMS | Transformer (E) | CD |
In summary, the literature on point LTSF is extensive, with numerous methods achieving strong performance, making it challenging to determine a definitive state of the art. However, certain models, namely DLinear, PatchTST and iTransformer (Zeng et al. 2023; Nie et al. 2022; Liu, Hu, et al. 2023), have emerged as de facto standards for comparison, frequently adopted as baselines in a wide range of recent works (Jia et al. 2024, 2023; Lu et al. 2024; Lin et al. 2023; L. Han, Chen, et al. 2024; Lin, Lin, Hu, et al. 2024; Luo and Wang 2024; Hu et al. 2024; Shang et al. 2024). Consequently, we consider them representative of the current state-of-the-art in point LTSF. Nonetheless, the distinction between IMS and DMS strategies has been largely overlooked, with DMS decoding often adopted by default. Moreover, DMS forecasting can underperform in certain settings, which has not been sufficiently investigated in prior work. To address this gap, we empirically examine when and why DMS may fall short, using multi-world examples to highlight the conditions under which IMS offers advantages. Furthermore, while SOTA point LTSF models are highly effective at predicting the conditional mean (Yuxin Li et al. 2023), many real-world scenarios require a more nuanced understanding of uncertainty, making probabilistic forecasts preferable. Hence, the next section reviews existing probabilistic models proposed for time series forecasting.