Long-Term Time Series Forecasting
July 14, 2025
Introduction to LTSF
TSF has a long and extensive literature history. However, this work will primarily focus on recent developments in (long-term) TSF. For a comprehensive overview of earlier research and traditional TSF methods, readers are referred to existing surveys (Box 2013; Box et al. 2015; De Gooijer and Hyndman 2006; Mahalakshmi et al. 2016; Hamilton 1994). Some traditional statistical time series forecasting models, such as ARIMA (Box and Pierce 1970) or Prophet (Taylor and Letham 2018), remain popular to this day (Long et al. 2023; Ning et al. 2022; Albahli 2025). However, they are often fit separately to each time series, come with many prior assumptions, and their performance may deteriorate for long-range forecasting, making them unsuitable for large-scale TSF tasks (Qin et al. 2017; Li et al. 2019). Therefore, similar to other domains, TSF research has shown increasing interest in deep learning based approaches (Benidis et al. 2022; Hewamalage et al. 2021; Lara-Benítez et al. 2021).
At first, primarily recurrent neural networks (RNNs), which are specifically designed for sequential data, were adopted in the form of sequence-to-sequence architectures (Sutskever et al. 2014). Many SOTA results on TSF tasks with shorter forecasting horizons stem from models of this architecture, e.g. TimeGrad (Rasul et al. 2021), DA-RNN (Qin et al. 2017) or DeepAR (Salinas et al. 2020). In contrast, convolutional neural networks (CNNs), which are designed for tasks where the input data has a known sequential or spatial structure, such as images or audio signals (Dosovitskiy et al. 2021; Van den Oord et al. 2016) but also time series (Benidis et al. 2022; Goodfellow et al. 2016), began to demonstrate superior performance over RNNs in various sequence modeling tasks, e.g. audio generation or machine translation (Van den Oord et al. 2016; Kalchbrenner et al. 2017). Motivated by these results, Bai et al. (2018) conducted a comprehensive comparison between CNNs and RNNs across a diverse set of sequential learning benchmarks. Their findings showed that a simple convolutional architecture, the Temporal Convolutional Network (TCN), consistently outperformed RNN-based models while also benefiting from longer effective memory. These promising results spurred increased interest in applying CNNs to time series forecasting as well. For instance, Borovykh et al. (2018) adapted the autoregressive WaveNet CNN architecture (Van den Oord et al. 2016), originally developed for raw audio synthesis, to the TSF domain and demonstrated superior performance over LSTM-based models. DeepGLO (Sen et al. 2019) combines a matrix factorization model with a TCN and outperforms traditional and RNN-based methods. However, despite their promising empirical performance, CNNs have not emerged as a definitive replacement for RNNs. Instead, the two architectures were generally viewed as complementary, with approximation theory (Jiang et al. 
2021) supporting the idea that each brings distinct strengths to time series modeling. Therefore, hybrid models like LSTNet (Lai et al. 2018) or DCRNN (Li et al. 2018) gained popularity by combining CNNs and RNNs, effectively capturing both short-term dependencies and inter-series correlations through CNNs, while leveraging RNNs for modeling longer-term temporal trends. Nevertheless, both RNNs and CNNs exhibit inherent limitations when it comes to longer forecasting horizons. The main limitation of RNNs is their long information propagation path, which directly leads to numerous issues. In particular, RNNs struggle to capture long-term dependencies and suffer from inefficient sequential computation (Jia et al. 2024). Furthermore, although RNN cells such as LSTM (Hochreiter and Schmidhuber 1997) or GRU (Cho et al. 2014) were designed to tackle vanishing and exploding gradients (Bengio et al. 1994), these problems often could not be mitigated sufficiently for longer input sequences, leading to an unstable training process (Zhou et al. 2021). On the other hand, CNNs are limited by their local receptive fields; while some argue that they offer better long-term memory than RNNs (Bai et al. 2018), their 1D convolutions can only model variations in adjacent time steps (Wu et al. 2022). Therefore, compared to models with global receptive fields, e.g. Transformers (Vaswani et al. 2017) or MLP-based architectures (Zeng et al. 2023), CNNs often fall short in handling the complexity of long-term temporal dependencies (Donghao and Xue 2023). Altogether, these limitations are critical in TSF tasks, which often require models to capture both short- and long-term repeating patterns (Lai et al. 2018). In the context of long-term TSF, the importance of modeling long-range dependencies becomes even more pronounced, as they tend to be more dispersed and harder to learn (Li et al. 2019).
In response to these challenges, Transformer-based models (Vaswani et al. 2017) were proposed as a promising alternative (Zhou et al. 2021; Li et al. 2019), offering a self-attention mechanism that allows the model to access the entire input sequence at once, facilitating parallel processing and enabling global context understanding. Furthermore, Transformers have displayed state-of-the-art performance in capturing long-range dependency structures (Wen et al. 2023) and are SOTA across various domains, e.g. natural language processing (Brown et al. 2020), speech (Kim et al. 2022) and computer vision (Dosovitskiy et al. 2021). LogSparse, proposed by Li et al. (2019), was among the first Transformer-based methods applied to TSF. It demonstrated superior performance in modeling long-term dependencies compared to DeepAR and statistical models. Although Li et al. (2019) extended the forecasting horizon relative to earlier work, the input and output sequences they considered were still short compared to modern LTSF tasks. A breakthrough came when Zhou et al. (2021) introduced Informer and formalized the modern LTSF problem setting by substantially extending input and prediction horizons. Informer managed to outperform prior SOTA models including LogSparse, DeepAR, other RNN-based and statistical baselines in LTSF. A key innovation of Informer came with its switch to a direct multi-step (DMS) strategy (Zeng et al. 2023), which contrasts with the iterated multi-step (IMS) approach used in earlier methods. Models that follow an IMS strategy are prone to slow inference and error accumulation, issues that become particularly problematic with longer forecast lengths (Zhou et al. 2021).
In succession, the DMS strategy was successfully adopted by most SOTA LTSF models, see Table [tab:ltsf]. However, DMS forecasting is not novel. In fact, the first occurrence of a DMS prediction model can be dated back to Cox (1961). Over the years, several theoretical and empirical studies have shown that the direct strategy performs better when models are misspecified, i.e. the model class does not contain the true model, while the recursive approach tends to be superior for well-specified models (Weiss 1991; Tiao and Tsay 1994; Ing 2007; Chevillon and Hendry 2005). In summary, Chevillon (2007) showed that DMS is less biased, more stable, more efficient and more robust to model misspecification. Later on, Taieb and Atiya (2016) investigated different multi-step strategies with NNs in TSF and concluded that IMS is preferable for short-term forecasts when the model is likely well-specified, whereas DMS is better suited for long time series or situations where minimizing bias is crucial. Despite these findings, the IMS strategy remained more popular at the time, partly because it closely mirrors well-studied autoregressive and Markovian modeling assumptions while also benefiting from shorter forecasting horizons (Wen et al. 2018). Moreover, DMS was regarded as costly, since, without cross-learning, it required training separate models for each horizon step (Bontempi et al. 2013). However, this drawback became negligible with newer architectures efficiently sharing parameters across time steps, for example only requiring small changes in the prediction head while enabling faster prediction speeds (Zhou et al. 2021). Prior to Informer, other deep learning models also adopted DMS strategies. For instance, MQ-RNN and MQ-CNN (Wen et al. 2018) use shared-parameter decoders at each time step to produce forecasts. Building on MQ-CNN, Wen and Torkkola (2019) added a generative quantile copula, improving forecast quality. NBeats (Oreshkin et al. 
2019) is built on a deep residual stack of MLPs, whereas DeepTCN (Chen et al. 2019) is a CNN-based DMS approach. Nonetheless, the DMS strategy has important drawbacks: it treats the forecasted points as independent, overlooking their mutual dependencies (Kline 2004; Bontempi et al. 2013), and it must be retrained whenever the forecast horizon is extended.
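To make the IMS/DMS contrast concrete, the following minimal sketch contrasts recursive one-step forecasting against a single direct map to all horizon steps. The toy series and the plain least-squares linear models are illustrative assumptions, not any specific published method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy series: a noisy daily cycle.
t = np.arange(400)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

L, H = 48, 12  # look-back window and forecast horizon

def make_windows(x, lookback, horizon):
    """Slice a 1-D series into (input window, target window) pairs."""
    X, Y = [], []
    for i in range(len(x) - lookback - horizon + 1):
        X.append(x[i:i + lookback])
        Y.append(x[i + lookback:i + lookback + horizon])
    return np.array(X), np.array(Y)

X, Y = make_windows(series, L, H)

# IMS: fit a one-step-ahead model, then roll it forward recursively,
# feeding each prediction back into the input window.
w_ims, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)

def forecast_ims(window):
    window = window.copy()
    preds = []
    for _ in range(H):
        y_hat = window @ w_ims
        preds.append(y_hat)
        window = np.roll(window, -1)
        window[-1] = y_hat  # errors from here on can accumulate
    return np.array(preds)

# DMS: one direct linear map from the window to all H future steps at once.
W_dms, *_ = np.linalg.lstsq(X, Y, rcond=None)

def forecast_dms(window):
    return window @ W_dms

last_window = series[-L:]
print(forecast_ims(last_window).shape, forecast_dms(last_window).shape)
```

Note how the IMS forecaster feeds its own predictions back into the window, which is exactly where errors accumulate over long horizons, while the DMS map produces all \(H\) steps in one shot at the cost of treating them as independent outputs.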
The breakthrough of Informer led to a rising adoption of LTSF models, specifically Transformer models. However, despite their advantages, the memory and time complexity of self-attention in Transformers grows quadratically \(O(L^2)\) with the input length \(L\), a major bottleneck for the long input sequences present in LTSF (Zhou et al. 2021). Hence, many of the first Transformer-based models for LTSF focused on improving the efficiency of the attention module; Wen et al. (2023) classify these approaches into two branches. On the one hand, models such as LogSparse (Li et al. 2019) or Pyraformer (S. Liu et al. 2021) enforce a sparsity bias in the attention module. On the other hand, Informer (Zhou et al. 2021) or FEDformer (Zhou et al. 2022) exploit low-rank properties of the self-attention matrix. In their respective LTSF studies, each model manages to outperform previous traditional and RNN-based SOTA methods, such as ARIMA, Prophet or DeepAR, on a variety of LTSF data sets (Zhou et al. 2021; Wu et al. 2021; S. Liu et al. 2021; Li et al. 2019; Zhou et al. 2022). Despite these performances, Zeng et al. (2023) point out that these models were evaluated solely against IMS approaches and suggest that the observed improvement is primarily due to the adoption of the DMS strategy rather than the Transformer architecture itself. To investigate this, Zeng et al. (2023) introduce DLinear and NLinear, two simple linear DMS models, which were able to outperform the Transformer-based methods on multiple benchmarks, thus challenging the effectiveness of Transformers on LTSF tasks. An important aspect of DLinear and NLinear is that they are channel-independent (CI) methods; they therefore avoid modeling potentially misleading cross-channel dependencies (Nie et al. 2022). In contrast, many previous methods (Zhou et al. 2021; Wu et al. 2021; Zhou et al. 
2022) tried to incorporate information from all channels via a channel-dependent (CD) strategy, but this approach appeared to be ineffective in comparison. Building on the success of Zeng et al. (2023) with the CI strategy, many LTSF models adopted it successfully, see Table [tab:ltsf]. Furthermore, L. Han, Ye, et al. (2024) investigate the relation between CI and CD methods in more depth. By comparing a linear CI model to its CD counterpart, they propose that the CI approach exhibits less distribution shift, because the sum of correlation differences between train and test data has lower variation than the correlation differences of individual channels. Subsequently, L. Han, Ye, et al. (2024) propose that CD methods have high capacity and low robustness, whereas CI approaches have low capacity and high robustness. They conclude that robustness is often more important for real-world non-stationary time series with distribution shifts; therefore, CI methods often perform better.
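The capacity/robustness trade-off between CD and CI can be illustrated with two linear forecasters on synthetic data. The shapes and data below are hypothetical; the sketch only mirrors the linear CI-versus-CD comparison in spirit:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, H, C = 256, 36, 12, 3  # samples, look-back, horizon, channels

# Hypothetical multivariate windows and targets.
X = rng.standard_normal((N, L, C))
Y = rng.standard_normal((N, H, C))

# CD: one joint linear map over the flattened (L * C) input predicts all
# channels together, so cross-channel weights are modeled explicitly.
W_cd, *_ = np.linalg.lstsq(X.reshape(N, L * C), Y.reshape(N, H * C),
                           rcond=None)

# CI: a single per-channel map shared across all channels; each channel
# is forecast from its own history only.
X_ci = X.transpose(0, 2, 1).reshape(N * C, L)  # channels become samples
Y_ci = Y.transpose(0, 2, 1).reshape(N * C, H)
W_ci, *_ = np.linalg.lstsq(X_ci, Y_ci, rcond=None)

# The CD map has C^2 times more parameters: high capacity, low robustness.
print(W_cd.shape, W_ci.shape)  # (108, 36) vs (36, 12)
```

The CD weight matrix grows quadratically with the channel count, which is precisely the extra capacity that can overfit spurious cross-channel correlations under distribution shift.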
Since the work of Zeng et al. (2023) challenged the effectiveness of Transformers in LTSF, the door was opened for other architectures to regain ground. In what follows, some important recent LTSF models are briefly described; for a detailed categorization of these models, see Table [tab:ltsf].
MLP architectures.
The success of DLinear (Zeng et al. 2023) revived interest in pure MLP architectures for LTSF. At the same time, the computer vision community saw the rise of MLP-Mixer models (Tolstikhin et al. 2021; H. Liu et al. 2021; Touvron et al. 2023), which use simple MLPs to mix information within and across image input patches, achieving competitive results without relying on convolutions or self-attention. Building on this, TSMixer (Ekambaram et al. 2023) adapts the Mixer architecture for LTSF, leveraging its natural fit for sequential data, as input order is preserved. TSMixer uses a patch-based MLP backbone enhanced with online reconciliation heads that capture hierarchical structure and cross-channel dependencies. Following TSMixer, several studies extended the idea to address specific challenges in time series modeling. TimeMixer (S. Wang et al. 2023) leverages multi-scale mixing, separating finer seasonal patterns from coarser trends through novel mixing blocks. U-Mixer (Ma et al. 2024) tackles the issue of non-stationarity by arranging MLP encoder-decoder blocks in a U-Net structure (Ronneberger et al. 2015) while also introducing a stationarity correction mechanism. Furthermore, HDMixer (Huang, Shen, et al. 2024) improves on fixed-size patching via length-extendable patching while also modeling hierarchical short- and long-range dynamics. Beyond Mixer-based architectures, a range of MLP-centric models have emerged that take alternative approaches to enhancing time series forecasting performance. NHITS (Challu et al. 2023) extends NBEATS (Oreshkin et al. 2019) by introducing hierarchical interpolation and multi-rate sampling to sequentially assemble forecasts across multiple temporal resolutions. FreTS (Yi et al. 2023) operates entirely in the frequency domain, using MLPs to learn the real and imaginary components of transformed series. CycleNet (Lin, Lin, Hu, et al. 
2024) leverages residual cycle forecasting to explicitly model periodic components. SOFTS (L. Han, Chen, et al. 2024) proposes a centralized STAR module to model inter-channel relationships more efficiently than attention mechanisms. Finally, TiDE (Das et al. 2023) employs a simple MLP-based encoder-decoder framework that combines the speed of linear models with the ability to capture nonlinear dependencies.
Transformers.
Despite the success of MLP-based approaches, Transformers remained the popular choice for LTSF tasks, see Table [tab:ltsf]. One reason for this was the introduction of PatchTST (Nie et al. 2022), which marked a turning point for Transformer-based models in time series forecasting. It adopts the CI strategy of DLinear (Zeng et al. 2023) while also introducing patching to TSF. Patching, inspired by Vision Transformers (Dosovitskiy et al. 2021), segments a time series into subseries-level patches. It allows the model to capture local semantic patterns, reduce attention complexity, and extend its receptive field, significantly boosting long-term forecasting accuracy (Nie et al. 2022). As a result, patching has since become a standard practice in time series Transformers, widely adopted in models like Crossformer (Zhang and Yan 2022), MCFormer (W. Han et al. 2024) and Pathformer (P. Chen et al. 2023). In addition to its success in Transformer-based models, patching has been adopted across other architectural families, including MLPs (S.-A. Chen et al. 2023), CNNs (Gong et al. 2024), and RNNs (Lin et al. 2023). However, the dominance of classic fixed-length patching has recently been challenged. The MLP-based HDMixer (Huang, Shen, et al. 2024) critiques the inflexibility of fixed-length patches, which can lead to information loss at the patch boundaries. It proposes length-extendable patches to better preserve local structure. In addition, DeformableTST (Luo and Wang 2024) highlights that modern Transformers have become overly reliant on patching to achieve strong performance, which limits their applicability in scenarios with short input sequences or tasks unsuited to patching. To address this, DeformableTST introduces deformable attention, a data-driven sparse attention mechanism capable of focusing on important time points without explicit patching, allowing the model to generalize across a broader range of forecasting tasks. 
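As a minimal illustration of patching (with hypothetical window and patch sizes, not the exact PatchTST configuration):

```python
import numpy as np

def patchify(x, patch_len, stride):
    """Segment a 1-D series into overlapping subseries-level patches."""
    n_patches = (len(x) - patch_len) // stride + 1
    return np.stack([x[i * stride:i * stride + patch_len]
                     for i in range(n_patches)])

x = np.arange(336, dtype=float)  # hypothetical look-back window
patches = patchify(x, patch_len=16, stride=8)
print(patches.shape)  # (41, 16): 41 patch tokens instead of 336 time steps
```

With these sizes, attention would operate over 41 patch tokens rather than 336 individual time steps, shrinking the quadratic attention cost accordingly while each token carries local subseries semantics.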
Lastly, several works have sought alternatives to patching through other input transformations. Fredformer (Piao et al. 2024) applies a Discrete Fourier Transform to overcome frequency bias in attention, enabling more balanced learning across frequency bands. iTransformer (Liu, Hu, et al. 2023) takes a different route by inverting the input dimensions, treating time points as tokens and leveraging attention to capture multivariate correlations, improving scalability and performance without altering the Transformer’s core components.
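The inversion idea itself amounts to swapping which input dimension is tokenized; a minimal sketch with hypothetical shapes:

```python
import numpy as np

batch, L, C = 2, 96, 7
x = np.random.randn(batch, L, C)  # hypothetical multivariate input

# Standard embedding: L temporal tokens, each containing the C variates
# observed at one time step.
temporal_tokens = x                    # (batch, 96, 7)

# Inverted embedding: each variate's entire history becomes one token,
# so attention captures multivariate correlations instead of time mixing.
variate_tokens = x.transpose(0, 2, 1)  # (batch, 7, 96)
print(variate_tokens.shape)
```

Since attention then runs over \(C\) variate tokens rather than \(L\) time steps, its cost scales with the channel count instead of the sequence length, which is what improves scalability for long look-back windows.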
Similar to patching, the standard Transformer encoder (Vaswani et al. 2017) has become the default modeling choice for Transformer-based time series models. In many cases, the decoder is simply replaced with a basic flatten-and-linear head, e.g. in MCFormer (W. Han et al. 2024), PatchTST (Nie et al. 2022), iTransformer (Liu, Hu, et al. 2023) and Fredformer (Piao et al. 2024). On top of that, many models make targeted replacements in the vanilla Transformer encoder, where it is common to modify the attention mechanism: Triformer (Cirstea et al. 2022) reduces complexity via triangular patch attention, SDformer (Z. Zhou et al. 2024) enhances expressiveness with spectral filtering and dynamic directional attention, SCAT (C. Zhou et al. 2024) introduces alternating attention using spectral clustering centers and CARD (X. Wang et al. 2023) aligns attention across channels to better model inter-channel dependencies. Similarly, CATS (Lu et al. 2024) removes self-attention altogether, opting for a cross-attention-only framework. To better capture long-range dependencies, Kang et al. (2024) introduce spectral attention, a frequency-based mechanism that preserves temporal patterns and improves gradient flow. Outside of encoder-only Transformer models, several alternative architectures have been explored as well. SMARTformer (Yiduo Li et al. 2023) adopts a full encoder-decoder Transformer architecture, but deviates from the standard non-autoregressive decoder commonly used in time series models. Crossformer (Zhang and Yan 2022) also uses an encoder-decoder architecture, but places special emphasis on modeling cross-dimension dependencies. To this end, it proposes a Two-Stage Attention mechanism within a hierarchical encoder-decoder structure that separately captures temporal and inter-variable correlations. FPPformer (Shen et al. 2024) likewise retains the encoder-decoder setup but focuses on redesigning the decoder. 
It introduces a top-down decoder architecture, inspired by feature pyramid networks in computer vision (Lin et al. 2017), and enhances it with a combination of elementwise and patchwise attention to improve multiscale sequence reconstruction.
CNNs.
While Transformer- and MLP-based models have rapidly gained traction and become dominant in time series analysis, convolutional approaches have been falling out of favor (Donghao and Xue 2023). Nevertheless, several recent studies have achieved SOTA performance in LTSF using CNN-based models, renewing interest in convolutional methods. MICN (Wang et al. 2022) introduces a multi-scale convolutional architecture that captures both local features and global correlations, enabling separate modeling of trend and seasonality in time series forecasting. TimesNet (Wu et al. 2022) leverages the Fast Fourier Transform (FFT) to identify periodic patterns in time series data, which it then restructures into 2D tensors. Its core component, the TimesBlock, is built on a convolutional inception block (Szegedy et al. 2015), enabling it to effectively model both inter-period and intra-period variations. PatchMixer (Gong et al. 2024) and ModernTCN (Donghao and Xue 2023) process time series in patches (Nie et al. 2022) and then utilize depthwise separable convolutions to achieve SOTA performance with faster training and inference speeds. Moreover, ModernTCN redesigns the convolution block to better suit time series, resulting in larger effective receptive fields.
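The period-discovery and folding step used by TimesNet can be sketched as follows (on a synthetic series; the actual model additionally embeds the folded tensor and processes it with inception-style convolutions):

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.arange(480)
# Synthetic series with a dominant period of 24 steps plus noise.
x = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)

# Pick the frequency with the largest FFT amplitude and fold the series
# into a (n_periods, period) 2D tensor, mirroring TimesNet's reshaping.
amps = np.abs(np.fft.rfft(x))
amps[0] = 0.0                  # ignore the zero-frequency (mean) component
k = int(np.argmax(amps))       # dominant frequency index
period = len(x) // k
folded = x[:(len(x) // period) * period].reshape(-1, period)
print(period, folded.shape)    # 24 (20, 24)
```

In the 2D layout, each row is one full cycle, so a 2D convolution over `folded` can mix neighboring time steps within a period (intra-period) and the same phase across periods (inter-period) simultaneously.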
RNNs.
Despite their limitations and generally subpar performance in LTSF, RNNs occasionally resurfaced in LTSF research. Lin et al. (2023) identify the large number of recurrent iterations as a primary drawback of traditional RNNs. To address this, they propose SegRNN, which adopts a patching mechanism to reduce the number of recurrent steps when processing input time series. In addition, they employ a DMS strategy for prediction, incorporating positional embeddings, as in Vaswani et al. (2017), which are combined with the last hidden state and then passed into a GRU cell with shared parameters. Jia et al. (2023) introduced WITRAN, which operates on a rearranged 2D time series, i.e. a matrix of patches inspired by Wu et al. (2022). They then propose a novel RNN cell alongside a recurrent acceleration network, which processes the data points of the matrix vertically and horizontally, enabling parallel computation. Lastly, they decode the processed information with an MLP in a DMS fashion. Similarly, Jia et al. (2024) introduce TPGN, a dual-branch model that also uses a 2D representation to capture long- and short-term patterns. At its core is the Parallel Gated Network, which replaces the sequential structure of RNNs with a layer that aggregates information from previous time steps in parallel, reducing the propagation path to \(O(1)\).
Other model types.
Beyond common model archetypes, LTSF has recently seen novel architectures inspired by other domains. LLM-based models like LeRet (Huang, Zhou, et al. 2024), AutoTimes (Liu et al. 2024) or Time-LLM (Jin et al. 2023) leverage pre-trained language models by aligning time series with token-based representations, enabling few-shot and in-context forecasting. Graph-based models such as Ada-MSHyper (Shang et al. 2024), CrossGNN (Huang et al. 2023) and MSGNet (Cai et al. 2024) introduce graph structures to better capture multi-scale or inter-series correlations. Lastly, dynamical system-based approaches like Koopa (Liu, Li, et al. 2023) and Attraos (Hu et al. 2024) leverage Koopman embeddings to linearize complex dynamics or draw on chaos theory, respectively.
In addition to exploring different backbone NN architectures, studies have also examined the impact of other design choices.
Sparse models.
Despite the success of Zeng et al. (2023) with simple linear models, many of the previously discussed methods rely on significantly larger architectures with a large number of parameters. To counter the trend toward increasingly large models, some methods focus on more efficient, sparser models that often implement only one or a few linear layers. For instance, FITS (Xu et al. 2023), LightTS (Zhang et al. 2022), SSCNN (Deng et al. 2024), Attraos (Hu et al. 2024) and SparseTSF (Lin, Lin, Wu, et al. 2024) achieve performances comparable to SOTA methods while being several orders of magnitude smaller, resulting in faster training and inference speeds as well as a smaller memory footprint. These models first simplify the forecasting task by downsampling (Lin, Lin, Wu, et al. 2024; Zhang et al. 2022), by decomposition (Deng et al. 2024) or by operating in the frequency domain via FFT (Xu et al. 2023) or via phase space reconstruction (Hu et al. 2024). Then, they process the condensed representation with a smaller model, often containing only a single (non-)linear layer.
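A minimal sketch of this downsample-then-share idea, loosely in the spirit of SparseTSF's cross-period forecasting (the shapes and the zero-initialized weight are illustrative assumptions; in practice the layer would be trained):

```python
import numpy as np

L, H, period = 96, 24, 4  # look-back, horizon, assumed downsampling period

# One tiny linear layer shared by all interleaved subsequences; the zero
# initialization is purely illustrative.
W = np.zeros((L // period, H // period))

def forecast_sparse(x):
    sub = x.reshape(-1, period).T  # (period, L // period) subsequences
    out = sub @ W                  # forecast each subsequence with shared W
    return out.T.reshape(-1)       # re-interleave to a length-H forecast

x = np.random.randn(L)
print(forecast_sparse(x).shape, W.size)  # (24,) 144
```

A dense linear map from 96 inputs to 24 outputs would need 2,304 weights; sharing one small map across the interleaved subsequences uses only 144, which is where the order-of-magnitude parameter savings come from.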
Channel dependence.
The success of DLinear (Zeng et al. 2023) and PatchTST (Nie et al. 2022) with the CI strategy led to many subsequent CI models such as Pathformer (P. Chen et al. 2023), CATS (Kim et al. 2024) and DeformableTST (Luo and Wang 2024). However, growing interest in leveraging inter-series correlations led to a resurgence of CD methods, which can be broadly categorized by their mechanism for capturing cross-channel interactions. A large subset utilizes cross-channel attention, with models like Crossformer (Zhang and Yan 2022), CARD (X. Wang et al. 2023), Client (Gao et al. 2023) and MCformer (W. Han et al. 2024) incorporating attention modules to jointly model temporal and inter-channel dependencies. Another line of work applies spectral or frequency-based modeling, such as SDformer (Z. Zhou et al. 2024), Fredformer (Piao et al. 2024) and FreTS (Yi et al. 2023), which leverage frequency-domain representations to capture global dependencies and improve channel interaction modeling. Meanwhile, MLP-Mixer-based architectures offer an alternative to attention-heavy designs. For instance, TSMixer (Ekambaram et al. 2023) introduces hybrid channel modeling, while SOFTS (L. Han, Chen, et al. 2024) similarly proposes a centralized STAR module to fuse global and intra-channel representations. On another note, models like TimeMixer (S. Wang et al. 2023), ModernTCN (Donghao and Xue 2023) and MICN (Wang et al. 2022) explore multi-scale decomposition and convolutional modeling to disentangle and aggregate information across variables and temporal resolutions. Lastly, CrossGNN (Huang et al. 2023) applies graph-based modules to model cross-variable structure.
DMS dominance.
Although Li et al. (2019) were among the first to apply Transformers to LTSF in an IMS setting, nearly all major recent LTSF models adopt a DMS forecasting strategy, see Table [tab:ltsf]. This trend can be traced back to Informer (Zhou et al. 2021), which popularized non-autoregressive decoding to mitigate the error accumulation that IMS methods suffer in long-range prediction, as shown mathematically by Sun and Boning (2022). Even recurrent architectures, which are closely related to IMS forecasting, have adopted a DMS strategy for LTSF (Lin et al. 2023; Jia et al. 2023). Two recent works stand out as rare exceptions that reintroduce autoregressive principles into LTSF. SMARTformer (Yiduo Li et al. 2023) proposes a semi-autoregressive (SAR) decoding approach consisting of two key components: a segment autoregressive layer that generates the forecast iteratively in segments, and a non-autoregressive refining layer that globally refines the output in a DMS manner. This hybrid structure captures both local and global temporal patterns. Empirical results show that SMARTformer achieves consistent improvements in both univariate and multivariate forecasting tasks, while an ablation study highlights that other SOTA LTSF methods also benefit from a SAR decoder. On the other hand, AutoTimes (Liu et al. 2024) leverages the autoregressive nature of LLMs to forecast time series through token-wise next-step prediction. However, its main novelty lies in repurposing decoder-only LLMs for time series.
[tab:ltsf]
| Model | Venue | IMS/DMS | Backbone | CI/CD |
|---|---|---|---|---|
| LogSparse (Li et al. 2019) | NeurIPS’19 | IMS | Transformer (D) | CD |
| Autoformer (Wu et al. 2021) | NeurIPS’21 | DMS | Transformer (E-D) | CD |
| Informer (Zhou et al. 2021) | AAAI’21 | DMS | Transformer (E-D) | CD |
| Triformer (Cirstea et al. 2022) | IJCAI’22 | DMS | Transformer (E) | CD |
| LightTS (Zhang et al. 2022) | - | DMS | MLP | CD |
| Koopa (Liu, Li, et al. 2023) | NeurIPS’23 | DMS | Koopman Theory (Koopman 1931) | CD |
| CrossGNN (Huang et al. 2023) | NeurIPS’23 | DMS | GNN | CD |
| WITRAN (Jia et al. 2023) | NeurIPS’23 | DMS | RNN | CI |
| FreTS (Yi et al. 2023) | NeurIPS’23 | DMS | MLP | CD |
| MICN (Wang et al. 2022) | ICLR’23 | DMS | CNN | CD |
| TimesNet (Wu et al. 2022) | ICLR’23 | DMS | CNN | CD |
| Crossformer (Zhang and Yan 2022) | ICLR’23 | DMS | Transformer (E-D) | CD |
| PatchTST (Nie et al. 2022) | ICLR’23 | DMS | Transformer (E) | CI |
| DLinear (Zeng et al. 2023) | AAAI’23 | DMS | MLP | CI |
| NHITS (Challu et al. 2023) | AAAI’23 | DMS | MLP | CD |
| SMARTformer (Yiduo Li et al. 2023) | IJCAI’23 | IMS & DMS | Transformer (E-D) | CD |
| TSMixer (Ekambaram et al. 2023) | KDD’23 | DMS | MLP | CI/CD |
| TiDE (Das et al. 2023) | TMLR’23 | DMS | MLP | CI |
| SegRNN (Lin et al. 2023) | - | DMS | RNN | CI |
| Client (Gao et al. 2023) | - | DMS | Transformer (E) | CD |
| Attraos (Hu et al. 2024) | NeurIPS’24 | DMS | Chaos Theory (Devaney 2018) | CI |
| Ada-MSHyper (Shang et al. 2024) | NeurIPS’24 | DMS | HGNN (Feng et al. 2019) | CI |
| SSCNN (Deng et al. 2024) | NeurIPS’24 | DMS | CNN & Decomposition | CI |
| SOFTS (L. Han, Chen, et al. 2024) | NeurIPS’24 | DMS | MLP | CD |
| CycleNet (Lin, Lin, Hu, et al. 2024) | NeurIPS’24 | DMS | MLP | CI |
| CATS (Kim et al. 2024) | NeurIPS’24 | DMS | Transformer (E) | CI |
| DeformableTST (Luo and Wang 2024) | NeurIPS’24 | DMS | Transformer (E) | CI |
| TPGN (Jia et al. 2024) | NeurIPS’24 | DMS | RNN | CI |
| AutoTimes (Liu et al. 2024) | NeurIPS’24 | IMS | LLM (D) | CI |
| SparseTSF (Lin, Lin, Wu, et al. 2024) | ICML’24 | DMS | MLP | CI |
| SAMformer (Ilbert et al. 2024) | ICML’24 | DMS | Transformer (E) | CD |
| TimeMixer (S. Wang et al. 2023) | ICLR’24 | DMS | MLP | CD |
| Pathformer (P. Chen et al. 2023) | ICLR’24 | DMS | Transformer (E) | CI |
| Time-LLM (Jin et al. 2023) | ICLR’24 | DMS | LLM | CI |
| iTransformer (Liu, Hu, et al. 2023) | ICLR’24 | DMS | Transformer (E) | CD |
| FITS (Xu et al. 2023) | ICLR’24 | DMS | MLP | CI |
| CARD (X. Wang et al. 2023) | ICLR’24 | DMS | Transformer (E) | CD |
| ModernTCN (Donghao and Xue 2023) | ICLR’24 | DMS | CNN | CD |
| MSGNet (Cai et al. 2024) | AAAI’24 | DMS | GNN | CD |
| U-Mixer (Ma et al. 2024) | AAAI’24 | DMS | MLP | CD |
| HDMixer (Huang, Shen, et al. 2024) | AAAI’24 | DMS | MLP | CD |
| LeRet (Huang, Zhou, et al. 2024) | IJCAI’24 | DMS | LLM + Retentive Net | CI |
| PatchMixer (Gong et al. 2024) | IJCAI’24 | DMS | CNN | CI |
| SDformer (Z. Zhou et al. 2024) | IJCAI’24 | DMS | Transformer (E) | CD |
| SCAT (C. Zhou et al. 2024) | IJCAI’24 | DMS | Transformer (E) | CI |
| Fredformer (Piao et al. 2024) | KDD’24 | DMS | Transformer (E) | CD |
| MCformer (W. Han et al. 2024) | IoT-J’24 | DMS | Transformer (E) | CD |
In summary, the literature on point LTSF is extensive, with numerous methods achieving strong performance, making it challenging to determine a definitive state of the art. However, certain models, namely DLinear, PatchTST and iTransformer (Zeng et al. 2023; Nie et al. 2022; Liu, Hu, et al. 2023), have emerged as de facto standards for comparison, frequently adopted as baselines in a wide range of recent works (Jia et al. 2024, 2023; Lu et al. 2024; Lin et al. 2023; L. Han, Chen, et al. 2024; Lin, Lin, Hu, et al. 2024; Luo and Wang 2024; Hu et al. 2024; Shang et al. 2024). Consequently, we consider them representative of the current state-of-the-art in point LTSF. Nonetheless, the distinction between IMS and DMS strategies has been largely overlooked, with DMS decoding often adopted by default. Moreover, DMS forecasting can underperform in certain settings, which has not been sufficiently investigated in prior work. To address this gap, we empirically examine when and why DMS may fall short, using multi-world examples to highlight the conditions under which IMS offers advantages. Furthermore, while SOTA point LTSF models are highly effective at predicting the conditional mean (Yuxin Li et al. 2023), many real-world scenarios require a more nuanced understanding of uncertainty, making probabilistic forecasts preferable. Hence, the next section reviews existing probabilistic models proposed for time series forecasting.