Multivariate Time Series Forecasting with Auto-Regressive Distributed Lag Models

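The figure below shows a monthly multivariate time series of sales for different types of wine. Each wine type is one variable in the time series.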

Suppose you want to forecast one of these variables, say sparkling wine. How would you build a model to do that?

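A common approach is to treat that variable as a univariate time series. Many methods are available for modeling such series, for example ARIMA, exponential smoothing, or Facebook's Prophet. Auto-regressive machine learning methods can be used as well.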

However, the other variables may contain important clues about the future sales of sparkling wine. Take a look at the correlation matrix below.

The sales of sparkling wine (second row) are fairly correlated with the sales of the other wines, so including these variables in the model may be a good idea.

This article shows how to do this with an approach called Auto-Regressive Distributed Lag (ARDL).

Auto-Regressive Distributed Lag

ARDL models rely on auto-regression, which is the basis of most univariate time series models. It works in two main steps.

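First, the (univariate) time series is transformed from a sequence of values into a matrix. This can be done with a method called time delay embedding. Despite the fancy name, the method is simple: each value is modeled based on the most recent values that precede it. Second, a regression model is built on that matrix. The future values are the target variable, and the explanatory variables are the most recent past values.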
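The two steps above can be sketched as follows. This is only a minimal illustration on a synthetic series, not the helper used later in the article: the function name `embed_series` and the choice of `LinearRegression` are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def embed_series(values, n_lags):
    """Hypothetical helper: turn a 1-D series into (X, y), where each row of X
    holds the n_lags values that precede the matching entry of y (oldest first)."""
    values = np.asarray(values, dtype=float)
    # every window of length n_lags + 1: n_lags inputs followed by one target
    windows = np.lib.stride_tricks.sliding_window_view(values, n_lags + 1)
    return windows[:, :-1], windows[:, -1]


# synthetic series (assumption): a straight line, so the next value is easy to predict
series = np.arange(20, dtype=float)

# step 1: sequence of values -> matrix
X_demo, y_demo = embed_series(series, n_lags=3)

# step 2: regression of each value on its recent past
ar_model = LinearRegression().fit(X_demo, y_demo)
next_value = ar_model.predict(series[-3:].reshape(1, -1))[0]
```

Each row of `X_demo` is a short window of past values and the matching entry of `y_demo` is the value that follows it, which is all an auto-regressive model needs.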

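The idea is similar for multivariate time series: we add the past values of the other variables to the explanatory variables. This is what is known as the auto-regressive distributed lag method. "Distributed lag" refers to the use of lags of the additional variables.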

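Putting the two together: the future value of a variable in the time series depends on its own lagged values and on the lagged values of the other variables.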
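To make this concrete, here is a small sketch on simulated data. Everything in it is an assumption for illustration: the two variables `y` and `x`, the coefficients used to simulate them, and the use of a linear model. It builds lags of the target (the auto-regressive part) and lags of the extra variable (the distributed-lag part) and fits one regression on both.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 2000

# simulated data (assumption): y depends on its own lag and on a lag of x
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 2.0 * x[t - 1] + rng.normal(scale=0.5)

df = pd.DataFrame({"y": y, "x": x})

# explanatory variables: lags of the target (auto-regressive part)
# and lags of the extra variable (distributed-lag part)
n_lags = 2
lagged = {
    f"{col}(t-{k})": df[col].shift(k)
    for col in df.columns
    for k in range(1, n_lags + 1)
}
design = pd.DataFrame(lagged).assign(target=df["y"]).dropna()

ardl = LinearRegression().fit(design.drop(columns="target"), design["target"])
coefs = pd.Series(ardl.coef_, index=design.columns.drop("target"))
```

On this simulated series the fitted coefficients for `y(t-1)` and `x(t-1)` should land close to the values used to generate the data, while the second lags contribute little.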

Implementation

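Multivariate time series often describe the sales of many related products. Here we use the wine sales time series as an example, although the ARDL approach also applies to domains other than retail.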

Transforming the time series

First, transform the time series with the script below.

import pandas as pd

# https://github.com/vcerqueira/blog/
from src.tde import time_delay_embedding

wine = pd.read_csv('data/wine_sales.csv', parse_dates=['date'])

# setting date as index
wine.set_index('date', inplace=True)

# you can simulate some data with the following code
# import numpy as np
# wine = pd.DataFrame(np.random.random((100, 6)),
#                     columns=['Fortified', 'Drywhite', 'Sweetwhite',
#                              'Red', 'Rose', 'Sparkling'])

# create data set with lagged features using time delay embedding
wine_ds = []
for col in wine:
    col_df = time_delay_embedding(wine[col], n_lags=12, horizon=6)
    wine_ds.append(col_df)

# concatenating all variables
wine_df = pd.concat(wine_ds, axis=1).dropna()

# defining target (Y) and explanatory variables (X)
# (raw strings keep the regex escapes intact)
predictor_variables = wine_df.columns.str.contains(r'\(t\-')
target_variables = wine_df.columns.str.contains(r'Sparkling\(t\+')

X = wine_df.iloc[:, predictor_variables]
Y = wine_df.iloc[:, target_variables]

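The time_delay_embedding function is applied to each variable in the time series (the for loop over the columns of wine). The per-variable results are then concatenated into a single data set with pd.concat.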
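The script imports time_delay_embedding from the author's repository, and its source is not shown here. The sketch below is a hypothetical stand-in, named `tde_sketch`, that is merely consistent with how the function is called: it assumes lag columns are named `name(t-k)` and future columns `name(t+k)`, which is what the column filters in the script look for. The simulated two-column data set is also an assumption.

```python
import numpy as np
import pandas as pd


def tde_sketch(series: pd.Series, n_lags: int, horizon: int) -> pd.DataFrame:
    """Hypothetical stand-in for time_delay_embedding (not the author's code).

    Returns one row per time step t with the n_lags most recent known values,
    named 'name(t-k)', and the next `horizon` values, named 'name(t+k)'.
    """
    name = series.name
    cols = {}
    # past values, oldest first: (t-(n_lags-1)), ..., (t-0)
    for k in range(n_lags - 1, -1, -1):
        cols[f"{name}(t-{k})"] = series.shift(k)
    # future values: (t+1), ..., (t+horizon)
    for k in range(1, horizon + 1):
        cols[f"{name}(t+{k})"] = series.shift(-k)
    return pd.DataFrame(cols)


# simulated stand-in for the wine data (assumption: only two of the columns)
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.random((40, 2)), columns=["Red", "Sparkling"])

# embed every variable, then join the results side by side
toy_df = pd.concat(
    [tde_sketch(toy[col], n_lags=3, horizon=2) for col in toy], axis=1
).dropna()

# lags of every variable are predictors; future Sparkling values are targets
is_lag = toy_df.columns.str.contains(r"\(t-")
is_target = toy_df.columns.str.contains(r"Sparkling\(t\+")
X_toy = toy_df.loc[:, is_lag]
Y_toy = toy_df.loc[:, is_target]
```

Rows at the start and end of the series are dropped because their lags or future values are not available, which is why `dropna()` follows the concatenation.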

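The explanatory variables (X) are the last 12 known values of each variable at each time step (the columns whose names contain "(t-"). Here is how they look for lag t-1 (other lags omitted for brevity):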

The target variables (Y) are the columns whose names contain "Sparkling(t+". They refer to the next 6 values of future sales:

Building the model

With the data prepared, we can build the model. We use a random forest in a simple train-and-test cycle.

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as mae
from sklearn.ensemble import RandomForestRegressor

# train/test split
X_tr, X_ts, Y_tr, Y_ts = train_test_split(X, Y, test_size=0.3, shuffle=False)

# fitting a RF model
model = RandomForestRegressor()
model.fit(X_tr, Y_tr)

# getting forecasts for the test set
preds = model.predict(X_ts)

# computing MAE error
print(mae(Y_ts, preds))
# 288.13

After fitting the model (model.fit), we get the forecasts for the test set (model.predict). The model has a mean absolute error of 288.13.
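The single error figure averages over all forecast horizons. One possible way to inspect each horizon separately is the `multioutput="raw_values"` option of scikit-learn's `mean_absolute_error`. The sketch below uses simulated arrays (an assumption) in place of the wine data, so its numbers say nothing about the real series.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# simulated stand-ins for X and Y (assumption): 120 rows, 8 lag features,
# 3 forecast horizons
X_sim = rng.random((120, 8))
Y_sim = rng.random((120, 3))

# shuffle=False keeps chronological order: train on the past, test on the future
X_tr_sim, X_ts_sim, Y_tr_sim, Y_ts_sim = train_test_split(
    X_sim, Y_sim, test_size=0.3, shuffle=False
)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_tr_sim, Y_tr_sim)
preds_sim = rf.predict(X_ts_sim)

# one MAE per forecast horizon instead of a single average
mae_per_horizon = mean_absolute_error(Y_ts_sim, preds_sim, multioutput="raw_values")
mae_overall = mean_absolute_error(Y_ts_sim, preds_sim)
```

With the default settings the overall MAE is the plain average of the per-horizon values, so the breakdown shows which horizons drive the error.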

Choosing the number of lags

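The baseline above uses 12 lags of each variable as explanatory variables. This is set by the n_lags parameter of the time_delay_embedding function. So how should this parameter be chosen?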

It is hard to say a priori how many values to include, because that depends on the input data and on the specific variable.

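A simple way to approach this is feature selection. Start with a fairly large number of lags, then adjust that number according to importance scores or forecasting performance. Alternatively, treat it as a hyperparameter and run a search such as a grid search.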
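As a rough sketch of the search option, the loop below compares a few candidate lag counts on a held-out validation period. It is a plain loop rather than scikit-learn's GridSearchCV, and the helper `lag_matrix`, the simulated seasonal series, and the candidate values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error


def lag_matrix(series: pd.Series, n_lags: int) -> pd.DataFrame:
    """Hypothetical helper: lagged copies of `series` plus a one-step-ahead target."""
    cols = {f"lag_{k}": series.shift(k) for k in range(n_lags)}
    cols["target"] = series.shift(-1)
    return pd.DataFrame(cols).dropna()


# simulated seasonal series with period 12 (assumption)
rng = np.random.default_rng(7)
t = np.arange(240)
sales = pd.Series(
    10 + 3 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.3, size=t.size)
)

candidates = [2, 6, 12, 18]
n_valid = 48  # the last 48 rows form the same validation period for every candidate
scores = {}
for n_lags in candidates:
    data = lag_matrix(sales, n_lags)
    train, valid = data.iloc[:-n_valid], data.iloc[-n_valid:]
    candidate_model = RandomForestRegressor(n_estimators=50, random_state=0)
    candidate_model.fit(train.drop(columns="target"), train["target"])
    valid_preds = candidate_model.predict(valid.drop(columns="target"))
    scores[n_lags] = mean_absolute_error(valid["target"], valid_preds)

# candidate with the lowest validation MAE
best_n_lags = min(scores, key=scores.get)
```

The validation rows come from the end of the series, so every candidate is scored on the same future period and no future data leaks into training.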

Here we demonstrate a simple version of this process: selecting the top 10 features according to the random forest importance scores.

# getting importance scores from previous model
importance_scores = pd.Series(dict(zip(X_tr.columns, model.feature_importances_)))

# getting top 10 features
top_10_features = importance_scores.sort_values(ascending=False)[:10]
top_10_features_nm = top_10_features.index

X_tr_top = X_tr[top_10_features_nm]
X_ts_top = X_ts[top_10_features_nm]

# re-fitting the model
model_top_features = RandomForestRegressor()
model_top_features.fit(X_tr_top, Y_tr)

# getting forecasts for the test set
preds_topf = model_top_features.predict(X_ts_top)

# computing MAE error
print(mae(Y_ts, preds_topf))
# 274.36

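The model trained on the top 10 features forecasts better than the model trained on all predictors (MAE of 274.36 versus 288.13). Here is the importance of these features: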