---
title: M5 dataset
keywords: fastai
sidebar: home_sidebar
summary: "Download and evaluate the M5 dataset."
description: "Download and evaluate the M5 dataset."
nb_path: "nbs/data_datasets__m5.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
{% endraw %}

Download data class

{% raw %}

class M5[source]

M5(source_url: str = 'https://github.com/Nixtla/m5-forecasts/raw/main/datasets/m5.zip')

{% endraw %} {% raw %}
{% endraw %} {% raw %}
Y_df, X_df, S_df = M5.load('./data')
{% endraw %}
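The three returned frames split the dataset by role: `Y_df` holds the target series, `X_df` the temporal exogenous variables, and `S_df` the static attributes of each series. The following quick inspection is an illustrative sketch, assuming the frames loaded above.

{% raw %}
# Illustrative sketch, assuming Y_df, X_df, S_df were loaded with M5.load as above.
print(Y_df.shape, X_df.shape, S_df.shape)        # rows and columns of each frame
print(Y_df['ds'].min(), '->', Y_df['ds'].max())  # date range covered by the target series
{% endraw %}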

Test number of series

{% raw %}
n_series = 30_490
assert Y_df['unique_id'].unique().size == n_series
assert X_df['unique_id'].unique().size == n_series
assert S_df.shape[0] == n_series
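# Extra illustrative check (not part of the original tests): the target and static
# frames should describe the same set of series.
assert set(Y_df['unique_id'].unique()) == set(S_df['unique_id'].unique())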
{% endraw %} {% raw %}
Y_df.head()
|   | unique_id        | ds         | y   |
|---|------------------|------------|-----|
| 0 | FOODS_1_001_CA_1 | 2011-01-29 | 3.0 |
| 1 | FOODS_1_001_CA_1 | 2011-01-30 | 0.0 |
| 2 | FOODS_1_001_CA_1 | 2011-01-31 | 0.0 |
| 3 | FOODS_1_001_CA_1 | 2011-02-01 | 1.0 |
| 4 | FOODS_1_001_CA_1 | 2011-02-02 | 4.0 |
{% endraw %} {% raw %}
X_df.head()
|   | unique_id        | ds         | event_name_1 | event_type_1 | event_name_2 | event_type_2 | snap_CA | snap_TX | snap_WI | sell_price |
|---|------------------|------------|--------------|--------------|--------------|--------------|---------|---------|---------|------------|
| 0 | FOODS_1_001_CA_1 | 2011-01-29 | NaN          | NaN          | NaN          | NaN          | 0       | 0       | 0       | 2.0        |
| 1 | FOODS_1_001_CA_1 | 2011-01-30 | NaN          | NaN          | NaN          | NaN          | 0       | 0       | 0       | 2.0        |
| 2 | FOODS_1_001_CA_1 | 2011-01-31 | NaN          | NaN          | NaN          | NaN          | 0       | 0       | 0       | 2.0        |
| 3 | FOODS_1_001_CA_1 | 2011-02-01 | NaN          | NaN          | NaN          | NaN          | 1       | 1       | 0       | 2.0        |
| 4 | FOODS_1_001_CA_1 | 2011-02-02 | NaN          | NaN          | NaN          | NaN          | 1       | 0       | 1       | 2.0        |
{% endraw %} {% raw %}
S_df.head()
|   | unique_id        | item_id     | dept_id | cat_id | store_id | state_id |
|---|------------------|-------------|---------|--------|----------|----------|
| 0 | FOODS_1_001_CA_1 | FOODS_1_001 | FOODS_1 | FOODS  | CA_1     | CA       |
| 1 | FOODS_1_001_CA_2 | FOODS_1_001 | FOODS_1 | FOODS  | CA_2     | CA       |
| 2 | FOODS_1_001_CA_3 | FOODS_1_001 | FOODS_1 | FOODS  | CA_3     | CA       |
| 3 | FOODS_1_001_CA_4 | FOODS_1_001 | FOODS_1 | FOODS  | CA_4     | CA       |
| 4 | FOODS_1_001_TX_1 | FOODS_1_001 | FOODS_1 | FOODS  | TX_1     | TX       |
{% endraw %}
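For modeling, the three frames can be combined on their keys. The sketch below is an illustration rather than part of the library API: `Y_df` and `X_df` share the (`unique_id`, `ds`) keys, and `S_df` has one row of static attributes per `unique_id`.

{% raw %}
# Illustrative join, assuming Y_df, X_df, S_df from M5.load above.
full_df = (
    Y_df
    .merge(X_df, on=['unique_id', 'ds'], how='left')  # calendar events, SNAP flags, sell prices
    .merge(S_df, on='unique_id', how='left')          # item / store / state identifiers
)
full_df.head()
{% endraw %}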

Evaluation class

{% raw %}

class M5Evaluation[source]

M5Evaluation()

{% endraw %} {% raw %}
{% endraw %}

URL-based evaluation

The evaluate method of the M5Evaluation class can receive a URL pointing to a submission to the M5 competition.

The reference scores used to check the on-the-fly evaluation come from the competition's official evaluation.

{% raw %}
from fastcore.test import test_close  # numeric closeness assertion used below

m5_winner_url = 'https://github.com/Nixtla/m5-forecasts/raw/main/forecasts/0001 YJ_STU.zip'
winner_evaluation = M5Evaluation.evaluate('data', m5_winner_url)
# Check that the computed score matches the official evaluation
test_close(winner_evaluation.loc['Total'].item(), 0.520, eps=1e-3)
winner_evaluation
{% endraw %}

Pandas-based evaluation

The evaluate method can also receive a pandas DataFrame of forecasts.

{% raw %}
m5_second_place_url = 'https://github.com/Nixtla/m5-forecasts/raw/main/forecasts/0002 Matthias.zip'
m5_second_place_forecasts = M5Evaluation.load_benchmark('data', m5_second_place_url)
second_place_evaluation = M5Evaluation.evaluate('data', m5_second_place_forecasts)
# Check that the computed score matches the official evaluation
test_close(second_place_evaluation.loc['Total'].item(), 0.528, eps=1e-3)
second_place_evaluation
{% endraw %}
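Both evaluations are plain pandas DataFrames indexed by aggregation level, so they can be compared directly. The sketch below is illustrative only and assumes the `winner_evaluation` and `second_place_evaluation` objects computed above.

{% raw %}
import pandas as pd

# Illustrative side-by-side comparison of the two evaluations computed above.
comparison = pd.concat(
    [winner_evaluation.add_suffix('_winner'),
     second_place_evaluation.add_suffix('_second')],
    axis=1,
)
comparison.loc[['Total']]
{% endraw %}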

Kaggle-Competition-M5 References

The evaluation metric of the Favorita Kaggle competition was the normalized weighted root mean squared logarithmic error (NWRMSLE). Perishable items received a weight of 1.25 in the score; all other items received a weight of 1.0.

{% raw %} $$ \mathrm{NWRMSLE} = \sqrt{\frac{\sum^{n}_{i=1} w_{i}\left(\log(\hat{y}_{i}+1) - \log(y_{i}+1)\right)^{2}}{\sum^{n}_{i=1} w_{i}}} $$ {% endraw %}
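For reference, the formula above translates directly into a few lines of numpy. This is a sketch of the metric itself, not the library's implementation; the `weights` argument corresponds to the per-item scores $w_i$ (1.25 for perishables, 1.0 otherwise).

{% raw %}
import numpy as np

def nwrmsle(y_true, y_pred, weights):
    """Sketch of the NWRMSLE formula above; `weights` are the per-item scores w_i."""
    y_true, y_pred, weights = map(np.asarray, (y_true, y_pred, weights))
    sq_log_err = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return np.sqrt(np.sum(weights * sq_log_err) / np.sum(weights))

# Toy example: the first item is perishable (weight 1.25), the others are not.
nwrmsle(y_true=[3.0, 0.0, 2.0], y_pred=[2.5, 0.0, 3.0], weights=[1.25, 1.0, 1.0])
{% endraw %}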

| Kaggle Competition Forecasting Methods | 16D ahead NWRMSLE |
|----------------------------------------|-------------------|
| LGBM [1]                               | 0.5091            |
| Seq2Seq WaveNet [2]                    | 0.5129            |