Cross-validation logic used by LightGBM

lgb.cv(
  params = list(),
  data,
  nrounds = 100L,
  nfold = 3L,
  obj = NULL,
  eval = NULL,
  verbose = 1L,
  record = TRUE,
  eval_freq = 1L,
  showsd = TRUE,
  stratified = TRUE,
  folds = NULL,
  init_model = NULL,
  early_stopping_rounds = NULL,
  callbacks = list(),
  reset_data = FALSE,
  serializable = TRUE,
  eval_train_metric = FALSE
)

Arguments

params

a list of parameters. See the "Parameters" section of the documentation for a list of parameters and valid values.

data

a lgb.Dataset object, used for training. Some functions, such as lgb.cv, may allow you to pass other types of data, such as a matrix, and then separately supply label as a keyword argument.

nrounds

number of training rounds

nfold

the original dataset is randomly partitioned into nfold equal-size subsamples.

obj

objective function, can be a character string or a custom objective function. Examples include regression, regression_l1, huber, binary, lambdarank, multiclass.
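For the custom-function case, here is a minimal sketch of a binary log-loss objective: the function receives the raw predictions and the training lgb.Dataset and returns the gradient and hessian of the loss. The use of get_field() to read the labels is an assumption about the current R API (older releases used getinfo()), and dtrain refers to the object built in the Examples section below.

# a minimal sketch of a custom objective (binary log loss), not the library's
# canonical example; get_field() is assumed available in this package version
logloss_obj <- function(preds, dtrain) {
  labels <- get_field(dtrain, "label")  # assumed label accessor
  probs <- 1.0 / (1.0 + exp(-preds))    # preds are raw scores for custom objectives
  grad <- probs - labels                # first derivative of the loss w.r.t. preds
  hess <- probs * (1.0 - probs)         # second derivative of the loss w.r.t. preds
  list(grad = grad, hess = hess)
}

model <- lgb.cv(
  params = list(learning_rate = 0.1)
  , data = dtrain
  , nrounds = 10L
  , obj = logloss_obj
)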

eval

evaluation function(s). This can be a character vector, a function, or a list with a mixture of strings and functions.

  • a. character vector: If you provide a character vector to this argument, it should contain strings with valid evaluation metrics. See the "metric" section of the documentation for a list of valid metrics.

  • b. function: You can provide a custom evaluation function. This function should accept the keyword arguments preds and dtrain and should return a named list with three elements (a sketch follows this list):

    • name: A string with the name of the metric, used for printing and storing results.

    • value: A single number indicating the value of the metric for the given predictions and true values.

    • higher_better: A boolean indicating whether higher values indicate a better fit. For example, this would be FALSE for metrics like MAE or RMSE.

  • c. list: If a list is given, it should only contain character vectors and functions. These should follow the requirements from the descriptions above.
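As an illustration of option b., here is a minimal sketch of a custom evaluation function computing the classification error rate. It assumes preds are the model's raw scores (hence the comparison against 0) and that get_field() is the label accessor; both are assumptions about the current R API, and dtrain refers to the object built in the Examples section below.

# a minimal sketch of a custom evaluation function (classification error)
error_eval <- function(preds, dtrain) {
  labels <- get_field(dtrain, "label")            # assumed label accessor
  err <- mean(as.numeric(preds > 0.0) != labels)  # preds assumed to be raw scores
  list(name = "error", value = err, higher_better = FALSE)
}

model <- lgb.cv(
  params = list(objective = "binary")
  , data = dtrain
  , nrounds = 10L
  , eval = error_eval
)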

verbose

verbosity for output. If <= 0 and valids has been provided, this also disables the printing of evaluation results during training.

record

Boolean, TRUE will record iteration messages to booster$record_evals

eval_freq

frequency of evaluation output; only effective when verbose > 0 and valids has been provided

showsd

boolean, whether to show the standard deviation of the cross-validation results. This parameter defaults to TRUE. Setting it to FALSE can lead to a slight speedup by avoiding unnecessary computation.

stratified

a boolean indicating whether the sampling of folds should be stratified by the values of the outcome labels.

folds

list, provides the possibility of using a list of pre-defined CV folds (each element must be a vector of test fold indices). When folds are supplied, the nfold and stratified parameters are ignored.
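A minimal sketch of supplying pre-defined folds; the fold construction is purely illustrative, and agaricus.train and dtrain refer to the objects built in the Examples section below.

# build three folds by hand: each list element is a vector of test indices
set.seed(708L)
n <- nrow(agaricus.train$data)
shuffled <- sample(seq_len(n))
my_folds <- split(shuffled, rep(1:3, length.out = n))

model <- lgb.cv(
  params = list(objective = "regression", metric = "l2")
  , data = dtrain
  , nrounds = 5L
  , folds = my_folds   # nfold and stratified are ignored when folds is supplied
)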

init_model

path of a model file or an lgb.Booster object; training will continue from this model

early_stopping_rounds

int. Activates early stopping. When this parameter is non-null, training will stop if the evaluation of any metric on any validation set fails to improve for early_stopping_rounds consecutive boosting rounds. If training stops early, the returned model will have attribute best_iter set to the iteration number of the best iteration.
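A minimal sketch of activating early stopping; the parameter values are illustrative only, and dtrain refers to the object built in the Examples section below.

model <- lgb.cv(
  params = list(objective = "regression", metric = "l2")
  , data = dtrain
  , nrounds = 1000L
  , nfold = 5L
  , early_stopping_rounds = 10L  # stop after 10 rounds without improvement in l2
)
model$best_iter  # iteration number of the best iteration, per the description above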

callbacks

list of callback functions that are applied at each iteration.

reset_data

Boolean, setting it to TRUE (not the default value) will transform the booster model into a predictor model, which frees up memory and the original datasets

serializable

whether to make the resulting objects serializable through functions such as save or saveRDS (see section "Model serialization").
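A minimal sketch of the intended round trip, assuming the default serializable = TRUE and the dtrain object from the Examples section below; whether the restored object is immediately usable may depend on the package version.

cv_model <- lgb.cv(
  params = list(objective = "regression", metric = "l2")
  , data = dtrain
  , nrounds = 5L
  , serializable = TRUE  # the default; keeps enough state for R serialization
)
saveRDS(cv_model, "cv_model.rds")    # standard R serialization
restored <- readRDS("cv_model.rds")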

eval_train_metric

boolean, whether to add the cross-validation results on the training data. This parameter defaults to FALSE. Setting it to TRUE will increase run time.

Value

a trained model lgb.CVBooster.

Early Stopping

"Early stopping" refers to stopping the training process if the model's performance on a given validation set does not improve for several consecutive iterations.

If multiple arguments are given to eval, their order will be preserved. If you enable early stopping by setting early_stopping_rounds in params, by default all metrics will be considered for early stopping.

If you want to consider only the first metric for early stopping, pass first_metric_only = TRUE in params. Note that if you also specify metric in params, that metric will be considered the "first" one. If you omit metric, a default metric will be used based on your choice for the parameter obj (keyword argument) or objective (passed into params).
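A minimal sketch of restricting early stopping to the first metric. It assumes the R package accepts a character vector for metric (with auc treated as the "first" metric because it is listed first) and reuses the dtrain object from the Examples section below.

params <- list(
  objective = "binary"
  , metric = c("auc", "binary_logloss")
  , first_metric_only = TRUE   # only auc decides when to stop
)
model <- lgb.cv(
  params = params
  , data = dtrain
  , nrounds = 100L
  , nfold = 3L
  , early_stopping_rounds = 5L
)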

Examples

# \donttest{
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(
  objective = "regression"
  , metric = "l2"
  , min_data = 1L
  , learning_rate = 1.0
  , num_threads = 2L
)
model <- lgb.cv(
  params = params
  , data = dtrain
  , nrounds = 5L
  , nfold = 3L
)
#> [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000630 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 232
#> [LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 116
#> [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000627 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 232
#> [LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 116
#> [LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000480 seconds.
#> You can set `force_row_wise=true` to remove the overhead.
#> And if memory is not enough, you can set `force_col_wise=true`.
#> [LightGBM] [Info] Total Bins 232
#> [LightGBM] [Info] Number of data points in the train set: 4342, number of used features: 116
#> [LightGBM] [Info] Start training from score 0.474436
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Info] Start training from score 0.490557
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Info] Start training from score 0.481345
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [1]:  valid's l2:0.000307078+0.000434274 
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [2]:  valid's l2:0.000307078+0.000434274 
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [3]:  valid's l2:0.000307078+0.000434274 
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [4]:  valid's l2:0.000307078+0.000434274 
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
#> [5]:  valid's l2:0.000307078+0.000434274 
# }