Data preparator for LightGBM datasets with rules (integer) — lgb.convert_with_rules • lightgbm

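Attempts to prepare a clean dataset that can be passed to lgb.Dataset. Factor, character, and logical columns are converted to integer. Missing values in factor and character columns are filled with 0L. Missing values in logical columns are filled with -1L.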
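The conversion described above can be approximated in base R. The following is a minimal sketch, not the package's internal code: the helper `to_int` and the toy data frame `df` are hypothetical, and it assumes that levels are numbered 1, 2, ... in the order `factor()` assigns them and that logicals map as `as.integer()` would.

```r
# Illustrative only: mimics the documented behaviour with base R.
df <- data.frame(
  color = factor(c("red", NA, "blue")),
  city = c("Oslo", "Lima", NA),
  flag = c(TRUE, NA, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of lightgbm
to_int <- function(x) {
  if (is.logical(x)) {
    out <- as.integer(x)          # FALSE -> 0L, TRUE -> 1L (assumed)
    out[is.na(out)] <- -1L        # missing logical -> -1L
  } else {
    out <- as.integer(factor(x))  # each level gets an integer code
    out[is.na(out)] <- 0L         # missing factor/character -> 0L
  }
  out
}

converted <- as.data.frame(lapply(df, to_int))
str(converted)
```

Every column of `converted` is an integer vector, and the missing entries are marked with 0L or -1L rather than NA.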

This function returns, and optionally accepts, "rules" that describe exactly how to convert the values in each column.

Columns that contain only NA values are converted by this function but do not appear in the returned rules.

NOTE: In previous releases of LightGBM, this function was called lgb.prepare_rules2.

lgb.convert_with_rules(data, rules = NULL)

Arguments

data: A data.frame or data.table to prepare.
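rules: A set of rules from the data preparator, if one has already been used. This should be an R list whose names are column names in data and whose values are named character vectors. The names of each vector are the original column values, and its values are the new values that replace them.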
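To make the shape of rules concrete, here is a small base R sketch of a rule list and a by-name lookup. It only illustrates the structure and is not how the package applies rules internally. The rule values are written as integers, as in this page's own examples, and `values` is made-up input.

```r
# Hypothetical rule set for a single column named "Species"
rules <- list(
  Species = c("setosa" = 1L, "versicolor" = 2L, "virginica" = 3L)
)

# Made-up column values, including one with no rule and one missing value
values <- c("virginica", "setosa", "unseen", NA)

# Look each value up by name; anything without a rule comes back as NA
mapped <- unname(rules[["Species"]][values])

# Unknown or missing values fall back to 0L
mapped[is.na(mapped)] <- 0L
mapped
```

Keying the vector by the original value is what lets the same rules be reused on a second dataset, so that each value receives the same integer code both times.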

Value

A list containing the cleaned dataset (data) and the rules (rules). Note that the data must be converted to a matrix (as.matrix) before it can be passed to lgb.Dataset.
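The matrix conversion mentioned above might look like the following sketch. Here `cleaned` is a stand-in for the data element of the returned list, built with base R so the snippet runs without the package. Treating Species as a zero-based label is an assumption made for illustration, and the lgb.Dataset call is skipped when lightgbm is not installed.

```r
# Stand-in for the converted data: all columns numeric or integer
data(iris)
cleaned <- iris
cleaned$Species <- as.integer(cleaned$Species)

# Feature columns as a numeric matrix
features <- as.matrix(cleaned[, 1L:4L])

# Hypothetical choice: use Species as a zero-based class label
label <- cleaned$Species - 1L

# Build the Dataset only when lightgbm is available
if (requireNamespace("lightgbm", quietly = TRUE)) {
  dtrain <- lightgbm::lgb.Dataset(data = features, label = label)
}
```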

Examples

# \donttest{
data(iris)

str(iris)
#> 'data.frame':	150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

new_iris <- lgb.convert_with_rules(data = iris)
str(new_iris$data)
#> 'data.frame':	150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : int  1 1 1 1 1 1 1 1 1 1 ...

data(iris) # Reload the original iris dataset
iris$Species[1L] <- "NEW FACTOR" # Introduce junk factor (NA)
#> Warning: invalid factor level, NA generated

# Use conversion using known rules
# Unknown factors become 0, excellent for sparse datasets
newer_iris <- lgb.convert_with_rules(data = iris, rules = new_iris$rules)

# Unknown factor is now zero, perfect for sparse datasets
newer_iris$data[1L, ] # Species became 0 as it is an unknown factor
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2       0

newer_iris$data[1L, 5L] <- 1.0 # Put back real initial value

# Is the newly created dataset equal? YES!
all.equal(new_iris$data, newer_iris$data)
#> [1] TRUE

# Can we test our own rules?
data(iris) # Reload the original iris dataset

# We remapped values differently
personal_rules <- list(
  Species = c(
    "setosa" = 3L
    , "versicolor" = 2L
    , "virginica" = 1L
  )
)
newest_iris <- lgb.convert_with_rules(data = iris, rules = personal_rules)
str(newest_iris$data) # SUCCESS!
#> 'data.frame':	150 obs. of  5 variables:
#>  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#>  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#>  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#>  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#>  $ Species     : int  3 3 3 3 3 3 3 3 3 3 ...
# }