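In Spark 2.0, MLlib is split into two packages: spark.mllib contains the RDD-based API, and spark.ml contains the DataFrame-based API. The RDD-based API is in maintenance mode and receives no new features, so the spark.ml package is the recommended one. According to the Spark 2.0 documentation, the RDD-based API was expected to be deprecated around Spark 2.2 and removed in Spark 3.0.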
Main concepts in Pipelines
- DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
- Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
- Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
- Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML workflow.
- Parameter: All Transformers and Estimators now share a common API for specifying parameters.
Transformers
A Transformer is an abstraction that includes feature transformers and learned models. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:
- A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
- A learned model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.
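As a small illustration of a feature transformer, the sketch below runs a Tokenizer on a two-row DataFrame. The object name TransformerSketch, the helper addWords, the sample rows, and the local[*] master are assumptions made for this illustration; only Tokenizer and its setters come from spark.ml.

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.{DataFrame, SparkSession}

object TransformerSketch {
  // Hypothetical helper: appends a "words" column to a DataFrame
  // that already has a "text" column.
  def addWords(df: DataFrame): DataFrame =
    new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .transform(df)

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch can run without a cluster.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("TransformerSketch")
      .getOrCreate()

    val docs = spark.createDataFrame(Seq(
      (0L, "a b c d e spark"),
      (1L, "hadoop mapreduce")
    )).toDF("id", "text")

    val withWords = addWords(docs)
    // The input columns are kept and "words" is appended.
    println(withWords.columns.mkString(", ")) // prints: id, text, words
    withWords.show(false)

    spark.stop()
  }
}
```

The input DataFrame is not modified; transform() returns a new DataFrame that carries the extra column.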
Estimators
An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer. For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
Properties of pipeline components
Transformer.transform()s and Estimator.fit()s are both stateless. In the future, stateful algorithms may be supported via alternative concepts.
Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).
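The unique ID can be inspected directly. The sketch below is a minimal illustration, assuming spark-mllib is on the classpath; the object name UidSketch and the helper distinctUids are made up for this example. No SparkSession is needed because nothing is fitted.

```scala
import org.apache.spark.ml.classification.LogisticRegression

object UidSketch {
  // Hypothetical helper: true when two fresh instances get different IDs.
  def distinctUids(): Boolean =
    new LogisticRegression().uid != new LogisticRegression().uid

  def main(args: Array[String]): Unit = {
    val lr1 = new LogisticRegression()
    val lr2 = new LogisticRegression()

    // Each instance receives its own generated uid.
    println(lr1.uid)
    println(lr2.uid)

    // Every Param records the uid of the instance that owns it,
    // which is how a ParamMap can tell lr1.maxIter from lr2.maxIter.
    println(lr1.maxIter.parent == lr1.uid)
    println(lr1.maxIter.parent == lr2.uid)
  }
}
```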
Pipeline
In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:
- Split each document's text into words.
- Convert each document's words into a numerical feature vector.
- Learn a prediction model using the feature vectors and labels.
MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. We will use this simple workflow as a running example in this section.
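How it works
A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage. For a Transformer stage, the transform() method is called on the DataFrame. For an Estimator stage, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer's transform() method is then called on the DataFrame.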
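In the official guide's figure of training-time usage, the top row represents a Pipeline with three stages. The first two stages, Tokenizer and HashingTF, are Transformers (blue). The third stage, LogisticRegression, is an Estimator (red). The bottom row represents the data flowing through the Pipeline, where cylinders indicate DataFrames.
The Pipeline.fit() method is called on the original DataFrame, which contains raw text and labels.
Tokenizer.transform() splits the text into words and adds a words column to the DataFrame. HashingTF.transform() converts the words column into feature vectors and adds a column with those vectors to the DataFrame.
Since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. If the Pipeline had more stages, it would call the LogisticRegressionModel's transform() method on the DataFrame before passing the DataFrame to the next stage.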
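The second figure in the official guide shows a PipelineModel. It has the same three stages as the original Pipeline, but every Estimator in the original Pipeline has become a Transformer. When the PipelineModel's transform() method is called on a test dataset, the data pass through the stages in order; each stage's transform() method updates the dataset and passes it to the next stage.
Pipelines and PipelineModels help ensure that training and test data go through identical feature processing steps.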
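To make the Estimator-to-Transformer change concrete, the sketch below fits the three-stage Pipeline described above and then lists the stages of the resulting PipelineModel. The object name PipelineStagesSketch, the helper fitModel, and the local[*] master are assumptions for this illustration; the training rows are the same toy data used in the Pipeline example later in this post.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineStagesSketch {
  // Hypothetical helper: fits Tokenizer -> HashingTF -> LogisticRegression.
  def fitModel(spark: SparkSession): PipelineModel = {
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch can run without a cluster.
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("PipelineStagesSketch")
      .getOrCreate()

    val model = fitModel(spark)

    // The fitted pipeline keeps three stages; the last one is now a
    // LogisticRegressionModel (a Transformer), not the Estimator.
    model.stages.foreach(stage => println(stage.getClass.getSimpleName))
    println(model.stages(2).isInstanceOf[LogisticRegressionModel])

    // Test data go through the same tokenizing and hashing steps.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "apache hadoop")
    )).toDF("id", "text")
    println(model.transform(test).columns.mkString(", "))

    spark.stop()
  }
}
```

The printed column list should include the intermediate words and features columns alongside the prediction columns, since each stage appends to the DataFrame it receives.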
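Parameters
MLlib Estimators and Transformers use a uniform API for specifying parameters.
A Param is a named parameter. A ParamMap is a set of (parameter, value) pairs.
There are two main ways to pass parameters to an algorithm:
- Set parameters on an instance. For example, if lr is an instance of LogisticRegression, calling lr.setMaxIter(10) makes lr.fit() use at most 10 iterations.
- Pass a ParamMap to fit() or transform(). Any parameter in the ParamMap overrides a value previously specified via a setter method. Parameters belong to specific instances, so if there are two LogisticRegression instances lr1 and lr2, a single ParamMap can specify the maximum number of iterations for both: ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)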
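The override rule can be checked without training anything. The sketch below is an illustration under the assumption that fit(dataset, paramMap) behaves like calling copy(paramMap) on the estimator and fitting the copy, which is how I understand the Estimator base class to work; the object name ParamMapSketch and the helper effectiveMaxIter are made up for this example.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

object ParamMapSketch {
  // Hypothetical helper: the maxIter an estimator would use if the
  // given ParamMap were applied on top of its own settings.
  def effectiveMaxIter(lr: LogisticRegression, overrides: ParamMap): Int =
    lr.copy(overrides).getMaxIter

  def main(args: Array[String]): Unit = {
    // Both instances start with maxIter = 5, set through the setter.
    val lr1 = new LogisticRegression().setMaxIter(5)
    val lr2 = new LogisticRegression().setMaxIter(5)

    // One ParamMap carries a separate value for each instance.
    val paramMap = ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)

    println(effectiveMaxIter(lr1, paramMap)) // prints: 10
    println(effectiveMaxIter(lr2, paramMap)) // prints: 20

    // The original instance keeps the value set through the setter.
    println(lr1.getMaxIter) // prints: 5
  }
}
```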
Code examples
Example: Estimator, Transformer, and Param
package org.apache.spark.examples.ml

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object EstimatorTransformerParamExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("EstimatorTransformerParamExample")
      .getOrCreate()

    // Prepare training data from a list of (label, features) tuples.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    )).toDF("label", "features")

    // Create a LogisticRegression instance. This instance is an Estimator.
    val lr = new LogisticRegression()
    // Print out the parameters, documentation, and any default values.
    println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

    // We may set parameters using setter methods.
    lr.setMaxIter(10)
      .setRegParam(0.01)

    // Learn a LogisticRegression model. This uses the parameters stored in lr.
    val model1 = lr.fit(training)
    // Since model1 is a Model (i.e., a Transformer produced by an Estimator),
    // we can view the parameters it used during fit().
    // This prints the parameter (name: value) pairs, where names are unique IDs for this
    // LogisticRegression instance.
    println("Model 1 was fit using parameters: " + model1.parent.extractParamMap)

    // We may alternatively specify parameters using a ParamMap,
    // which supports several methods for specifying parameters.
    val paramMap = ParamMap(lr.maxIter -> 20)
      .put(lr.maxIter, 30) // Specify 1 Param. This overwrites the original maxIter.
      .put(lr.regParam -> 0.1, lr.threshold -> 0.55) // Specify multiple Params.

    // One can also combine ParamMaps.
    val paramMap2 = ParamMap(lr.probabilityCol -> "myProbability") // Change output column name.
    val paramMapCombined = paramMap ++ paramMap2

    // Now learn a new model using the paramMapCombined parameters.
    // paramMapCombined overrides all parameters set earlier via lr.set* methods.
    val model2 = lr.fit(training, paramMapCombined)
    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

    // Prepare test data.
    val test = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(-1.0, 1.5, 1.3)),
      (0.0, Vectors.dense(3.0, 2.0, -0.1)),
      (1.0, Vectors.dense(0.0, 2.2, -1.5))
    )).toDF("label", "features")

    // Make predictions on test data using the Transformer.transform() method.
    // LogisticRegression.transform will only use the 'features' column.
    // Note that model2.transform() outputs a 'myProbability' column instead of the usual
    // 'probability' column since we renamed the lr.probabilityCol parameter previously.
    model2.transform(test)
      .select("features", "label", "myProbability", "prediction")
      .collect()
      .foreach { case Row(features: Vector, label: Double, prob: Vector, prediction: Double) =>
        println(s"($features, $label) -> prob=$prob, prediction=$prediction")
      }

    spark.stop()
  }
}
If the example fails to run, see my previous post:
spark Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException
Example: Pipeline
package org.apache.spark.examples.ml

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object PipelineExample {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("PipelineExample")
      .getOrCreate()

    // Prepare training documents from a list of (id, text, label) tuples.
    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)

    // Now we can optionally save the fitted pipeline to disk
    model.write.overwrite().save("/tmp/spark-logistic-regression-model")

    // We can also save this unfit pipeline to disk
    pipeline.write.overwrite().save("/tmp/unfit-lr-model")

    // And load it back in during production
    val sameModel = PipelineModel.load("/tmp/spark-logistic-regression-model")

    // Prepare test documents, which are unlabeled (id, text) tuples.
    val test = spark.createDataFrame(Seq(
      (4L, "spark i j k"),
      (5L, "l m n"),
      (6L, "mapreduce spark"),
      (7L, "apache hadoop")
    )).toDF("id", "text")

    // Make predictions on test documents.
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }

    spark.stop()
  }
}
Reference: the official Spark documentation