Classification Decision Trees and Regression Decision Trees



What is a Decision Tree?

Based on the available dataset, a decision tree learns a hierarchy of if/else questions that ultimately leads to a decision. Decision trees are widely used models for classification and regression tasks in machine learning. Classification can range from binary to multi-class problems, and regression predicts values for test data (or new instances) after the model has been trained on the training dataset.

The algorithm discussed here is CART (Classification and Regression Trees), which builds decision trees for classification and prediction using scikit-learn's (sklearn) DecisionTreeClassifier and DecisionTreeRegressor classes located in sklearn.tree.

How does a DecisionTreeClassifier work?

A DecisionTreeClassifier takes the following parameters:

DecisionTreeClassifier(criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=42, splitter='best')

We mainly discuss the parameter criterion='gini'. Gini is the criterion the decision tree considers when splitting on the available features in the dataset (the quality of the split) in order to make a decision.
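As a quick illustration (a minimal sketch, not taken from the original article), the criterion is simply passed to the constructor; 'entropy' is the other impurity measure scikit-learn supports, and the variable names are illustrative:

# A minimal sketch: two classifiers, one using Gini impurity (the default
# criterion) and one using entropy.
from sklearn.tree import DecisionTreeClassifier

clf_gini = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_entropy = DecisionTreeClassifier(criterion='entropy', random_state=42)

# Both are unfitted estimators; calling .fit(X, y) grows the tree by repeatedly
# choosing the split that most reduces the chosen impurity measure.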

Consider the table below to decide which feature (Gender or Occupation) gets the first split in the decision tree. The quantitative measure of the quality of a split is the Gini impurity: what is the probability that we classify a data point incorrectly?




[Table: 8 sample records with the features Gender and Occupation and the target column Buying Apparel (4 Yes / 4 No)]

Step 1: Calculate the Gini impurity of the target column (Buying Apparel) before any split.

Gini overall = 1 - (probability of No)² - (probability of Yes)²

This gives the overall Gini as 1 - ((4/8)² + (4/8)²) = 0.5.

Step 2: Select a feature (in this case Gender) and calculate the Gini split (the amount of impurity for that particular split).

For M: 1 - ((2/5)² + (3/5)²) = 0.48

For F: 1 - ((2/3)² + (1/3)²) ≈ 0.44

For Gender (weighted average): (5/8) * 0.48 + (3/8) * 0.44 ≈ 0.47. Hence the Gini split of the Gender column ≈ 0.47.

Step 3: Calculate the Gini gain (the amount of impurity removed by splitting on a particular feature).

Gini gain = overall Gini - Gini split = 0.5 - 0.47 = 0.03

Similarly, the Gini gain for Occupation = 0.
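The arithmetic in Steps 1-3 can be reproduced in a few lines of Python. This is a small sketch; the per-group class counts (Male: 3 Yes / 2 No, Female: 1 Yes / 2 No) are assumptions consistent with the fractions used above, not values quoted from the original table.

# Reproducing Steps 1-3 for the Gender feature.
# Assumed counts: overall 4 Yes / 4 No, Male 3 Yes / 2 No, Female 1 Yes / 2 No.

def gini(counts):
    """Gini impurity = 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

overall = gini([4, 4])                              # Step 1: 0.50
gini_m = gini([3, 2])                               # 0.48
gini_f = gini([1, 2])                               # ~0.44
gini_split = (5 / 8) * gini_m + (3 / 8) * gini_f    # Step 2: ~0.47
gini_gain = overall - gini_split                    # Step 3: ~0.03

print(f"overall={overall:.2f}  split={gini_split:.2f}  gain={gini_gain:.2f}")
# overall=0.50  split=0.47  gain=0.03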

The first split of the tree happens on the Gender column, since its Gini gain is higher (the feature with the higher Gini gain is split first). The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) contains only a single target value (a single class or a single regression value). A leaf of the tree whose data points all share the same target value is called pure.

We can also check how the features are used for splitting (which feature is split first) by using the feature_importances_ attribute of DecisionTreeClassifier. DecisionTreeClassifier.feature_importances_ yields a 1D array containing a score for every feature; features with higher scores contribute more to the splits and typically appear nearer the top of the tree.
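As a brief illustration (a sketch using the iris dataset rather than the apparel example above), the attribute can be inspected after fitting:

# feature_importances_ is an attribute available only after fitting; it is a
# 1D array with one non-negative score per feature, and the scores sum to 1.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(random_state=42).fit(iris.data, iris.target)

for name, score in sorted(zip(iris.feature_names, tree.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")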


To avoid overfitting to the training dataset, a tree should have its unnecessary branches pruned (cut back), in particular so that it does not keep classifying all the way down to min_samples_split = 2 and min_samples_leaf = 1. There are two approaches: pre-pruning (limiting the tree before a split happens) and post-pruning (pruning the tree after classification has run all the way down to min_samples_leaf). Scikit-learn accommodates only pre-pruning, which can be done through three parameters: max_depth (limiting the number of levels in the tree), min_samples_leaf (the threshold number of samples at which splitting stops and a node becomes a leaf), and min_samples_split (the least number of samples that must be present at a node for a split to happen). By tuning (trial and error) the above parameters we can achieve good train and test accuracy without overfitting the model to a given dataset; a small sketch follows.
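The following minimal sketch (on synthetic data, purely for illustration; the parameter values are examples, not recommendations) shows how these three parameters limit tree growth compared with a fully grown tree:

# Pre-pruning sketch: the same data fit without and with growth limits.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X, y)
pruned_tree = DecisionTreeClassifier(max_depth=4,
                                     min_samples_leaf=5,
                                     min_samples_split=20,
                                     random_state=42).fit(X, y)

# The pre-pruned tree is much shallower and has far fewer leaves.
print("full tree   depth:", full_tree.get_depth(), "leaves:", full_tree.get_n_leaves())
print("pruned tree depth:", pruned_tree.get_depth(), "leaves:", pruned_tree.get_n_leaves())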

DecisionTreeClassifier for the breast_cancer data available in sklearn.datasets


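A hedged sketch of what this example typically looks like (exact accuracies depend on the train/test split and the scikit-learn version; the max_depth value is illustrative):

# breast_cancer example: an unrestricted tree versus a pre-pruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Fully grown tree: usually 100% train accuracy, a symptom of overfitting.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("full   train/test:", full.score(X_train, y_train), full.score(X_test, y_test))

# Pre-pruned tree: trades a little train accuracy for better generalization.
pruned = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
print("pruned train/test:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))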

Splitting in a decision tree is not sensitive to normalization or standardization of the features; the decision tree algorithm works well even when features are on different scales, or when there is a mix of discrete and continuous features.

The main downside of decision trees is that, even with the use of pre-pruning, they tend to overfit and provide poor generalization performance. Therefore, in most applications, ensemble methods are usually used in place of a single decision tree.


Translated from: https://medium.com/@jujjavarapurpratap/how-does-a-decision-tree-classifier-works-in-sci-kit-learn-f1678167ace9
