ML models

(Motivation)

You may have heard of the “no free lunch” (NFL) theorem, which states that no single algorithm is best for every dataset. An algorithm may perform well on one dataset but poorly on another. That is why there are so many machine learning algorithms to choose from.

How do we know which machine learning model is the best? We cannot know until we experiment and compare the performance of different models. But experimenting with different models can get messy, especially when you want to find the best parameters for your model with GridSearchCV.

For example, when we finish experimenting with RandomForestClassifier and switch to SVC, we might want to save the parameters of RandomForestClassifier in case we need to reproduce those results later. But how do we save these parameters efficiently?

Wouldn’t it be nice if the information about each model were saved in separate configuration files like below?

experiments/
├── data_preprocess.yaml
├── hyperparameters.yaml
└── model
    ├── random_forest.yaml
    └── svc.yaml

Each file under model specifies the parameters of that model.

When we want to use a specific model (let’s say RandomForestClassifier), all we need to do is run the training file and specify the model we want to train with model=modelname

python train.py model=random_forest

Being able to do this has helped me experiment with different models much faster, without worrying about losing the hyperparameters that a particular model used for GridSearchCV. This article will show you how to switch between models as effortlessly as above with Hydra.

(Step 1: Add Hydra)

Hydra is a framework for elegantly configuring complex applications. I wrote about how you can use Hydra in your data science projects here. Besides managing a single configuration file, Hydra also makes it much easier to work with multiple configuration files.

To follow this tutorial, simply clone this repo. This is the tree structure of our project:

.
├── data
├── experiments
│   ├── data_preprocess.yaml
│   ├── hyperparameters.yaml
│   └── model
│       ├── random_forest.yaml
│       └── svc.yaml
├── predict.py
├── preprocessing.py
├── train_pipeline.py
└── train.py

To access the config files in experiments, we add hydra.main as a decorator to our function in train.py. config_path specifies the directory that holds the config files, and config_name specifies the name of the config file to load.

This is what hyperparameters.yaml looks like

YAML is one of the easiest formats to read and write for config files. Nested keys become attributes of the config object, so to access the training data we simply use config.processed_data.text.train

In hyperparameters.yaml, we put general information related to training, such as the data path and the scoring metric for GridSearchCV, but not the hyperparameters of any specific model. We want to keep this file static while swapping in the model configuration that matches the model we use.

If we want to start training our model with SVC, we set the default model in hyperparameters.yaml with

defaults:
   - model: svc

(Step 2: Configure Machine Learning Models)

Now when we run python train.py, Hydra will look for the file svc.yaml under the model directory.

Thus, our next step is to create a file named svc.yaml (or any other name you like, as long as it matches the name listed under defaults) under the model directory like this

experiments/
├── data_preprocess.yaml
├── hyperparameters.yaml
└── model
    ├── random_forest.yaml
    └── svc.yaml

Our svc.yaml file will contain the name of the model and the hyperparameters for GridSearchCV to search over

Now when we run

python train.py

Hydra will automatically access the svc.yaml config within the model directory and use the parameters in svc.yaml!

If you want to use RandomForestClassifier instead, create a file named random_forest.yaml and add the information about RandomForestClassifier to it

Instead of changing the default model in the hyperparameters.yaml file, we can override the default model in the terminal!

python train.py model=random_forest
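
To make the lookup concrete, here is a plain-Python imitation of the rule described above: the default model applies unless a model=... argument is given, and the chosen name maps to a file under the model directory. This is only an illustration of the idea, not Hydra's implementation; the function select_model_config and its arguments are made up for this sketch.

```python
from pathlib import Path
from typing import List


def select_model_config(config_dir: str, default_model: str, cli_args: List[str]) -> Path:
    """Return the model config file that would be selected (illustration only)."""
    model = default_model
    for arg in cli_args:
        key, sep, value = arg.partition("=")
        if sep and key == "model":
            model = value  # a command-line choice wins over the default
    return Path(config_dir) / "model" / f"{model}.yaml"


# No override: the default set in hyperparameters.yaml is used.
print(select_model_config("experiments", "svc", []))
# Override from the terminal: python train.py model=random_forest
print(select_model_config("experiments", "svc", ["model=random_forest"]))
```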

Now our function in train.py can access these parameters via config. For example, if I use the svc model, this is what I see

>>> print(config.model)
SVC
>>> print(hyperparameters)
{classifier__C: [.05, .12]
classifier__kernel: ['linear', 'poly']
classifier__gamma: [0.1, 1]
classifier__degree: [0, 1, 2]
}

Pretty cool!

(Last Step: Turn the Config File into Parameters for GridSearchCV)

To turn the string ‘SVC’ into a class, use eval

from sklearn.svm import SVC
classifier = eval(config.model)()

Now classifier works like a normal SVC instance!
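
Because eval runs whatever text the config contains, some readers may prefer to look the class up by name instead. The sketch below is one hedged alternative, not the approach used in this article: it assumes the config stores a full dotted path such as sklearn.svm.SVC rather than the bare name SVC, and the helper load_class is hypothetical.

```python
import importlib


def load_class(dotted_path: str):
    """Import and return the class named by a path like 'package.module.ClassName'."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the snippet runs anywhere.
ordered_dict_cls = load_class("collections.OrderedDict")
instance = ordered_dict_cls()
print(type(instance).__name__)
```

With scikit-learn installed, load_class("sklearn.svm.SVC")() would build the classifier the same way.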

To turn the hyperparameters in our config file into a Python dictionary whose values are Python lists, we use
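
As a rough end-to-end sketch of this step, the snippet below converts a hyperparameter mapping into a dictionary of lists and hands it to GridSearchCV. Several things here are assumptions made for the example: a plain dict stands in for config.hyperparameters so the code runs without Hydra, the pipeline has a step named classifier (implied by the classifier__ prefix), the scoring metric is accuracy, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for config.hyperparameters; values may arrive as list-like objects.
hyperparameters = {
    "classifier__C": (0.05, 0.12),
    "classifier__kernel": ("linear", "poly"),
}

# Build a plain dict that maps each parameter name to a list of candidates.
param_grid = {name: list(values) for name, values in hyperparameters.items()}

# The "classifier__" prefix targets the pipeline step named "classifier".
pipeline = Pipeline([("scaler", StandardScaler()), ("classifier", SVC())])

# Toy data, only here to make the sketch runnable.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=3)
search.fit(X, y)
print(search.best_params_)
```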

Now you can freely switch between different models in the terminal like this

python train.py model=random_forest

From now on, if you want to use a new model with different hyperparameters, all you need to do is add a config file for that model and then run

python train.py model=<newmodel>

Your model will run without any changes to the code in train.py!

(But What If I Want to Use a Python Function in the Config File?)

Sometimes, instead of writing values that YAML recognizes, such as the list classifier__C: [.05, .12], we might want to use a Python expression that YAML does not recognize, such as classifier__C: np.logspace(-4, 4, 20)

model: LogisticRegression
hyperparameters:
   classifier__penalty: ['l1', 'l2']
   classifier__C: np.logspace(-4, 4, 20)
   classifier__solver: ['qn', 'lbfgs', 'owl']

Don’t worry! YAML loads np.logspace(-4, 4, 20) as a plain string, but we can work around that by wrapping the string in eval(). This evaluates the string as a Python expression and returns the resulting NumPy array, provided numpy is imported as np in the file that calls eval.

for key, value in param_grid.items():
    if isinstance(value, str):
        param_grid[key] = eval(param_grid[key])

Then turn every value in the dictionary into a Python list so that they are valid parameters for GridSearchCV

param_grid = {ele: (list(param_grid[ele])) for ele in param_grid}
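
The two steps above can be combined into one helper. The sketch below is a hedged variation rather than the article's code: build_param_grid is a hypothetical name, a plain dict stands in for the config section, and eval is given a namespace that exposes only np. That restriction narrows what a config string can reference, but it is not a security boundary, so it should only be used with config files you trust.

```python
import numpy as np


def build_param_grid(hyperparameters: dict) -> dict:
    """Evaluate string values as NumPy expressions and convert every value to a list."""
    allowed_names = {"__builtins__": {}, "np": np}  # only `np` is visible to eval
    param_grid = {}
    for name, value in hyperparameters.items():
        if isinstance(value, str):
            value = eval(value, allowed_names)
        param_grid[name] = list(value)
    return param_grid


# Stand-in for the hyperparameters section of a model config.
raw = {
    "classifier__penalty": ["l1", "l2"],
    "classifier__C": "np.logspace(-4, 4, 20)",
}
grid = build_param_grid(raw)
print(len(grid["classifier__C"]), grid["classifier__penalty"])
```

Unlike the loop above, this version returns a new dictionary and leaves the original mapping untouched.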

(Conclusion)

Congratulations! You have just learned how to use Hydra to experiment with different machine learning models. With this tool, you can keep your code separate from the specifics of your data and models.

You don’t need to go back to your code if you want to change the hyperparameters of your machine learning model. You just need to add one more config file for an additional model and switch between them with

python train.py model=<modelname>

Here is the example project that uses Hydra (hydra.cc) and config files.

I like to write about basic data science concepts and play with different algorithms and data science tools. You could connect with me on LinkedIn and Twitter.

Star this repo if you want to check out the code for all of the articles I have written. Follow me on Medium to stay up to date with my latest data science articles.

Translated from: https://towardsdatascience.com/3-steps-to-improve-your-efficiency-when-hypertuning-ml-models-5a579d57065e
