Spark Ecosystem Examples: The Python Ecosystem

Reposted

mob64ca141834d3 2024-08-15 14:15:31

Tags: Spark ecosystem examples, programming language, Python, machine learning, artificial intelligence. Category: Spark, big data

(Machine Learning with Python - Ecosystem)

(An Introduction to Python)

Python is a popular object-oriented programming language with the capabilities of a high-level language. Its easy-to-learn syntax and its portability account for much of its popularity. The following facts introduce Python:

Python was developed by Guido van Rossum at Stichting Mathematisch Centrum in the Netherlands.
It was written as the successor of a programming language named ABC.
Its first version was released in 1991.
Guido van Rossum took the name Python from the TV show Monty Python's Flying Circus.
It is an open source programming language, which means that we can freely download it and use it to develop programs. It can be downloaded from www.python.org.
Python combines features of both C and Java: it offers concise, C-like code and also provides classes and objects for object-oriented programming, as Java does.
It is an interpreted language: the source code of a Python program is first converted into bytecode, which is then executed by the Python virtual machine.
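The bytecode step mentioned in the last point can be inspected with the standard library's dis module. The following is a minimal sketch, assuming the CPython interpreter. The function add is a made-up example, and the exact opcode names differ between Python versions.

```python
import dis


def add(a, b):
    """Hypothetical example function whose bytecode we inspect."""
    return a + b


# Collect the names of the bytecode instructions the compiler produced.
opnames = [instruction.opname for instruction in dis.get_instructions(add)]
print(opnames)

# Print a human-readable disassembly of the same function.
dis.dis(add)
```

The printed instructions are what the Python virtual machine executes, not the source text itself.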

(Strengths and Weaknesses of Python)

Every programming language has strengths and weaknesses, and Python is no exception.

(Strengths)

According to studies and surveys, Python is the fifth most important programming language overall and the most popular language for machine learning and data science. It owes this to the following strengths:

Easy to learn and understand
Multi-purpose language
Huge number of modules
Support from the open source community
Scalability

(Weakness)

Although Python is a popular and powerful programming language, it has one notable weakness: slow execution speed.

Python executes more slowly than compiled languages because it is an interpreted language. This is a major area of improvement for the Python community.
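The cost of interpretation can be observed by timing an explicit Python loop against the built-in sum, which does the same work inside the interpreter's C implementation. This is only a rough sketch: loop_sum is a made-up helper, and the timings depend on the machine and the Python version.

```python
import timeit


def loop_sum(n):
    """Add the integers 0..n-1 with an explicit Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total


n = 100_000

# Both approaches compute the same value.
assert loop_sum(n) == sum(range(n))

# The loop executes bytecode for every iteration; sum() iterates in C.
loop_time = timeit.timeit(lambda: loop_sum(n), number=20)
builtin_time = timeit.timeit(lambda: sum(range(n)), number=20)
print(f"explicit loop: {loop_time:.4f} s, built-in sum: {builtin_time:.4f} s")
```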

(Installing Python)

Before working in Python, we must first install it. Python can be installed in either of the following two ways:

Installing Python individually
Using a pre-packaged Python distribution: Anaconda

Let us discuss each of these in detail.

(Installing Python Individually)

If you want to install Python on your computer, you only need to download the binary code applicable to your platform. Python distributions are available for Windows, Linux and Mac platforms.

The following is a quick overview of installing Python on these platforms:

On Unix and Linux platforms

We can install Python on Unix and Linux platforms with the following steps:

First, go to www.python.org/downloads/.
Next, click the link to download the zipped source code available for Unix/Linux.
Now, download and extract the files.
Next, edit the Modules/Setup file if you want to customize some options.
Finally, run the following commands:

$ ./configure
$ make
$ make install

On Windows platform

We can install Python on the Windows platform with the following steps:

First, go to www.python.org/downloads/.
Next, click the link for the Windows installer file python-XYZ.msi, where XYZ is the version we wish to install.
Now, run the downloaded file. It opens the Python install wizard, which is easy to use. Accept the default settings and wait until the installation finishes.

On Macintosh platform

For macOS, Homebrew, an easy-to-use package manager, is recommended for installing Python 3. If you don't have Homebrew, you can install it with the following command:

$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Homebrew can be updated with the command below:

$ brew update

Now, to install Python 3 on your system, run the following command:

$ brew install python3

(Using Pre-packaged Python Distribution: Anaconda)

Anaconda is a packaged distribution of Python that includes the libraries most widely used in data science. We can set up a Python environment with Anaconda using the following steps:

Step 1: First, download the required installation package from the Anaconda distribution page at www.anaconda.com/distribution/. You can choose between Windows, Mac and Linux according to your requirements.
Step 2: Next, select the Python version you want to install on your machine. At the time of writing, the latest Python version was 3.7. You will get options for both 64-bit and 32-bit graphical installers.
Step 3: After you select the OS and Python version, the Anaconda installer is downloaded to your computer. Double-click the file and the installer will install the Anaconda packages.
Step 4: To check whether it is installed, open a command prompt and type python. The interpreter should start and print its version banner.
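The check in Step 4 can also be done from inside the interpreter. The following is a small sketch using only the standard library. Whether the printed path points into an Anaconda directory depends on how and where the distribution was installed.

```python
import platform
import sys

# The running interpreter reports its own version and location.
print("Python version:", platform.python_version())
print("Interpreter path:", sys.executable)

# This tutorial assumes Python 3.
version_ok = sys.version_info >= (3, 0)
print("Python 3 available:", version_ok)
```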

A detailed video lecture on this is also available at www.tutorialspoint.com/python_essentials_online_training/getting_started_with_anaconda.asp.

(Why Python for Data Science?)

Python is the fifth most important programming language overall and the most popular language for machine learning and data science. The following features make it the preferred language for data science:

(Extensive set of packages)

Python has an extensive and powerful set of packages that are ready to be used in various domains. These include packages such as numpy, scipy, pandas and scikit-learn, which are required for machine learning and data science.

(Easy prototyping)

Another important feature that makes Python the language of choice for data science is easy and fast prototyping, which is useful when developing new algorithms.

(Collaboration feature)

Data science depends on good collaboration, and Python provides many useful tools that support it.

(One language for many domains)

A typical data science project spans various domains such as data extraction, data manipulation, data analysis, feature extraction, modelling, evaluation, deployment and updating the solution. Because Python is a multi-purpose language, it allows the data scientist to address all these domains from a common platform.

(Components of Python ML Ecosystem)

In this section, let us discuss some core data science libraries that form the components of the Python machine learning ecosystem. These components make Python an important language for data science. There are many of them, so we cover only some of the important ones here:

(Jupyter Notebook)

Jupyter notebooks provide an interactive computational environment for developing Python-based data science applications. They were formerly known as IPython notebooks. The following features make Jupyter notebooks one of the best components of the Python ML ecosystem:

Jupyter notebooks can illustrate the analysis process step by step by arranging code, images, text, output and so on in sequence.
They help a data scientist document the thought process while developing the analysis.
The results can be captured as part of the notebook.
With Jupyter notebooks, we can also share our work with peers.

(Installation and Execution)

If you are using the Anaconda distribution, you need not install Jupyter Notebook separately, as it is already included. Just open Anaconda Prompt and type the following command:

C:\>jupyter notebook

After you press Enter, a notebook server starts at localhost:8888 on your computer.

Now, click the New tab to get a list of options. Select Python 3 to open a new notebook and start working in it.

On the other hand, if you are using the standard Python distribution, Jupyter Notebook can be installed using the popular Python package installer, pip.

pip install jupyter

(Types of Cells in Jupyter Notebook)

The following are the three types of cells in a Jupyter notebook:

Code cells
Markdown cells
Raw cells
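A notebook file is a JSON document in which each cell records its type. The following is a minimal sketch of that structure, assuming the nbformat version 4 layout. The cell contents are made-up placeholders and nothing is written to disk.

```python
import json

# Minimal notebook structure following the nbformat 4 layout (assumption).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        # Markdown cell: formatted text that documents the analysis.
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis notes"]},
        # Code cell: executable code plus the outputs it produced.
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('hello')"]},
        # Raw cell: content that is neither executed nor rendered.
        {"cell_type": "raw", "metadata": {}, "source": ["kept as-is"]},
    ],
}

cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)

# Serialize the structure the way an .ipynb file stores it.
text = json.dumps(notebook, indent=1)
```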

For a more detailed study of Jupyter Notebook, see www.tutorialspoint.com/jupyter/index.htm.

(NumPy)

NumPy is another useful component that makes Python one of the favorite languages for data science. The name stands for Numerical Python, and the library is built around multidimensional array objects. Using NumPy, we can perform the following important operations:

Mathematical and logical operations on arrays.
Fourier transforms.
Operations associated with linear algebra.
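The three kinds of operations listed above can be sketched as follows. The arrays are arbitrary illustrative values, and the example assumes NumPy is installed.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 3.0, 2.0, 1.0])

# Mathematical and logical operations work element by element.
total = a + b
mask = np.logical_and(a > 1, b > 1)

# Fourier transform: a constant signal has all its energy in bin 0.
spectrum = np.fft.fft(np.ones(4))

# Linear algebra: solve the system M @ x = y for x.
M = np.array([[2.0, 0.0], [0.0, 4.0]])
y = np.array([2.0, 8.0])
x = np.linalg.solve(M, y)

print(total, mask, spectrum, x, sep="\n")
```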

NumPy can also serve as a replacement for MATLAB, because it is mostly used together with SciPy (Scientific Python) and Matplotlib (a plotting library).

(Installation and Execution)

If you are using the Anaconda distribution, there is no need to install NumPy separately, as it is already included. You just need to import the package into your Python script as follows:

import numpy as np

On the other hand, if you are using the standard Python distribution, NumPy can be installed using the popular Python package installer, pip.

pip install numpy

For a more detailed study of NumPy, see www.tutorialspoint.com/numpy/index.htm.

(Pandas)

Pandas is another useful library that makes Python one of the favorite languages for data science. It is mainly used for data manipulation, wrangling and analysis. It was developed by Wes McKinney in 2008. With Pandas, data processing can be carried out in the following five steps:

Load
Prepare
Manipulate
Model
Analyze
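The five steps above can be sketched on a tiny in-memory dataset. The names and scores below are made up for illustration, and the "model" step is only a per-group average standing in for a real model. The example assumes pandas is installed.

```python
import io

import pandas as pd

# Hypothetical CSV data; one score is missing on purpose.
csv_text = """name,age,score
Aarav,15,82
Harshit,14,
Kanika,16,91
Mayank,15,77
"""

# Load: read tabular data into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Prepare: fill the missing score with the column mean.
df["score"] = df["score"].fillna(df["score"].mean())

# Manipulate: derive a new column from an existing one.
df["passed"] = df["score"] >= 80

# Model: a per-age average as a very simple stand-in for a model.
by_age = df.groupby("age")["score"].mean()

# Analyze: summary statistics of the numeric columns.
print(df.describe())
print(by_age)
```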

(Data representation in Pandas)

Data in Pandas is represented with the help of the following three data structures (the third, Panel, has since been removed from pandas):

Series

1	5	10	15	24	25	28	36	40	89

Data frame

Name	Roll number	Age	Gender
Aarav	1	15	Male
Harshit	2	14	Male
Kanika	3	16	Female
Mayank	4	15	Male

Panel (deprecated in pandas 0.20 and removed in pandas 0.25)

The following table gives the dimension and description of the data structures mentioned above:

Data Structure	Dimension	Description
Series	1-D	Size-immutable, 1-D homogeneous data
DataFrame	2-D	Size-mutable, heterogeneous data in tabular form
Panel	3-D	Size-mutable array, container of DataFrames (removed in pandas 0.25)

These data structures are related by containment: each higher-dimensional structure is a container of the lower-dimensional one.
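The containment relationship can be checked directly: selecting one column of a DataFrame yields a Series. The sketch below rebuilds the sample Series and data frame shown earlier and assumes pandas is installed.

```python
import pandas as pd

# Series: one-dimensional labelled data.
s = pd.Series([1, 5, 10, 15, 24, 25, 28, 36, 40, 89])

# DataFrame: two-dimensional table whose columns are Series.
df = pd.DataFrame({
    "Name": ["Aarav", "Harshit", "Kanika", "Mayank"],
    "Roll number": [1, 2, 3, 4],
    "Age": [15, 14, 16, 15],
    "Gender": ["Male", "Male", "Female", "Male"],
})

# Selecting a single column returns the contained Series.
age = df["Age"]
print(type(age).__name__)
print(s.ndim, df.ndim)
```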

(Installation and Execution)

If you are using the Anaconda distribution, there is no need to install Pandas separately, as it is already included. You just need to import the package into your Python script as follows:

import pandas as pd

On the other hand, if you are using the standard Python distribution, Pandas can be installed using the popular Python package installer, pip.

pip install pandas

After installing Pandas, you can import it into your Python script as shown above.

(Example)

The following is an example of creating a Series from an ndarray by using Pandas:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: data = np.array(['g','a','u','r','a','v'])

In [4]: s = pd.Series(data)

In [5]: print(s)
0    g
1    a
2    u
3    r
4    a
5    v
dtype: object

For a more detailed study of Pandas, see www.tutorialspoint.com/python_pandas/index.htm.

(Scikit-learn)

Scikit-learn is another useful and very important Python library for data science and machine learning. The following features make it so useful:

It is built on NumPy, SciPy, and Matplotlib.
It is open source and can be reused under the BSD license.
It is accessible to everybody and can be reused in various contexts.
It implements a wide range of machine learning algorithms covering major areas of ML such as classification, clustering, regression, dimensionality reduction and model selection.

(Installation and Execution)

If you are using the Anaconda distribution, there is no need to install Scikit-learn separately, as it is already included. You just need to import the package into your Python script. For example, the following line imports the breast cancer dataset from Scikit-learn:

from sklearn.datasets import load_breast_cancer
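Once imported, the loader can be called to obtain the data and a simple classifier can be fitted on it. This is only a sketch, assuming scikit-learn is installed. The choice of a scaled logistic regression, the 25% test split and the random seed are illustrative assumptions, not part of the original tutorial.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the bundled dataset: feature matrix X and class labels y.
data = load_breast_cancer()
X, y = data.data, data.target
print(X.shape)
print(list(data.target_names))

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit a logistic regression classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```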

On the other hand, if you are using the standard Python distribution and already have NumPy and SciPy, Scikit-learn can be installed using the popular Python package installer, pip.