Application Scenario

  • To develop Spark programs in Jupyter, this post documents the process of configuring a Spark development environment in Jupyter.
  • Many of the blog posts I consulted did not lead to a working Jupyter Spark environment; the steps below are the ones that worked.

Prerequisites

  • Download Spark.
  • Apache Toree has one main goal: provide the foundation for interactive applications to connect and use Apache Spark.
  • Download link
  • Note: the environment used here is
[root@localhost bin]# cat /etc/redhat-release
CentOS Linux release 7.3.1611 (Core)

Installation Commands

  • Online install
  • If the Anaconda bin directory is already on the Linux PATH, run the commands below directly; otherwise add it to PATH, or change into the Anaconda bin directory and use its pip to install and configure Toree.
# your-spark-home : path to the Spark installation
pip install toree
jupyter toree install --spark_home=your-spark-home
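After the install finishes, the kernel registration can be sanity-checked. This is a quick check, not part of the original steps, and it assumes `jupyter` is on the PATH:

```shell
# List the kernels registered with Jupyter; a Toree entry
# (typically named something like "apache_toree_scala")
# should appear after a successful install.
jupyter kernelspec list
```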
  • Offline install
  • Offline installation works from either the GitHub source code or the release tar package.
  • Source install:
/root/anaconda2/bin/python setup.py install
jupyter toree install --spark_home=your-spark-home

Test Code

  • Verify that the environment is set up correctly
import org.apache.spark.sql.SparkSession

object sparkSqlDemo {
  val sparkSession = SparkSession.builder()
    .master("local[1]")
    .appName("spark session example")
    .getOrCreate()

  def main(args: Array[String]) {
    val input = sparkSession.read.json("cars1.json")
    input.createOrReplaceTempView("Cars1")
    val result = sparkSession.sql("select * from Cars1")
    result.show()
  }
}

sparkSqlDemo.main(Array()) // invoke the method
  • Execution result


Extension: Installing Multiple Kernels

  • Installing Multiple Kernels
  • Options
  • --interpreters=<Unicode> (ToreeInstall.interpreters)
    Default: 'Scala'
    A comma separated list of the interpreters to install. The names of the
    interpreters are case sensitive.
jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL
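After installing multiple interpreters, each one registers as a separate kernel. A sketch of checking and using them, again assuming `jupyter` is on the PATH (the exact kernel names depend on the Toree version):

```shell
# Each interpreter typically becomes its own kernel, e.g.
# apache_toree_scala, apache_toree_pyspark,
# apache_toree_sparkr, apache_toree_sql.
jupyter kernelspec list

# Start the notebook server and select the desired Toree
# kernel when creating a new notebook.
jupyter notebook
```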
