Spark-SQL-core
@(spark)[sql|execution]
The whole point of spark-sql is to translate SQL statements into calls on the Spark API. The overall flow is analyzed in the SQLContext section.
SQLContext
/**
* The entry point for working with structured data (rows and columns) in Spark. Allows the
* creation of [[DataFrame]] objects as well as the execution of SQL queries.
*
*/
class SQLContext(@transient val sparkContext: SparkContext)
extends org.apache.spark.Logging
with Serializable {
SQLContext does not inherit from SparkContext; it is a standalone class.
It covers the following areas (from its Scaladoc group names):
* @groupname basic Basic Operations
* @groupname ddl_ops Persistent Catalog DDL
* @groupname cachemgmt Cached Table Management
* @groupname genericdata Generic Data Sources
* @groupname specificdata Specific Data Sources
* @groupname config Configuration
* @groupname dataframes Custom DataFrame Creation
* @groupname Ungrouped Support functions for language integrated queries.
SQLContext also has many members, such as catalog, ddlParser, sqlParser, optimizer, and so on; these are what actually do the work of processing SQL.
The result of parsing is put into a LogicalPlan.
Once the LogicalPlan has been obtained, the core logic lives in:
/**
* :: DeveloperApi ::
* The primary workflow for executing relational queries using Spark. Designed to allow easy
* access to the intermediate phases of query execution for developers.
*/
@DeveloperApi
protected[sql] class QueryExecution(val logical: LogicalPlan) {
def assertAnalyzed(): Unit = analyzer.checkAnalysis(analyzed)
// resolve references and apply simple, mechanical rewrites
lazy val analyzed: LogicalPlan = analyzer(logical)
// substitute in cached data
lazy val withCachedData: LogicalPlan = {
assertAnalyzed()
cacheManager.useCachedData(analyzed)
}
// query optimization
lazy val optimizedPlan: LogicalPlan = optimizer(withCachedData)
// TODO: Don't just pick the first one...
// generate the Spark plan, i.e. convert the LogicalPlan into a physical SparkPlan
// a traditional RDBMS would apply cost-based optimization here; that does not seem to exist yet
lazy val sparkPlan: SparkPlan = {
SparkPlan.currentContext.set(self)
planner(optimizedPlan).next()
}
// executedPlan should not be used to initialize any SparkPlan. It should be
// only used for execution.
// this is where shuffles get inserted
lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)
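To see where each phase begins and ends from user code, here is a hedged sketch; it assumes a SQLContext named sqlContext with a table people registered, and since QueryExecution is protected[sql], inspecting its fields like this strictly only compiles from code in the org.apache.spark.sql package, whereas df.explain(true) prints the same phases anywhere:
```
val df = sqlContext.sql("SELECT name FROM people WHERE age > 21")

val qe = df.queryExecution     // the QueryExecution described above
println(qe.logical)            // parsed, unresolved LogicalPlan
println(qe.analyzed)           // after the analyzer
println(qe.optimizedPlan)      // after the optimizer batches
println(qe.executedPlan)       // SparkPlan with exchanges inserted

// Equivalent, user-facing way to print all phases at once:
df.explain(true)
```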
A few points in more detail:
- parser: ddlParser is tried first; if it yields nothing, sqlParser parses the statement
protected[sql] def parseSql(sql: String): LogicalPlan = {
ddlParser(sql, false).getOrElse(sqlParser(sql))
}
- analyzer
protected[sql] lazy val analyzer: Analyzer =
new Analyzer(catalog, functionRegistry, caseSensitive = true) {
override val extendedResolutionRules =
ExtractPythonUdfs ::
sources.PreInsertCastAndRename ::
Nil
override val extendedCheckRules = Seq(
sources.PreWriteCheck(catalog)
)
}
- Optimizer = DefaultOptimizer
object DefaultOptimizer extends Optimizer {
val batches =
// SubQueries are only needed for analysis and can be removed before execution.
Batch("Remove SubQueries", FixedPoint(100),
EliminateSubQueries) ::
Batch("Combine Limits", FixedPoint(100),
CombineLimits) ::
Batch("ConstantFolding", FixedPoint(100),
NullPropagation,
ConstantFolding,
LikeSimplification,
BooleanSimplification,
SimplifyFilters,
SimplifyCasts,
SimplifyCaseConversionExpressions,
OptimizeIn) ::
Batch("Decimal Optimizations", FixedPoint(100),
DecimalAggregates) ::
Batch("Filter Pushdown", FixedPoint(100),
UnionPushdown,
CombineFilters,
PushPredicateThroughProject,
PushPredicateThroughJoin,
PushPredicateThroughGenerate,
ColumnPruning) ::
Batch("LocalRelation", FixedPoint(100),
ConvertToLocalRelation) :: Nil
}
- planner
protected[sql] class SparkPlanner extends SparkStrategies {
val sparkContext: SparkContext = self.sparkContext
def strategies: Seq[Strategy] =
experimental.extraStrategies ++ (
DataSourceStrategy ::
DDLStrategy ::
TakeOrdered ::
HashAggregation ::
LeftSemiJoin ::
HashJoin ::
InMemoryScans ::
ParquetOperations ::
BasicOperators ::
CartesianProduct ::
BroadcastNestedLoopJoin :: Nil)
- prepareForExecution, where AddExchange's job is to insert the necessary shuffles
/** Prepares a planned SparkPlan for execution by inserting shuffle operations as needed.
*/
@transient
protected[sql] val prepareForExecution = new RuleExecutor[SparkPlan] {
val batches =
Batch("Add exchange", Once, AddExchange(self)) :: Nil
}
The optimizer, planner, prepareForExecution, etc. above are all rule engines. Applying a rule engine amounts to traversing the whole LogicalPlan tree in some order (pre-order or post-order) and applying concrete rules at individual plan nodes.
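For illustration, a minimal sketch of what one of these rules looks like; it restates the idea of CombineFilters from the batch list above, not the actual Catalyst source:
```
import org.apache.spark.sql.catalyst.expressions.And
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Collapse two adjacent Filter nodes into one. `transform` walks the plan
// tree and applies the partial function wherever the pattern matches.
object CombineFiltersSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Filter(c1, Filter(c2, grandChild)) =>
      Filter(And(c1, c2), grandChild)
  }
}
```
A RuleExecutor simply runs a Batch of such rules over and over until the plan stops changing or the FixedPoint iteration limit (100 above) is reached.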
Integration with Spark Core
The final result is a SparkPlan:
1. abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable {
   Its most important method is def execute(): RDD[Row].
1. The traits that extend it are LeafNode, UnaryNode, and BinaryNode.
   All of the physical plans generated above extend these interfaces, which means a physical plan amounts to applying its logic to a series of RDD[Row]s.
1. Invoking execute on the root node of the physical plan tree invokes execute on the whole tree.
The resulting physical plan tree can therefore be thought of as several fragments of Spark Core code. Loosely speaking, it is like generating the following:
val input = sc.textFile(...)
// Filter
var rows = input.filter(...)
// Project
rows = rows.map(...)
// Join, Exchange
rows = rows.repartition(...)
// Project
rows = rows.map(...)
Summary
- The whole point of spark-sql is to translate SQL statements into calls on the Spark API.
- A DataFrame wraps data plus a schema; it is closer to the notion of a logical table.
Source
A set of APIs for adding data sources to Spark SQL.
interfaces
A pile of interfaces; the important ones are described below.
/**
* ::DeveloperApi::
* Implemented by objects that produce relations for a specific kind of data source. When
* Spark SQL is given a DDL operation with a USING clause specified (to specify the implemented
* RelationProvider), this interface is used to pass in the parameters specified by a user.
*
* Users may specify the fully qualified class name of a given data source. When that class is
* not found Spark SQL will append the class name `DefaultSource` to the path, allowing for
* less verbose invocation. For example, 'org.apache.spark.sql.json' would resolve to the
* data source 'org.apache.spark.sql.json.DefaultSource'
*
* A new instance of this class will be instantiated each time a DDL call is made.
*/
@DeveloperApi
trait RelationProvider {
/**
* Returns a new base relation with the given parameters.
* Note: the parameters' keywords are case insensitive and this insensitivity is enforced
* by the Map that is passed to the function.
*/
def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation
}
This is how a relation is obtained.
/**
* ::DeveloperApi::
* Represents a collection of tuples with a known schema. Classes that extend BaseRelation must
* be able to produce the schema of their data in the form of a [[StructType]]. Concrete
* implementation should inherit from one of the descendant `Scan` classes, which define various
* abstract methods for execution.
*
* BaseRelations must also define an equality function that only returns true when the two
* instances will return the same data. This equality function is used when determining when
* it is safe to substitute cached results for a given relation.
*/
@DeveloperApi
abstract class BaseRelation {
def sqlContext: SQLContext
def schema: StructType
/**
* Returns an estimated size of this relation in bytes. This information is used by the planner
* to decide when it is safe to broadcast a relation and can be overridden by sources that
* know the size ahead of time. By default, the system will assume that tables are too
* large to broadcast. This method will be called multiple times during query planning
* and thus should not perform expensive operations for each invocation.
*
* Note that it is always better to overestimate size than underestimate, because underestimation
* could lead to execution plans that are suboptimal (i.e. broadcasting a very large table).
*/
def sizeInBytes: Long = sqlContext.conf.defaultSizeInBytes
}
This is the abstraction of a relation.
There are also several XXXScan traits (TableScan, PrunedScan, PrunedFilteredScan, etc.) that define how rows are actually produced.
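To show how these interfaces fit together, here is a hedged sketch of a tiny read-only data source. The names DefaultSource and RangeRelation and the parameter n are invented for the example; only the traits and signatures come from the sources package:
```
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Resolved by name when a DDL statement says USING <package> (see the
// RelationProvider scaladoc above).
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    RangeRelation(parameters("n").toInt)(sqlContext)
}

// A relation with a single integer column "i" holding 0 until n.
case class RangeRelation(n: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {
  override def schema: StructType = StructType(StructField("i", IntegerType) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until n).map(Row(_))
}
```
Assuming these classes live in a package such as com.example.range, the source could then be loaded with sqlContext.load("com.example.range", Map("n" -> "10")).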
DataSourceStrategy
As I understand it, this is essentially a translation step: it ties the logical plan to the BaseRelation.
Filter
A filter predicate for data sources.
It includes the common comparison operations such as >, =, <, and so on.
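For example, a data source that pushes filters down to a SQL-speaking backend might compile them roughly like this (a sketch that handles only a few filter types and ignores quoting; the JDBC source described later does something along these lines):
```
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan, LessThan}

// Turn a pushed-down Filter into a WHERE-clause fragment when we know how;
// None means Spark itself will evaluate that predicate after the scan.
def compileFilter(f: Filter): Option[String] = f match {
  case EqualTo(attr, value)     => Some(s"$attr = '$value'")
  case GreaterThan(attr, value) => Some(s"$attr > '$value'")
  case LessThan(attr, value)    => Some(s"$attr < '$value'")
  case _                        => None
}
```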
ddl
This package also contains a parser. Unlike the SQL parser, it handles DDL statements, of which there are currently only a few:
Parser[LogicalPlan] = createTable | describeTable | refreshTable
rules
主要有两个rule
/**
* A rule to do pre-insert data type casting and field renaming. Before we insert into
* an [[InsertableRelation]], we will use this rule to make sure that
* the columns to be inserted have the correct data type and fields have the correct names.
*/
private[sql] object PreInsertCastAndRename extends Rule[LogicalPlan] {
/**
* A rule to do various checks before inserting into or writing to a data source table.
*/
private[sql] case class PreWriteCheck(catalog: Catalog) extends (LogicalPlan => Unit) {
JDBC
JDBC-related code; nothing too special.
JDBCRDD
This is what actually queries the data.
JDBCRelation
jdbc
/**
 * Saves a partition of a DataFrame to the JDBC database. This is done in
* a single database transaction in order to avoid repeatedly inserting
* data as much as possible.
*
* It is still theoretically possible for rows in a DataFrame to be
* inserted into the database more than once if a stage somehow fails after
* the commit occurs but before the stage can return successfully.
*
* This is not a closure inside saveTable() because apparently cosmetic
* implementation changes elsewhere might easily render such a closure
* non-Serializable. Instead, we explicitly close over all variables that
* are used.
*/
def savePartition(url: String, table: String, iterator: Iterator[Row],
rddSchema: StructType, nullTypes: Array[Int]): Iterator[Byte] = {
DriverQuirks
A "quirk" is an idiosyncrasy; this class handles the places where a specific database deviates from the JDBC standard.
/**
* Encapsulates workarounds for the extensions, quirks, and bugs in various
* databases. Lots of databases define types that aren't explicitly supported
* by the JDBC spec. Some JDBC drivers also report inaccurate
* information---for instance, BIT(n>1) being reported as a BIT type is quite
* common, even though BIT in JDBC is meant for single-bit values. Also, there
* does not appear to be a standard name for an unbounded string or binary
* type; we use BLOB and CLOB by default but override with database-specific
* alternatives when these are absent or do not behave correctly.
*
* Currently, the only thing DriverQuirks does is handle type mapping.
* `getCatalystType` is used when reading from a JDBC table and `getJDBCType`
* is used when writing to a JDBC table. If `getCatalystType` returns `null`,
* the default type handling is used for the given JDBC type. Similarly,
* if `getJDBCType` returns `(null, None)`, the default type handling is used
* for the given Catalyst type.
*/
private[sql] abstract class DriverQuirks {
def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): DataType
def getJDBCType(dt: DataType): (String, Option[Int])
}
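To make the two hooks concrete, here is a hedged sketch of what a quirks implementation for a MySQL-like driver might look like. Spark ships its own quirks classes, so this is illustrative only, and since DriverQuirks is private[sql] such a class would have to live in the same package:
```
import java.sql.Types
import org.apache.spark.sql.types.{DataType, LongType, MetadataBuilder}

private[sql] class MySqlLikeQuirks extends DriverQuirks {
  def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): DataType =
    if (sqlType == Types.VARBINARY && typeName == "BIT" && size != 1) {
      LongType // e.g. BIT(n>1) reported as VARBINARY: treat it as a long
    } else {
      null // null means: fall back to the default JDBC -> Catalyst mapping
    }

  def getJDBCType(dt: DataType): (String, Option[Int]) =
    (null, None) // (null, None) means: use the default Catalyst -> JDBC mapping
}
```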
JSON
Not much to say here: it handles JSON (going after MongoDB's lunch). It does not seem to support partitioning yet.
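Typical usage is just another data source call, e.g. (the path is a placeholder for newline-delimited JSON records):
```
val people = sqlContext.jsonFile("/path/to/people.json") // schema is inferred
people.printSchema()
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```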
SparkSQLParser
/**
* The top level Spark SQL parser. This parser recognizes syntaxes that are available for all SQL
* dialects supported by Spark SQL, and delegates all the other syntaxes to the `fallback` parser.
*
* @param fallback A function that parses an input string to a logical plan
*/
private[sql] class SparkSQLParser(fallback: String => LogicalPlan) extends AbstractSparkSQLParser {
The statements it handles itself include:
override protected lazy val start: Parser[LogicalPlan] = cache | uncache | set | show | others
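Concretely, these are statements that SparkSQLParser handles itself instead of delegating to the fallback parser, e.g. (assuming a registered table people):
```
sqlContext.sql("CACHE TABLE people")
sqlContext.sql("SET spark.sql.shuffle.partitions=10")
sqlContext.sql("SHOW TABLES")
sqlContext.sql("UNCACHE TABLE people")
```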
Column
A column in a [[DataFrame]]
DataFrame
A distributed collection of data organized into named columns.
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
A DataFrame is equivalent to a relational table in Spark SQL. There are multiple ways to create a DataFrame:
// Create a DataFrame from Parquet files
val people = sqlContext.parquetFile("...")
// Create a DataFrame from data sources
val df = sqlContext.load("...", "json")
Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.
To select a column from the DataFrame, use the apply method in Scala and col in Java.
val ageCol = people("age") // in Scala
Column ageCol = people.col("age") // in Java
Its methods basically fall into two categories:
1. Eager: they perform the real computation immediately, e.g. override def take(n: Int): Array[Row] = head(n)
2. Lazy: they do not execute immediately, e.g.:
def join(right: DataFrame): DataFrame = {
Join(logicalPlan, right.logicalPlan, joinType = Inner, None)
}
This does not execute right away; it only builds a LogicalPlan, which is executed together with the rest of the plan once a category-1 method is called.
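A small usage sketch (the DataFrames people and dept and their columns are assumed for illustration):
```
val adults = people.filter(people("age") > 21) // lazy: only extends the logical plan
val joined = adults.join(dept)                 // still lazy
val firstTen = joined.take(10)                 // eager: triggers actual execution
```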
DataFrameNaFunctions
/**
* :: Experimental ::
* Functionality for working with missing data in [[DataFrame]]s.
*/
@Experimental
final class DataFrameNaFunctions private[sql](df: DataFrame) {
functions
It really is just a collection of functions (col, lit, the aggregate functions, and so on).
parquet
todo
There is quite a bit of content here; in my view Spark and Parquet are, in principle, a good match.
CacheManager
Nothing special: just create/read/update/delete operations on the cache.
/**
* Provides support in a SQLContext for caching query results and automatically using these cached
* results when subsequent queries are executed. Data is cached using byte buffers stored in an
* InMemoryRelation. This relation is automatically substituted into query plans that return the
* `sameResult` as the originally cached query.
*
* Internal to Spark SQL.
*/
private[sql] class CacheManager(sqlContext: SQLContext) extends Logging {
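It sits behind the user-facing caching APIs, for example:
```
sqlContext.cacheTable("people")   // later plans over "people" resolve to the InMemoryRelation
sqlContext.isCached("people")     // true
sqlContext.uncacheTable("people")
```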
execution
/**
* :: DeveloperApi ::
* An execution engine for relational query plans that runs on top of Spark and returns RDDs.
*
* Note that the operators in this package are created automatically by a query planner using a
* [[SQLContext]] and are not intended to be used directly by end users of Spark SQL. They are
* documented here in order to make it easier for others to understand the performance
* characteristics of query plans that are generated by Spark SQL.
*/
package object execution
SparkPlan
@DeveloperApi
abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable {
Aggregate
Handles aggregation. Note its execute method: the empty group-by case and the non-empty group-by case are computed separately.
Essentially this is a hash-based grouping approach.
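A language-level sketch of the idea; the real operator works on Catalyst expressions and mutable rows, and newBuffer/update here are stand-ins for the aggregate functions:
```
import scala.collection.mutable

// Group the rows of one partition by key and fold each group into a buffer.
def hashAggregate[R, K, B](rows: Iterator[R])(
    key: R => K, newBuffer: () => B, update: (B, R) => B): Iterator[(K, B)] = {
  val table = mutable.HashMap.empty[K, B]
  for (row <- rows) {
    val k = key(row)
    table(k) = update(table.getOrElseUpdate(k, newBuffer()), row)
  }
  table.iterator
}

// e.g. COUNT(*) grouped by the first column of a Seq[Any]-shaped row:
// hashAggregate(rows)(_.head, () => 0L, (c: Long, _: Any) => c + 1)
```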
GeneratedAggregate
/**
* :: DeveloperApi ::
* Alternate version of aggregation that leverages projection and thus code generation.
* Aggregations are converted into a set of projections from an aggregation buffer tuple back onto
* itself. Currently only simple aggregations like SUM, COUNT, or AVERAGE are supported.
*
* @param partial if true then aggregation is done partially on local data without shuffling to
* ensure all values where `groupingExpressions` are equal are present.
* @param groupingExpressions expressions that are evaluated to determine grouping.
* @param aggregateExpressions expressions that are computed for each group.
* @param child the input data source.
*/
@DeveloperApi
case class GeneratedAggregate(
partial: Boolean,
groupingExpressions: Seq[Expression],
aggregateExpressions: Seq[NamedExpression],
child: SparkPlan)
extends UnaryNode {
commands
All kinds of commands, such as SET, EXPLAIN, DESCRIBE, and so on.
basicOperators
Project, Filter, Sort, and so on.
Note that joins are not here; they live in the joins package.
Exchange
This is a very important operator: its job is to change how the data is distributed (i.e. repartition it).
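Conceptually, for hash partitioning it does something like this sketch in plain Spark Core terms; keyOf stands in for evaluating the partitioning expressions, and the real operator uses mutable projections plus a custom serializer:
```
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Re-distribute rows so that equal keys end up in the same partition.
def hashExchange(child: RDD[Row], numPartitions: Int)(keyOf: Row => Any): RDD[Row] =
  child.map(row => (keyOf(row), row))
    .partitionBy(new HashPartitioner(numPartitions))
    .map(_._2)
```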
ExistingRDD
Builds a relation from an existing RDD.
Expand
/**
* Apply all of the GroupExpressions to every input row, hence we will get
* multiple output rows for an input row.
* @param projections The group of expressions, all of the group expressions should
* output the same schema specified by the parameter `output`
* @param output The output Schema
* @param child Child operator
*/
@DeveloperApi
case class Expand(
Generate
/**
* :: DeveloperApi ::
* Applies a [[catalyst.expressions.Generator Generator]] to a stream of input rows, combining the
* output of each into a new stream of rows. This operation is similar to a `flatMap` in functional
* programming with one important additional feature, which allows the input rows to be joined with
* their output.
* @param join when true, each output row is implicitly joined with the input tuple that produced
* it.
* @param outer when true, each input row will be output at least once, even if the output of the
* given `generator` is empty. `outer` has no effect when `join` is false.
*/
@DeveloperApi
case class Generate(
generator: Generator,
join: Boolean,
outer: Boolean,
child: SparkPlan)
extends UnaryNode {
LocalTableScan
/**
* Physical plan node for scanning data from a local collection.
*/
case class LocalTableScan(output: Seq[Attribute], rows: Seq[Row]) extends LeafNode {
SparkSqlSerializer
private[sql] class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
SparkStrategies
The planning logic: the strategies (listed earlier in SparkPlanner) that turn logical plans into physical plans.
joins
Join support is functionally fairly complete: left joins, semi joins, and outer joins are basically all there, in their various combinations. The selection logic is roughly:
1. If both sides already have the same (co-partitioned) distribution, the join can be done directly.
2. Otherwise, if one table is small enough, it is broadcast.
3. Failing that, a shuffle is required.
/**
* :: DeveloperApi ::
* Physical execution operators for join operations.
*/
package object joins {
HashedRelation
/**
* Interface for a hashed relation by some key. Use [[HashedRelation.apply]] to create a concrete
* object.
*/
private[joins] sealed trait HashedRelation {
def get(key: Row): CompactBuffer[Row]
}
The main implementations are GeneralHashedRelation and UniqueKeyHashedRelation.
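The build-and-probe pattern that all of these share, sketched with plain Scala collections instead of the real HashedRelation/CompactBuffer:
```
import org.apache.spark.sql.Row

// Hash the (usually smaller) build side by its join key, then stream the
// other side through and look up matches.
def hashJoin(buildSide: Iterable[Row], streamSide: Iterator[Row])(
    buildKey: Row => Row, streamKey: Row => Row): Iterator[(Row, Row)] = {
  val hashed: Map[Row, Seq[Row]] = buildSide.toSeq.groupBy(buildKey)
  streamSide.flatMap { s =>
    hashed.getOrElse(streamKey(s), Seq.empty).map(b => (s, b))
  }
}
```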
CartesianProduct
HashOuterJoin
Despite the single name, it covers left outer, right outer, and full outer joins.
Broadcast*
BroadcastHashJoin
BroadcastLeftSemiJoinHash
BroadcastNestedLoopJoin
LeftSemiJoinBNL
/**
* :: DeveloperApi ::
* Using BroadcastNestedLoopJoin to calculate left semi join result when there are no join keys
* for hash join.
*/
@DeveloperApi
case class LeftSemiJoinBNL(
streamed: SparkPlan, broadcast: SparkPlan, condition: Option[Expression])
extends BinaryNode {
LeftSemiJoinHash
/**
* :: DeveloperApi ::
* Build the right table's join keys into a HashSet, and iteratively go through the left
* table, to find whether the join keys are in the HashSet.
*/
@DeveloperApi
case class LeftSemiJoinHash(
leftKeys: Seq[Expression],
rightKeys: Seq[Expression],
left: SparkPlan,
right: SparkPlan) extends BinaryNode with HashJoin {
ShuffledHashJoin
/**
* :: DeveloperApi ::
* Performs an inner hash join of two child relations by first shuffling the data using the join
* keys.
*/
@DeveloperApi
case class ShuffledHashJoin(
leftKeys: Seq[Expression],
rightKeys: Seq[Expression],
buildSide: BuildSide,
left: SparkPlan,
right: SparkPlan)
extends BinaryNode with HashJoin {