Canal is widely used to parse MySQL binlogs, but turning the binlog messages Canal produces into a Spark DataFrame involves a few technical problems. Below I walk through them one by one and show how I solved them, following my own line of analysis.
First, let's look at the binlog JSON that Canal produces:
{ "data":[ { "id":"1", "name":"0", "ts":"2020-05-15 11:45:55" }, { "id":"2", "name":"0", "ts":"2020-05-15 15:09:58" } ], "database":"wow", "es":1589526850000, "id":54, "isDdl":false, "mysqlType":{ "id":"bigint(20)", "name":"varchar(255)", "ts":"datetime" }, "old":[ { "name":"1" }, { "name":"2" } ], "pkNames":null, "sql":"", "sqlType":{ "id":-5, "name":12, "ts":93 }, "table":"test", "ts":1589527037159, "type":"UPDATE"}
Field descriptions:
- data: the row data, a JSON array; for an INSERT these are the inserted rows, for an UPDATE the rows after the update, for a DELETE the deleted rows
- database: database (schema) name
- es: binlog executeTime, the execution time as a 13-digit timestamp (MySQL side)
- id: sequence number of the event, 1, 2, 3, ...
- isDdl: whether the operation is DDL
- mysqlType: MySQL column types
- old: the old row data, used for UPDATE; its size matches data one to one
- pkNames: Kafka partition key
- sql: the SQL that was executed; empty for DML
- sqlType: java.sql.Types type codes (see the quick check after this list)
- table: table name
- ts: DML build timestamp, i.e. the sync time (Canal side)
- type: operation type: INSERT/UPDATE/DELETE
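As a side note, the numeric codes in sqlType are plain java.sql.Types constants; a quick check (not part of the pipeline itself) confirms the mapping for the sample above:

import java.sql.Types

// -5, 12 and 93 in the sample's sqlType correspond to these constants:
println(Types.BIGINT)    // -5  -> id   (bigint(20))
println(Types.VARCHAR)   // 12  -> name (varchar(255))
println(Types.TIMESTAMP) // 93  -> ts   (datetime)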
As you can see, the message contains both the changed MySQL data and the MySQL schema information (everything except whether a column is nullable).
My first thought after reading it was to map the MySQL types to Spark types myself. But with so many types to handle, surely Spark has already done this? With that in mind I found JDBCRDD, which indeed has such a method:
/**
 * Takes a (schema, table) specification and returns the table's Catalyst
 * schema.
 *
 * @param options - JDBC options that contains url, table and other information.
 *
 * @return A StructType giving the table's Catalyst schema.
 * @throws SQLException if the table specification is garbage.
 * @throws SQLException if the table contains an unsupported type.
 */
def resolveTable(options: JDBCOptions): StructType = {
  val url = options.url
  val table = options.tableOrQuery
  val dialect = JdbcDialects.get(url)
  val conn: Connection = JdbcUtils.createConnectionFactory(options)()
  try {
    val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
    try {
      statement.setQueryTimeout(options.queryTimeout)
      val rs = statement.executeQuery()
      try {
        JdbcUtils.getSchema(rs, dialect, alwaysNullable = true)
      } finally {
        rs.close()
      }
    } finally {
      statement.close()
    }
  } finally {
    conn.close()
  }
}
Then step into JdbcUtils.getSchema:
/**
 * Takes a [[ResultSet]] and returns its Catalyst schema.
 *
 * @param alwaysNullable If true, all the columns are nullable.
 * @return A [[StructType]] giving the Catalyst schema.
 * @throws SQLException if the schema contains an unsupported type.
 */
def getSchema(
    resultSet: ResultSet,
    dialect: JdbcDialect,
    alwaysNullable: Boolean = false): StructType = {
  val rsmd = resultSet.getMetaData
  val ncols = rsmd.getColumnCount
  val fields = new Array[StructField](ncols)
  var i = 0
  while (i < ncols) {
    val columnName = rsmd.getColumnLabel(i + 1)
    val dataType = rsmd.getColumnType(i + 1)
    val typeName = rsmd.getColumnTypeName(i + 1)
    val fieldSize = rsmd.getPrecision(i + 1)
    val fieldScale = rsmd.getScale(i + 1)
    val isSigned = {
      try {
        rsmd.isSigned(i + 1)
      } catch {
        // Workaround for HIVE-14684:
        case e: SQLException if
          e.getMessage == "Method not supported" &&
            rsmd.getClass.getName == "org.apache.hive.jdbc.HiveResultSetMetaData" => true
      }
    }
    val nullable = if (alwaysNullable) {
      true
    } else {
      rsmd.isNullable(i + 1) != ResultSetMetaData.columnNoNulls
    }
    val metadata = new MetadataBuilder().putLong("scale", fieldScale)
    val columnType =
      dialect.getCatalystType(dataType, typeName, fieldSize, metadata).getOrElse(
        getCatalystType(dataType, fieldSize, fieldScale, isSigned))
    fields(i) = StructField(columnName, columnType, nullable)
    i = i + 1
  }
  new StructType(fields)
}
Looking at this, everything it depends on can be obtained from the binlog: apart from column nullability, fieldSize, fieldScale and isSigned can all be parsed out. For MySQL the dialect is also fixed. I found its implementation class MySQLDialect, then looked for its usages and found that in JdbcDialects the dialect is obtained by parsing the URL:
/**
 * Fetch the JdbcDialect class corresponding to a given database url.
 */
def get(url: String): JdbcDialect = {
  val matchingDialects = dialects.filter(_.canHandle(url))
  matchingDialects.length match {
    case 0 => NoopDialect
    case 1 => matchingDialects.head
    case _ => new AggregatedDialect(matchingDialects)
  }
}
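Since the source here is always MySQL, the dialect can be fixed up front by passing any MySQL-style JDBC URL; no connection is opened in the process (the URL below is just a placeholder):

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Only the "jdbc:mysql" prefix matters for the lookup; host/port/database are placeholders.
val mysqlDialect: JdbcDialect = JdbcDialects.get("jdbc:mysql://localhost:3306/wow")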
With the dialect settled, the only remaining piece is JdbcUtils.getCatalystType, which turns out to be a private method, so I thought of reflection. This took me some time: I first tried a Java-style reflection approach and could not find the method, and the solutions I found online were much the same. I then went to the official Scala documentation on reflection, where the approach is indeed quite different: https://docs.scala-lang.org/overviews/reflection/overview.html
Following the documentation, I reimplemented it:
lazy val getCatalystTypePrivate = {
  import scala.reflect.runtime.{universe => ru}
  val classMirror = ru.runtimeMirror(getClass.getClassLoader)
  val JdbcUtils = classMirror.staticModule("org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils")
  val methods = classMirror.reflectModule(JdbcUtils)
  val instanceMirror = classMirror.reflect(methods.instance)
  val method = methods.symbol.typeSignature.member(ru.TermName("getCatalystType")).asMethod
  instanceMirror.reflectMethod(method)
}
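With the dialect and the reflected getCatalystType in hand, the binlog's sqlType and mysqlType objects carry everything needed to build the Spark schema. Below is a minimal sketch of such a conversion, roughly the role played by JdbcTypeUtils.getMysqlStructType later on but not necessarily the repo's implementation; the regex, the placeholder URL and the helper shape are my own assumptions, and it assumes the getCatalystTypePrivate value defined above is in scope:

import scala.collection.JavaConverters._
import com.alibaba.fastjson.JSONObject
import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.types._

// Sketch only: build a Spark StructType from the binlog's sqlType/mysqlType objects.
val mysqlDialect = JdbcDialects.get("jdbc:mysql://placeholder:3306/db") // no connection is made
val typePattern = """^(\w+)(?:\((\d+)(?:,(\d+))?\))?.*$""".r            // e.g. "bigint(20)", "decimal(10,2)"

def getMysqlStructType(sqlType: JSONObject, mysqlType: JSONObject): StructType = {
  val fields = mysqlType.keySet().asScala.toSeq.map { col =>
    val jdbcCode = sqlType.getIntValue(col)                 // java.sql.Types code from "sqlType"
    val rawType = mysqlType.getString(col)                  // e.g. "varchar(255)" from "mysqlType"
    val typePattern(typeName, p, s) = rawType
    val precision = Option(p).map(_.toInt).getOrElse(0)     // fieldSize
    val scale = Option(s).map(_.toInt).getOrElse(0)         // fieldScale
    val signed = !rawType.toLowerCase.contains("unsigned")  // isSigned
    val metadata = new MetadataBuilder().putLong("scale", scale)
    val dataType = mysqlDialect
      .getCatalystType(jdbcCode, typeName, precision, metadata)
      .getOrElse(getCatalystTypePrivate(jdbcCode, precision, scale, signed).asInstanceOf[DataType])
    // The binlog carries no nullability information, so treat every column as nullable.
    StructField(col, dataType, nullable = true)
  }
  StructType(fields)
}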
Done. Now for the Spark code; the inline comments explain each step:
import scala.collection.JavaConverters._
import com.alibaba.fastjson.{JSON, JSONArray, JSONObject}
import org.apache.spark.sql.{Row, functions => F}
import org.apache.spark.sql.types.StringType
// spark is the active SparkSession; import spark.implicits._ is assumed for the String encoder.

// Case classes used below (RecordInfo as in the original comment; the others follow the same pattern):
// case class RecordInfo(key: String, operator: String, ts: Long, db: String, tb: String,
//                       sqlType: JSONObject, mysqlType: JSONObject, row: JSONObject)
// case class DbInfo(db: String, tb: String)
// case class MysqlSchemaInfo(db: String, tb: String, sqlType: JSONObject, mysqlType: JSONObject)
// case class SchemaInfo(db: String, tb: String, schema: StructType)

// Map each database/table to its primary key.
// If one batch contains several rows with the same key, deduplicate and keep the last one.
// Of the two currently popular open-source data lakes, Delta requires deduplication for upserts,
// while Hudi does not.
val tableKeyMap = Map("wow\001test" -> "id")

// Turn the Canal JSON into a Dataset ("canal" holds the JSON string shown earlier)
val canalDs = spark.createDataset[String](Array(canal)).as("data")

// Parse the JSON into RecordInfo and keep only the last record per key
val dataSet = canalDs.rdd.flatMap { line =>
  val canalObj = JSON.parseObject(line)
  val rows = canalObj.getJSONArray("data")
  val operator = canalObj.getString("type").toLowerCase
  val db = canalObj.getString("database")
  val tb = canalObj.getString("table")
  val ts = canalObj.getLong("ts")
  val sqlTypeObj = canalObj.getJSONObject("sqlType")
  val mysqlTypeObj = canalObj.getJSONObject("mysqlType")
  val id = tableKeyMap.get(s"${db}\001${tb}").get
  rows.asInstanceOf[JSONArray].asScala.map(r => {
    val row = r.asInstanceOf[JSONObject]
    val key = s"${db}\001${tb}\001${row.getString(id)}"
    RecordInfo(key, operator, ts, db, tb, sqlTypeObj, mysqlTypeObj, row)
  })
}.groupBy(_.key)
  .map(records => {
    val items = records._2.toSeq.sortBy(_.ts)
    items.last
  })

// RecordInfo already carries everything needed to map the MySQL schema to a Spark schema,
// so the Spark schema can be computed from it
val schemaSet = dataSet.map(record => {
    MysqlSchemaInfo(record.db, record.tb, record.sqlType, record.mysqlType)
  }).distinct()
  .groupBy(msi => DbInfo(msi.db, msi.tb))
  .map(record => {
    // If a table's columns changed within the batch there will be two records, so merge the schemas
    val typeInfo = record._2.foldLeft((new JSONObject(), new JSONObject()))((jot, msi) =>
      (jot._1.fluentPutAll(msi.sqlType), jot._2.fluentPutAll(msi.mysqlType)))
    MysqlSchemaInfo(record._1.db, record._1.tb, typeInfo._1, typeInfo._2)
  })
  .map(msi => {
    // To avoid unnecessary MySQL-to-Spark schema conversions, convert only after deduplication
    val schema = JdbcTypeUtils.getMysqlStructType(msi.sqlType, msi.mysqlType)
    SchemaInfo(msi.db, msi.tb, schema)
  }).collect()
// To cache the schemas, collect first and then run JdbcTypeUtils.getMysqlStructType,
// so the schemas are kept on the driver side

// Process each database/table
schemaSet.foreach { table => {
  // Rows without a schema yet; note that the fieldNames order must match the columns order below
  val tmpRDD = dataSet.filter(record => {
    record.db == table.db && record.tb == table.tb
  }).map(record => {
    val array = table.schema.fieldNames.map(record.row.getString(_))
    Row.fromSeq(array)
  })
  // All values in the Canal JSON are strings,
  // so first change every DataType in the schema to StringType
  val stringTypeSchema = JdbcTypeUtils.changeStructTypeToStringType(table.schema)
  // Then cast each StringType column back to its real DataType
  val columns = table.schema.fields.map(sf => {
    sf.dataType match {
      case StringType => F.col(sf.name)
      case _ => F.col(sf.name).cast(sf.dataType).as(sf.name)
    }
  })
  // Build the DataFrame with the StringType schema, then cast to the correct types
  val df = spark.createDataFrame(tmpRDD, stringTypeSchema)
    .select(columns: _*)
  // From here on, the DataFrame is yours to work with
  df.collect().foreach(data => println(s"${table.db}.${table.tb}:" + data))
}}
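For completeness, here is one possible sketch of the changeStructTypeToStringType helper referenced above; the real implementation lives in the repo, but the idea is simply to keep the field names and force every type to StringType, since all values coming out of the Canal JSON are strings:

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Keep field names, force StringType everywhere; casting back happens via select(columns) above.
def changeStructTypeToStringType(schema: StructType): StructType =
  StructType(schema.fields.map(f => StructField(f.name, StringType, nullable = true)))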
For the full code, see the GitHub address below. There is also another implementation, which is simpler but requires a connection to MySQL (see the sketch after this paragraph). Quite generic, isn't it: the binlog goes straight into a DataFrame, which makes it much easier to work with.
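As for that simpler, connection-based alternative, one possible sketch (not necessarily the repo's implementation; URL, credentials and table name are placeholders) is to let Spark's JDBC reader resolve the schema straight from MySQL:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

// Resolve a table's Spark schema directly from MySQL via the JDBC data source.
// load() is lazy, so only metadata is fetched here; all connection options are placeholders.
def resolveSchemaFromMysql(spark: SparkSession, db: String, tb: String): StructType =
  spark.read
    .format("jdbc")
    .option("url", s"jdbc:mysql://localhost:3306/$db")
    .option("dbtable", tb)
    .option("user", "user")
    .option("password", "password")
    .load()
    .schema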