Using Canal to parse MySQL binlogs is very common, but turning the binlog that Canal produces into a Spark DataFrame still involves a few technical problems. Below I walk through them one by one, following the same line of analysis I used to solve the problem. First, take a look at the binlog JSON after Canal has parsed it:

{    "data":[        {            "id":"1",            "name":"0",            "ts":"2020-05-15 11:45:55"        },        {            "id":"2",            "name":"0",            "ts":"2020-05-15 15:09:58"        }    ],    "database":"wow",    "es":1589526850000,    "id":54,    "isDdl":false,    "mysqlType":{        "id":"bigint(20)",        "name":"varchar(255)",        "ts":"datetime"    },    "old":[        {            "name":"1"        },        {            "name":"2"        }    ],    "pkNames":null,    "sql":"",    "sqlType":{        "id":-5,        "name":12,        "ts":93    },    "table":"test",    "ts":1589527037159,    "type":"UPDATE"}

Field descriptions:

data: the changed rows as a JSON array; for an INSERT these are the inserted rows, for an UPDATE the rows after the update, for a DELETE the deleted rows
database: database or schema name
es: binlog executeTime, the execution time as a 13-digit timestamp (MySQL side)
id: sequence number of the event operation: 1, 2, 3…
isDdl: whether the event is a DDL operation
mysqlType: the MySQL column types
old: the old values, used for UPDATE; its size matches data's size one-to-one
pkNames: Kafka partition key
sql: the executed SQL; empty for DML
sqlType: java.sql.Types type codes
table: table name
ts: DML build timestamp, the sync time (Canal side)
type: operation type: INSERT/UPDATE/DELETE

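To make the message layout concrete, here is a minimal parsing sketch, not from the original project: it assumes the fastjson library used later in the article, and the CanalMessage case class and parseCanal helper are names introduced purely for illustration.

import com.alibaba.fastjson.{JSON, JSONObject}
import scala.collection.JavaConverters._

// Hypothetical model of the Canal message; field names follow the JSON above.
case class CanalMessage(
    database: String,
    table: String,
    isDdl: Boolean,
    eventType: String,               // INSERT / UPDATE / DELETE
    ts: Long,                        // sync time on the Canal side
    data: Seq[Map[String, String]],  // rows after the change (all values are strings)
    sqlType: Map[String, Int],       // java.sql.Types codes per column
    mysqlType: Map[String, String])  // raw MySQL type strings per column

// Assumes a DML event, i.e. "data", "sqlType" and "mysqlType" are present.
def parseCanal(json: String): CanalMessage = {
  val obj = JSON.parseObject(json)
  val rows = obj.getJSONArray("data").asScala.map { r =>
    r.asInstanceOf[JSONObject].asScala.map { case (k, v) => k -> String.valueOf(v) }.toMap
  }.toSeq
  CanalMessage(
    obj.getString("database"),
    obj.getString("table"),
    obj.getBooleanValue("isDdl"),
    obj.getString("type"),
    obj.getLongValue("ts"),
    rows,
    obj.getJSONObject("sqlType").asScala.map { case (k, v) => k -> v.toString.toInt }.toMap,
    obj.getJSONObject("mysqlType").asScala.map { case (k, v) => k -> v.toString }.toMap)
}
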
As you can see, the message carries both the changed MySQL data and the MySQL schema information (only column nullability is missing). At first glance my instinct was to convert the MySQL types to Spark types myself, but with so many types to adapt, surely Spark has already done this? With that thought I found JDBCRDD, which indeed has the method:

/**
 * Takes a (schema, table) specification and returns the table's Catalyst
 * schema.
 *
 * @param options - JDBC options that contains url, table and other information.
 *
 * @return A StructType giving the table's Catalyst schema.
 * @throws SQLException if the table specification is garbage.
 * @throws SQLException if the table contains an unsupported type.
 */
def resolveTable(options: JDBCOptions): StructType = {
  val url = options.url
  val table = options.tableOrQuery
  val dialect = JdbcDialects.get(url)
  val conn: Connection = JdbcUtils.createConnectionFactory(options)()
  try {
    val statement = conn.prepareStatement(dialect.getSchemaQuery(table))
    try {
      statement.setQueryTimeout(options.queryTimeout)
      val rs = statement.executeQuery()
      try {
        JdbcUtils.getSchema(rs, dialect, alwaysNullable = true)
      } finally {
        rs.close()
      }
    } finally {
      statement.close()
    }
  } finally {
    conn.close()
  }
}

Then step into JdbcUtils.getSchema:

/**
 * Takes a [[ResultSet]] and returns its Catalyst schema.
 *
 * @param alwaysNullable If true, all the columns are nullable.
 * @return A [[StructType]] giving the Catalyst schema.
 * @throws SQLException if the schema contains an unsupported type.
 */
def getSchema(
    resultSet: ResultSet,
    dialect: JdbcDialect,
    alwaysNullable: Boolean = false): StructType = {
  val rsmd = resultSet.getMetaData
  val ncols = rsmd.getColumnCount
  val fields = new Array[StructField](ncols)
  var i = 0
  while (i < ncols) {
    val columnName = rsmd.getColumnLabel(i + 1)
    val dataType = rsmd.getColumnType(i + 1)
    val typeName = rsmd.getColumnTypeName(i + 1)
    val fieldSize = rsmd.getPrecision(i + 1)
    val fieldScale = rsmd.getScale(i + 1)
    val isSigned = {
      try {
        rsmd.isSigned(i + 1)
      } catch {
        // Workaround for HIVE-14684:
        case e: SQLException if
          e.getMessage == "Method not supported" &&
            rsmd.getClass.getName == "org.apache.hive.jdbc.HiveResultSetMetaData" => true
      }
    }
    val nullable = if (alwaysNullable) {
      true
    } else {
      rsmd.isNullable(i + 1) != ResultSetMetaData.columnNoNulls
    }
    val metadata = new MetadataBuilder().putLong("scale", fieldScale)
    val columnType =
      dialect.getCatalystType(dataType, typeName, fieldSize, metadata).getOrElse(
        getCatalystType(dataType, fieldSize, fieldScale, isSigned))
    fields(i) = StructField(columnName, columnType, nullable)
    i = i + 1
  }
  new StructType(fields)
}

After going through this code, it turns out that everything it depends on can be obtained from the binlog: apart from column nullability, fieldSize, fieldScale, and isSigned can all be parsed out. For MySQL the dialect is also fixed; its implementation class is MySQLDialect, and following its references shows that JdbcDialects resolves it by parsing the JDBC URL:

/**
 * Fetch the JdbcDialect class corresponding to a given database url.
 */
def get(url: String): JdbcDialect = {
  val matchingDialects = dialects.filter(_.canHandle(url))
  matchingDialects.length match {
    case 0 => NoopDialect
    case 1 => matchingDialects.head
    case _ => new AggregatedDialect(matchingDialects)
  }
}

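As a side note, the dialect lookup can be exercised without opening a database connection, and the precision/scale/signedness that getSchema needs can be recovered from the raw mysqlType strings in the binlog. The sketch below is illustrative only: parseMysqlType is a hypothetical helper, and the regex is an assumption about how type strings such as "bigint(20)" or "decimal(10,2) unsigned" could be parsed.

import org.apache.spark.sql.jdbc.JdbcDialects

// The URL only needs to start with "jdbc:mysql" for the MySQL dialect to match;
// no connection is opened here.
val mysqlDialect = JdbcDialects.get("jdbc:mysql://localhost:3306/wow")

// Illustrative parsing of a raw MySQL type string into (precision, scale, signed),
// e.g. "bigint(20)" -> (20, 0, true), "decimal(10,2) unsigned" -> (10, 2, false).
def parseMysqlType(mysqlType: String): (Int, Int, Boolean) = {
  val precisionScale = "\\((\\d+)(?:\\s*,\\s*(\\d+))?\\)".r
  val (precision, scale) = precisionScale.findFirstMatchIn(mysqlType) match {
    case Some(m) => (m.group(1).toInt, Option(m.group(2)).map(_.toInt).getOrElse(0))
    case None    => (0, 0)
  }
  val signed = !mysqlType.toLowerCase.contains("unsigned")
  (precision, scale, signed)
}
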
With the dialect settled, the only remaining piece is JdbcUtils.getCatalystType, which turns out to be a private method, so reflection is needed. This took me some time: I first tried a Java-style reflection approach and could not find the method, and the implementations I found online were much the same. So I went to the official Scala documentation on reflection, where the approach is indeed quite different: https://docs.scala-lang.org/overviews/reflection/overview.html Following the docs, I reimplemented it:

lazy val getCatalystTypePrivate = {
  import scala.reflect.runtime.{universe => ru}
  val classMirror = ru.runtimeMirror(getClass.getClassLoader)
  val JdbcUtils = classMirror.staticModule("org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils")
  val methods = classMirror.reflectModule(JdbcUtils)
  val instanceMirror = classMirror.reflect(methods.instance)
  val method = methods.symbol.typeSignature.member(ru.TermName("getCatalystType")).asMethod
  instanceMirror.reflectMethod(method)
}

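A quick usage example of the reflective handle, hedged: the argument order (sqlType, precision, scale, signed) mirrors the call to getCatalystType seen in getSchema above, and the cast of the returned value to DataType is an assumption about the private method's return type.

import java.sql.Types
import org.apache.spark.sql.types.DataType

// Signed BIGINT(20), scale 0 -> should resolve to LongType.
val bigintType = getCatalystTypePrivate(Types.BIGINT, 20, 0, true).asInstanceOf[DataType]
// VARCHAR(255) -> should resolve to StringType; DATETIME maps to Types.TIMESTAMP -> TimestampType.
val varcharType = getCatalystTypePrivate(Types.VARCHAR, 255, 0, true).asInstanceOf[DataType]
val datetimeType = getCatalystTypePrivate(Types.TIMESTAMP, 0, 0, true).asInstanceOf[DataType]
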
That's sorted. Now for the Spark code; explanations are in the comments:

// Assumes a SparkSession named `spark` is in scope, with `import spark.implicits._`.
import com.alibaba.fastjson.{JSON, JSONArray, JSONObject}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.{functions => F}
import scala.collection.JavaConverters._

// Map each database/table to its primary key.
// If a batch contains multiple records with the same primary key, deduplicate and keep the last one.
// Of the two currently popular open-source data lakes, Delta requires deduplication for upserts, Hudi does not.
val tableKeyMap = Map("wow\001test" -> "id")

// Turn the Canal JSON into a Dataset.
val canalDs = spark.createDataset[String](Array(canal)).as("data")

// case class RecordInfo(key: String, operator: String,
//   ts: Long, db: String, tb: String,
//   sqlType: JSONObject, mysqlType: JSONObject, row: JSONObject)
// Parse the JSON into RecordInfo and keep only the last record per key.
val dataSet = canalDs.rdd.flatMap { line =>
  val canalObj = JSON.parseObject(line)
  val rows = canalObj.getJSONArray("data")
  val operator = canalObj.getString("type").toLowerCase
  val db = canalObj.getString("database")
  val tb = canalObj.getString("table")
  val ts = canalObj.getLong("ts")
  val sqlTypeObj = canalObj.getJSONObject("sqlType")
  val mysqlTypeObj = canalObj.getJSONObject("mysqlType")
  val id = tableKeyMap.get(s"${db}\001${tb}").get
  rows.asInstanceOf[JSONArray].asScala.map(r => {
    val row = r.asInstanceOf[JSONObject]
    val key = s"${db}\001${tb}\001${row.getString(id)}"
    RecordInfo(key, operator, ts, db, tb, sqlTypeObj, mysqlTypeObj, row)
  })
}.groupBy(_.key)
  .map(records => {
    val items = records._2.toSeq.sortBy(_.ts)
    items.last
  })

// RecordInfo already carries everything needed to map the MySQL schema to a Spark schema,
// so the Spark schema can be computed from it.
// (Case class shapes inferred from usage below:)
// case class MysqlSchemaInfo(db: String, tb: String, sqlType: JSONObject, mysqlType: JSONObject)
// case class DbInfo(db: String, tb: String)
// case class SchemaInfo(db: String, tb: String, schema: StructType)
val schemaSet = dataSet.map(record => {
    MysqlSchemaInfo(record.db, record.tb, record.sqlType, record.mysqlType)
  }).distinct()
  .groupBy(msi => DbInfo(msi.db, msi.tb))
  .map(record => {
    // If a table's columns changed within the batch there will be two records, so merge the schemas.
    val typeInfo = record._2.foldLeft((new JSONObject(), new JSONObject()))((jot, msi)
      => (jot._1.fluentPutAll(msi.sqlType), jot._2.fluentPutAll(msi.mysqlType)))
    MysqlSchemaInfo(record._1.db, record._1.tb, typeInfo._1, typeInfo._2)
  })
  .map(msi => {
    // Convert after deduplication to avoid unnecessary MySQL schema conversions.
    val schema = JdbcTypeUtils.getMysqlStructType(msi.sqlType, msi.mysqlType)
    SchemaInfo(msi.db, msi.tb, schema)
  }).collect()
// To cache the schemas, collect first and then run JdbcTypeUtils.getMysqlStructType,
// keeping the schemas on the driver side.

// Process each database table.
schemaSet.foreach { table => {
    // Build Rows without a schema; note that the order of fieldNames matches the order of columns below.
    val tmpRDD = dataSet.filter(record => {
      record.db == table.db && record.tb == table.tb
    }).map(record => {
      val array = table.schema.fieldNames.map(record.row.getString(_))
      Row.fromSeq(array)
    })
    // All values in the Canal JSON are strings,
    // so first change every DataType in the schema to StringType.
    val stringTypeSchema = JdbcTypeUtils.changeStructTypeToStringType(table.schema)
    // Then cast each StringType column back to its proper DataType.
    val columns = table.schema.fields.map(sf => {
      sf.dataType match {
        case StringType => F.col(sf.name)
        case _ => F.col(sf.name).cast(sf.dataType).as(sf.name)
      }
    })
    // Create the DataFrame with the StringType schema, then convert to the correct types.
    val df = spark.createDataFrame(tmpRDD, stringTypeSchema)
      .select(columns: _*)
    // From here on, process it however you like.
    df.collect().foreach(data => println(s"${table.db}.${table.tb}:" + data))
  }
}

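JdbcTypeUtils.getMysqlStructType and changeStructTypeToStringType above come from the full project and are not shown in this post. Purely as a sketch of what they might look like, the version below wires together the pieces discussed earlier: the MySQL dialect, the reflective getCatalystTypePrivate handle, and a parseMysqlType helper like the illustrative one above. Since the binlog carries no nullability information, every column is simply marked nullable.

import com.alibaba.fastjson.JSONObject
import org.apache.spark.sql.jdbc.JdbcDialects
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

object JdbcTypeUtils {

  // getCatalystTypePrivate and parseMysqlType are the helpers sketched earlier in this article.
  private val dialect = JdbcDialects.get("jdbc:mysql://")

  // Build a Spark StructType from Canal's sqlType (java.sql.Types codes) and
  // mysqlType (raw MySQL type strings) objects.
  def getMysqlStructType(sqlType: JSONObject, mysqlType: JSONObject): StructType = {
    val fields = sqlType.keySet().asScala.toSeq.map { name =>
      val jdbcCode = sqlType.getIntValue(name)
      val typeName = mysqlType.getString(name)
      val (precision, scale, signed) = parseMysqlType(typeName)
      val metadata = new MetadataBuilder().putLong("scale", scale)
      val dataType = dialect
        .getCatalystType(jdbcCode, typeName, precision, metadata)
        .getOrElse(getCatalystTypePrivate(jdbcCode, precision, scale, signed)
          .asInstanceOf[DataType])
      // Nullability is unknown from the binlog, so treat every column as nullable.
      StructField(name, dataType, nullable = true)
    }
    StructType(fields)
  }

  // All Canal values arrive as strings, so the staging schema is all StringType.
  def changeStructTypeToStringType(schema: StructType): StructType =
    StructType(schema.fields.map(_.copy(dataType = StringType)))
}
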
For the full code, see the GitHub repository referenced below. There is also another, simpler implementation, but it requires a connection to MySQL. This approach is quite generic: the binlog is converted straight into a DataFrame, which makes it much more convenient to work with.