spark 多个dataframe合并 spark dataframe foreach

转载

mob64ca14163a4f 2023-09-11 21:21:50

文章标签 spark 多个dataframe合并连接池数据 spark 文章分类 Spark 大数据

一、简单介绍

DStream.foreachRDD()方法实际上是Spark流处理的一个处理及输出RDD的方法。这个方法使我们能够访问底层的DStream对应的RDD进而根据我们需要的逻辑对其进行处理。例如，我们可以通过foreachRDD()方法来访问每一条mini-batch中的数据，然后将它们存入数据库。

需要注意的是：

DStream.foreachRDD()传给我们的参数是一个RDD[userbatch]，而不是单个的user，因此我们需要进一步循环来根据需要处理其中每一个user。

我们不能用传统的for ele in iterable方法来循环其中的元素。因此我们需要用rdd.foreach()来分别处理其中每一个user。

进一步分析Spark的流处理过程：我们拥有几个Spark的executor。对于稳定到来的数据流，Spark Streaming负责根据一定的时间间隔将流输入分成小的数据块（batch），然后Spark将这些小数据块（mini-batch）分配给不同的executor。通过这样的操作，Spark实现了并行的数据计算，从而加速了数据处理的速度。

二、源码分析

源码如下：

/**
   * Apply a function to each RDD in this DStream. This is an output operator, so
   * 'this' DStream will be registered as an output stream and therefore materialized.
   */
  def foreachRDD(foreachFunc: (RDD[T], Time) => Unit): Unit = ssc.withScope {
    // because the DStream is reachable from the outer object here, and because
    // DStreams can't be serialized with closures, we can't proactively check
    // it for serializability and so we pass the optional false to SparkContext.clean
    foreachRDD(foreachFunc, displayInnerRDDOps = true)
  }

  /**
   * Apply a function to each RDD in this DStream. This is an output operator, so
   * 'this' DStream will be registered as an output stream and therefore materialized.
   * @param foreachFunc foreachRDD function
   * @param displayInnerRDDOps Whether the detailed callsites and scopes of the RDDs generated
   *                           in the `foreachFunc` to be displayed in the UI. If `false`, then
   *                           only the scopes and callsites of `foreachRDD` will override those
   *                           of the RDDs on the display.
   */
  private def foreachRDD(
      foreachFunc: (RDD[T], Time) => Unit,
      displayInnerRDDOps: Boolean): Unit = {
    new ForEachDStream(this,
      context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
  }

具体只能浅显的看出，该方法获取了DStream对象的RDD，对每个RDD进行处理后，不返回值，通常是将RDD处理后写入数据库或传递至其他下游继续处理。

三、正确使用方法

误区一：在driver上创建连接对象（比如网络连接或数据库连接）

如果在driver上创建连接对象，然后在RDD的算子函数内使用连接对象，那么就意味着需要将连接对象序列化后从driver传递到worker上。而连接对象（比如Connection对象）通常来说是不支持序列化的，此时通常会报序列化的异常（serialization errors）。因此连接对象必须在worker上创建，不要在driver上创建。

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // 在driver上执行
  rdd.foreach { record =>
    connection.send(record) // 在worker上执行
  }
}

误区二：为每一条记录都创建一个连接对象

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

通常来说，连接对象的创建和销毁都是很消耗时间的。因此频繁地创建和销毁连接对象，可能会导致降低spark作业的整体性能和吞吐量。

正确做法一：为每个RDD分区创建一个连接对象

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

比较正确的做法是：对DStream中的RDD，调用foreachPartition，对RDD中每个分区创建一个连接对象，使用一个连接对象将一个分区内的数据都写入底层MySQL中。这样可以大大减少创建的连接对象的数量。

正确做法二：为每个RDD分区使用一个连接池中的连接对象

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // 静态连接池，同时连接是懒创建的
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // 用完以后将连接返回给连接池，进行复用
  }
}

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。