一、读取日志格式
使用的测试数据是Tomcat的访问日志,日志格式如下
192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/ HTTP/1.1" 200 259
192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/head.jsp HTTP/1.1" 200 713
192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/body.jsp HTTP/1.1" 200 240
192.168.88.1 - - [30/Jul/2017:12:54:37 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
二、使用SparkCore获取日志并进行解析
目的:将每一行的日志中的jsp页面的访问记录存入数据库(MySQL,Oracle)。
解析:由于日志具有一定的格式,所以对下面的日志进行截取(String的操作),获得head.jsp。最后将(head.jsp,1)存入数据库的表中。
192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/head.jsp HTTP/1.1" 200 713
三、日志解析代码
def main(args: Array[String]): Unit = {
//定义SparkContext对象
val conf = new SparkConf().setAppName("MyWebLogDemo").setMaster("local")
val sc = new SparkContext(conf)
// 读入日志文件
//rdd1结果 :(hadoop.jsp,1)
val rdd1 = sc.textFile("D:\\localhost_access_log.2017-07-30.txt").map{
//line: 相当于value1
line => {
//处理该行日志: 192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/head.jsp HTTP/1.1" 200 713
//解析字符串,找到jsp的名字
//第一步解析出:GET /MyDemoWeb/head.jsp HTTP/1.1
val index1 = line.indexOf("\"") //第一个双引号的位置
val index2 = line.lastIndexOf("\"") //第二个双引号的位置
val str1 = line.substring(index1+1,index2)
//第二步解析出:/MyDemoWeb/head.jsp
val index3 = str1.indexOf(" ")
val index4 = str1.lastIndexOf(" ")
val str2 = str1.substring(index3+1, index4)
//第三步解析出: head.jsp
val jspName = str2.substring(str2.lastIndexOf("/")+1)
//返回 (hadoop.jsp,1) ---> 保存到Oracle中
(jspName,1)
}
}
四、对每一个分区进行数据的存储到数据库
//针对分区,创建Connection,将结果保存到数据库中
rdd1.foreachPartition(saveToOracle)
//定义一个函数,将某一个分区中的数据(jspname,count)保存到Oracle中
def saveToOracle(it:Iterator[(String,Int)]) ={
var conn:Connection = null
var pst:PreparedStatement = null
try{
//创建一个Connection
//conn = DriverManager.getConnection("jdbc:oracle:thin:@10.30.30.210:1521/orcl.example.com","scott", "tiger")//插入oracle数据库
conn = DriverManager.getConnection("jdbc:mysql://10.30.3.104:3306/test","root", "2561304")//插入mysql数据库
pst = conn.prepareStatement("insert into result values(?,?)")
//把rdd1的一个分区中的数据插入到Oracle中
it.foreach(data =>{
pst.setString(1, data._1)
pst.setInt(2,data._2)
pst.executeUpdate()
})
}catch{
case e1:Exception => e1.printStackTrace()
}finally{
if(pst != null) pst.close()
if(conn != null) conn.close()
}
}
五、登录数据库并创建表()
MariaDB [test]> create table result(jspname varchar(20),count Int(4));
Query OK, 0 rows affected (0.13 sec)
此时查看表中的数据为空:
MariaDB [test]> select * from result;
Empty set (0.00 sec)
运行代码后再次查看:
MariaDB [test]> select * from result;
六、测试日志和原理解析
测试日志的完整内容如下:D:\\localhost_access_log.2017-07-30.txt
192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/ HTTP/1.1" 200 259
192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/head.jsp HTTP/1.1" 200 713
192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/body.jsp HTTP/1.1" 200 240
192.168.88.1 - - [30/Jul/2017:12:54:37 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:38 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:38 +0800] "GET /MyDemoWeb/java.jsp HTTP/1.1" 200 240
192.168.88.1 - - [30/Jul/2017:12:54:40 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:40 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:41 +0800] "GET /MyDemoWeb/mysql.jsp HTTP/1.1" 200 241
192.168.88.1 - - [30/Jul/2017:12:54:41 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:42 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:54:42 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:52 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:52 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:53 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:54 +0800] "GET /MyDemoWeb/mysql.jsp HTTP/1.1" 200 241
192.168.88.1 - - [30/Jul/2017:12:54:54 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:54 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:56 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:54:56 +0800] "GET /MyDemoWeb/java.jsp HTTP/1.1" 200 240
192.168.88.1 - - [30/Jul/2017:12:54:57 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:57 +0800] "GET /MyDemoWeb/java.jsp HTTP/1.1" 200 240
192.168.88.1 - - [30/Jul/2017:12:54:58 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:58 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:59 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:54:59 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:55:00 +0800] "GET /MyDemoWeb/mysql.jsp HTTP/1.1" 200 241
192.168.88.1 - - [30/Jul/2017:12:55:00 +0800] "GET /MyDemoWeb/oracle.jsp HTTP/1.1" 200 242
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/web.jsp HTTP/1.1" 200 239
192.168.88.1 - - [30/Jul/2017:12:55:02 +0800] "GET /MyDemoWeb/hadoop.jsp HTTP/1.1" 200 242
原理解析:为什么存入数据库的时候需要使用 rdd1.foreachPartition(saveToOracle)方法?为什么无法直接进行数据库的存储?
我们通过传统的数据库连接获取,执行SQL和保存代码运行:
var conn:Connection = null
var pst:PreparedStatement = null
try{
//创建一个Connection
//conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.157.101:1521/orcl.example.com","scott", "tiger")
conn = DriverManager.getConnection("jdbc:mysql://10.30.3.104:3306/test","root", "2561304")//插入mysql数据库
//把rdd1的每条数据都插入到Oracle中
rdd1.foreach(f => {
//插入数据
pst = conn.prepareStatement("insert into result values(?,?)")
pst.setString(1, f._1) //JSP的名字
pst.setInt(2,f._2) //记一次数
pst.executeUpdate()
})
}catch{
case e1:Exception => e1.printStackTrace()
}finally{
if(pst != null) pst.close()
if(conn != null) conn.close()
}
运行后出现如下错误:提示没有序列化
Serialization stack:
- object not serializable (class: java.lang.Object, value: java.lang.Object@7343922c)
- writeObject data (class: java.util.HashMap)
原因就在于connection连接RDD时是针对整个RDD使用一个connection,但是每一个RDD里面存在多个分区。connection在多个节点(分区)上处理的时候,需要传输connection,但是connection是非序列化的,因此connection在处理完分区1,再去处理分区2的时候,connection无法传输(没有序列化)。因此解决的办法就是针对每一个分区创建一个connection连接。那么针对每一个分区都有connection去操作,connection不需要网络传输。而针对每一个分区进行操作的方法就是:
RDD.scala
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
七、完整代码
package lesson0613
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import java.sql.Connection
import java.sql.DriverManager
import java.sql.PreparedStatement
object MyOracleDemo {
def main(args: Array[String]): Unit = {
//定义SparkContext对象
val conf = new SparkConf().setAppName("MyWebLogDemo").setMaster("local")
val sc = new SparkContext(conf)
// 读入日志文件
//rdd1结果 :(hadoop.jsp,1)
val rdd1 = sc.textFile("D:\\localhost_access_log.2017-07-30.txt").map{
//line: 相当于value1
line => {
//处理该行日志: 192.168.88.1 - - [30/Jul/2017:12:53:43 +0800] "GET /MyDemoWeb/head.jsp HTTP/1.1" 200 713
//解析字符串,找到jsp的名字
//第一步解析出:GET /MyDemoWeb/head.jsp HTTP/1.1
val index1 = line.indexOf("\"") //第一个双引号的位置
val index2 = line.lastIndexOf("\"") //第二个双引号的位置
val str1 = line.substring(index1+1,index2)
//第二步解析出:/MyDemoWeb/head.jsp
val index3 = str1.indexOf(" ")
val index4 = str1.lastIndexOf(" ")
val str2 = str1.substring(index3+1, index4)
//第三步解析出: head.jsp
val jspName = str2.substring(str2.lastIndexOf("/")+1)
//返回 (hadoop.jsp,1) ---> 保存到Oracle中
(jspName,1)
}
}
//针对分区,创建Connection,将结果保存到数据库中
rdd1.foreachPartition(saveToOracle)
// var conn:Connection = null
// var pst:PreparedStatement = null
// try{
// //创建一个Connection
// //conn = DriverManager.getConnection("jdbc:oracle:thin:@192.168.157.101:1521/orcl.example.com","scott", "tiger")
// conn = DriverManager.getConnection("jdbc:mysql://10.30.3.104:3306/test","root", "2561304")//插入mysql数据库
//
// //把rdd1的每条数据都插入到Oracle中
// rdd1.foreach(f => {
// //插入数据
// pst = conn.prepareStatement("insert into result values(?,?)")
// pst.setString(1, f._1) //JSP的名字
// pst.setInt(2,f._2) //记一次数
// pst.executeUpdate()
// })
// }catch{
// case e1:Exception => e1.printStackTrace()
// }finally{
// if(pst != null) pst.close()
// if(conn != null) conn.close()
// }
//
// }
//定义一个函数,将某一个分区中的数据(jspname,count)保存到Oracle中
def saveToOracle(it:Iterator[(String,Int)]) ={
var conn:Connection = null
var pst:PreparedStatement = null
try{
//创建一个Connection
//conn = DriverManager.getConnection("jdbc:oracle:thin:@10.30.30.210:1521/orcl.example.com","scott", "tiger")
conn = DriverManager.getConnection("jdbc:mysql://10.30.3.104:3306/test","root", "2561304")//插入mysql数据库
pst = conn.prepareStatement("insert into result values(?,?)")
//把rdd1的一个分区中的数据插入到Oracle中
it.foreach(data =>{
pst.setString(1, data._1)
pst.setInt(2,data._2)
pst.executeUpdate()
})
}catch{
case e1:Exception => e1.printStackTrace()
}finally{
if(pst != null) pst.close()
if(conn != null) conn.close()
}
}
}