// Kafka installation and configuration, and Spark Streaming consumption
# by  coco
# 2015-07-06

Prerequisites: ZooKeeper is running on the following machines:
192.168.8.94
192.168.8.95
192.168.8.96

Kafka is installed in cluster mode on:
192.168.8.98
192.168.8.97

1. Install the ZooKeeper cluster. (omitted here)

2. Install Kafka
wget http://apache.dataguru.cn/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz 
or:
curl -L -O http://mirrors.cnnic.cn/apache/kafka/0.9.0.0/kafka_2.10-0.9.0.0.tgz 

Extract: tar -xvf kafka_2.10-0.8.2.1.tgz -C /usr/local/    (later commands assume the install directory is reachable as /usr/local/kafka, e.g. via a symlink)
Edit the configuration file: vim ./config/server.properties
log.dirs=/data/kafka-logs 
zookeeper.connect=192.168.8.94:2181  
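
For a two-broker cluster each broker needs its own broker.id, and it is common to list the whole ZooKeeper ensemble. A minimal server.properties sketch follows; the broker.id assignment and the partition default are assumptions, not taken from the original setup:

# /usr/local/kafka/config/server.properties (one copy per broker; values below are assumptions)
broker.id=0          # must be unique per broker, e.g. 0 on 192.168.8.97 and 1 on 192.168.8.98
port=9092
log.dirs=/data/kafka-logs
zookeeper.connect=192.168.8.94:2181,192.168.8.95:2181,192.168.8.96:2181
num.partitions=2     # assumed default partition count for auto-created topics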


Start Kafka: /usr/local/kafka/bin/kafka-server-start.sh  /usr/local/kafka/config/server.properties &
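
The tests below use a topic named test. If it has not been auto-created, it can be created explicitly; the replication factor of 2 (one replica on each broker) is an assumption chosen with the failover test below in mind:

/usr/local/kafka/bin/kafka-topics.sh --create --zookeeper 192.168.8.94:2181 --replication-factor 2 --partitions 2 --topic test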

Test sending messages (producing to the topic through the broker on the .97 server):
[root@bogon config]# /usr/local/kafka/bin/kafka-console-producer.sh --broker-list 192.168.8.97:9092 --topic test
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
eeeee
dddddd
ttttt

Test the consumer side:
[root@bogon ~]# /usr/local/kafka/bin/kafka-console-consumer.sh --zookeeper 192.168.8.94:2181 --topic test --from-beginning
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
eeeee
dddddd
ttttt

This shows the service is up and usable.


II. High-availability test

Stop the Kafka service on the .97 server: producing and consuming still work, because it forms a cluster with the .98 server. During testing, however, an unclean shutdown of the service caused a few messages to be lost (the "cccc" message was dropped).
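
Part of that loss is expected when the producer does not wait for acknowledgements. As a sketch (assuming the --request-required-acks option of the 0.8.x console producer and a topic replicated across both brokers), requiring acks from all in-sync replicas narrows the window for losing messages during a broker failure:

/usr/local/kafka/bin/kafka-console-producer.sh --broker-list 192.168.8.97:9092,192.168.8.98:9092 --topic test --request-required-acks -1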


Below is a script that pushes no fewer than 20 messages per second into the Kafka queue:


from gcutils.queue import KafkaMgr   # gcutils is an in-house helper library; KafkaMgr wraps a Kafka producer
import time

mgr = KafkaMgr("192.168.8.98:9092")  # connect to the broker on 192.168.8.98

# push one message roughly every 10 ms (~100 messages/second) to topic "test"
while True:
    mgr.send_message("test", "aaaaaa")
    time.sleep(0.01)
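
KafkaMgr comes from an in-house gcutils helper library. For readers without it, a roughly equivalent sketch using the open-source kafka-python package (an assumption; not what the original script used) would be:

# producer sketch using kafka-python (pip install kafka-python)
import time

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="192.168.8.98:9092",
                         acks="all")       # wait for all in-sync replicas before a send counts as done

while True:
    producer.send("test", b"aaaaaa")       # topic "test", payload as bytes
    time.sleep(0.01)                       # roughly 100 messages per second
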
Next, test Spark Streaming consuming the messages from the Kafka queue.

####### Install the Spark service:
Download the Spark release:
 spark -> /usr/local/spark-2.0.2-bin-hadoop2.6/      (symlink; the actual package lives on the 192.168.8.98 server)

Three configuration files were modified:
[root@hadoop98 conf]# cat core-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.8.94:8020</value>
  </property>
</configuration>

Modify the configuration for Hive integration:
[root@hadoop98 conf]# cat hive-site.xml 
<configuration>

  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.8.94:3306/hive?createDatabaseIfNotExist=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>gc895316</value>
  </property>

</configuration>

The default Spark configuration file is left entirely commented out here, so none of these settings are active:
[root@hadoop98 conf]# cat spark-defaults.conf 
# Example:
#spark.master                     spark://172.17.17.105:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
#spark.driver.memory              2g
#spark.executor.cores             1
#spark.executor.memory            2g

# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"


Start Spark:
[root@hadoop98 kafkatest]#  /usr/local/spark/bin/pyspark --jars spark-streaming-kafka-0-8-assembly_2.11-2.0.2.jar      // this is just the launch command with the Kafka assembly jar on the classpath; a plain pyspark also starts the shell.

Test Spark:
[root@hadoop98 kafkatest]#  /usr/local/spark/bin/pyspark --jars spark-streaming-kafka-0-8-assembly_2.11-2.0.2.jar 
Python 2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Dec  6 2015, 18:08:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
17/04/07 12:02:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/04/07 12:02:52 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.2
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkSession available as 'spark'.
>>> 
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.kafka import KafkaUtils
>>> ssc = StreamingContext(sc, 2)
>>> kvs = KafkaUtils.createStream(ssc, "192.168.8.94:2181", "spark-streaming-consumer", {"test": 2})
>>> count = kvs.count()
>>> count.pprint()
>>> ssc.start()
>>> 17/04/07 12:03:23 WARN AppInfo$: Can't read Kafka version from MANIFEST.MF. Possible cause: java.lang.NullPointerException
17/04/07 12:03:23 WARN RangeAssignor: No broker partitions consumed by consumer thread spark-streaming-consumer_hadoop98-1491537803168-745cc09a-0 for topic test
17/04/07 12:03:23 WARN RangeAssignor: No broker partitions consumed by consumer thread spark-streaming-consumer_hadoop98-1491537803168-745cc09a-1 for topic test
-------------------------------------------
Time: 2017-04-07 12:03:22
-------------------------------------------

-------------------------------------------
Time: 2017-04-07 12:03:24
-------------------------------------------

Spark Streaming successfully consumes the messages from the Kafka queue.
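
For running the same job outside the interactive shell, a minimal standalone sketch (the file name and batch interval are assumptions) can be submitted with spark-submit:

# kafka_stream_count.py -- count the messages arriving on topic "test" in each 2-second batch
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaStreamCount")
ssc = StreamingContext(sc, 2)                 # 2-second batch interval

# receiver-based stream: ZooKeeper quorum, consumer group, and 2 receiver threads for topic "test"
kvs = KafkaUtils.createStream(ssc, "192.168.8.94:2181", "spark-streaming-consumer", {"test": 2})
kvs.count().pprint()                          # print the message count of every batch

ssc.start()
ssc.awaitTermination()

Submit it, for example, with:
/usr/local/spark/bin/spark-submit --jars spark-streaming-kafka-0-8-assembly_2.11-2.0.2.jar kafka_stream_count.py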