Main contents

I. Overview

II. Cluster installation

III. Integrating Kafka with Storm

IV. Integrating Kafka, Storm and HDFS

V. Writing a Kafka client to replace kafka-console-producer (unfinished; for now, log files sent over by others are used)

Parts of this article are adapted from: http://shiyanjun.cn/archives/934.html

I. Overview

(1) Machine environment

4 machines in total:

nn01 dn01 dn02 dn03  192.168.169.91-94

Storm cluster: all four machines, with nn01 as nimbus and ui and the others as supervisors

Kafka cluster: dn01 dn02 dn03

ZooKeeper cluster: nn01 dn01 dn02

(2) The three examples covered in this article

1. Integrate Kafka with Storm: data in Kafka is fed through a KafkaSpout into Storm for processing.

2. Integrate Kafka, Storm and HDFS: on top of the previous example, an HdfsBolt writes the results into HDFS.

3. Use a Kafka client to read messages from a log file and send them to Kafka, then continue with the flow of step 2. Note: the first two examples use kafka-console-producer to generate the messages.

Notes:

1. In fact, the three examples are different stages of the same example, split up only to make debugging easier; the final result is example 3. Some of the code in examples 1 and 2 is therefore not well polished, and is cleaned up in example 3.

2. To actually use the code, you only need to complete the cluster installation plus the corresponding example code. To use example 3, only parts II and V are required.

(3) About the code

See:



II. Cluster installation

The main steps are:

1. Install and start the ZooKeeper cluster

2. Install and start the Storm cluster

3. Install and start the Kafka cluster, create a topic and a producer, then create a consumer to verify that the cluster works

(1) Install and start the ZooKeeper cluster

See: installing a ZooKeeper cluster

(2) Install and start the Storm cluster

See: installing a Storm cluster

(3) Install and start the Kafka cluster, and create a topic and a producer

1. The Kafka cluster is built on 3 machines:

192.168.169.92 gdc-dn01-test
192.168.169.93 gdc-dn02-test
192.168.169.94 gdc-dn03-test

2. Instead of the ZooKeeper bundled with Kafka, the independently installed ZooKeeper cluster from part I is used; make sure it is running properly before installing Kafka.

3. First, prepare the Kafka installation files on gdc-dn01-test by running:

cd
wget http://mirrors.cnnic.cn/apache/kafka/0.8.2.1/kafka_2.10-0.8.2.1.tgz
tar xvzf kafka_2.10-0.8.2.1.tgz
mv kafka_2.10-0.8.2.1 kafka



4. Modify the configuration file kafka/config/server.properties as follows:





broker.id=0

zookeeper.connect=192.168.169.91:2181,192.168.169.92:2181,192.168.169.93:2181/kafka
Note that by default Kafka uses ZooKeeper's root path /, so Kafka's ZooKeeper data ends up scattered under the root. If other applications share the same ZooKeeper cluster, browsing the data in ZooKeeper becomes confusing, so it is strongly recommended to specify a chroot path directly in the zookeeper.connect setting:

zookeeper.connect=192.168.169.91:2181,192.168.169.92:2181,192.168.169.93:2181/kafka

In addition, the path /kafka must be created manually in ZooKeeper. Connect to any ZooKeeper server with:

cd ~/zookeeper
bin/zkCli.sh

Then create the chroot path from the ZooKeeper shell:

create /kafka ''

From then on, every time you connect to the Kafka cluster (using the --zookeeper option), the connection string must include the chroot path, as you will see later.


5. Next, copy the configured installation files to the dn02 and dn03 nodes:

scp -r /usr/local/kafka_2.10-0.8.2.1/ 192.168.169.93:/home/hadoop
scp -r /usr/local/kafka_2.10-0.8.2.1/ 192.168.169.94:/home/hadoop

6. Finally, on dn02 and dn03, edit kafka/config/server.properties as follows:

broker.id=1  # on dn02
broker.id=2  # on dn03

The id of each Broker must be unique across the whole Kafka cluster, so this setting has to be adjusted on every node. (On a single machine you can simulate a distributed Kafka cluster by running multiple Broker processes; the Broker ids must still be unique, and some directory settings must be changed as well.)
7. Start Kafka on each of the three nodes dn01, dn02 and dn03 by running:

bin/kafka-server-start.sh config/server.properties &

Check the logs or the process status to make sure the Kafka cluster started successfully.

8. Create a Topic named my-replicated-topic5 with 5 partitions and a replication factor of 3:

bin/kafka-topics.sh --create --zookeeper 192.168.169.91:2181,192.168.169.92:2181,192.168.169.93:2181/kafka --replication-factor 3 --partitions 5 --topic my-replicated-topic5

9. Inspect the newly created Topic:

bin/kafka-topics.sh --describe --zookeeper 192.168.169.91:2181,192.168.169.92:2181,192.168.169.93:2181/kafka --topic my-replicated-topic5

The output looks like this:

Topic:my-replicated-topic5  PartitionCount:5  ReplicationFactor:3  Configs:
 Topic: my-replicated-topic5  Partition: 0  Leader: 2  Replicas: 2,0,1  Isr: 2,0,1
 Topic: my-replicated-topic5  Partition: 1  Leader: 0  Replicas: 0,1,2  Isr: 0,1,2
 Topic: my-replicated-topic5  Partition: 2  Leader: 1  Replicas: 1,2,0  Isr: 1,2,0
 Topic: my-replicated-topic5  Partition: 3  Leader: 2  Replicas: 2,1,0  Isr: 2,1,0
 Topic: my-replicated-topic5  Partition: 4  Leader: 0  Replicas: 0,2,1  Isr: 0,2,1

The meaning of Leader, Replicas and Isr above:

Partition: the partition id
Leader   : the node responsible for reads and writes of that partition
Replicas : the list of nodes that replicate the log of that partition
Isr      : the "in-sync" replicas, i.e. the currently alive subset of replicas that are eligible to become Leader

10. We can use the bin/kafka-console-producer.sh and bin/kafka-console-consumer.sh scripts shipped with Kafka to verify how messages are published and consumed.

11. In one terminal, start a Producer that writes messages into the my-replicated-topic5 Topic created above:

bin/kafka-console-producer.sh --broker-list 192.168.169.92:9092,192.168.169.93:9092,192.168.169.94:9092 --topic my-replicated-topic5

12. In another terminal, start a Consumer that subscribes to the messages produced into my-replicated-topic5:

bin/kafka-console-consumer.sh --zookeeper 192.168.169.91:2181,192.168.169.92:2181,192.168.169.93:2181/kafka --from-beginning --topic my-replicated-topic5

Now you can type lines of text in the Producer terminal and see them appear in the Consumer terminal.

You can also use Kafka's Producer and Consumer Java APIs to implement the message production and consumption logic in code.
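As an illustration of that Java API, and as a sketch of the approach intended for part V (reading a log file and producing its lines into Kafka instead of typing them into kafka-console-producer), here is a minimal producer using the old Java producer API bundled with kafka_2.10 0.8.2.1. The class name LogFileProducer and the default file name are illustrative assumptions, not part of the original project:

package com.ljh.demo.stormkafkademo;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class LogFileProducer {

    public static void main(String[] args) throws Exception {
        // Brokers of the Kafka cluster built above
        Properties props = new Properties();
        props.put("metadata.broker.list",
                "192.168.169.92:9092,192.168.169.93:9092,192.168.169.94:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");
        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

        String topic = args.length > 0 ? args[0] : "my-replicated-topic5";
        String logFile = args.length > 1 ? args[1] : "app.log"; // illustrative default

        // Read the log file line by line and send each line as one Kafka message
        BufferedReader reader = new BufferedReader(new FileReader(logFile));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                producer.send(new KeyedMessage<String, String>(topic, line));
            }
        } finally {
            reader.close();
            producer.close();
        }
    }
}

It could live in the same Maven project as the topologies below (the kafka_2.10 dependency is already in the pom.xml) and be run with the topic name and log file path as arguments.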

III. Integrating Kafka with Storm

Main steps:

1. Create the storm-kafka topology

(1) Create the Maven project and pom.xml

(2) MyKafkaTopology.java

(3) Submit the topology to the Storm cluster

2. Type content into the producer and watch the output

(1) Creating the storm-kafka topology

Messages can reach the Kafka message broker in many ways, for example collected by Flume and then routed and staged in Kafka before a real-time computation framework such as Storm analyzes them. In that case we need to read the Kafka messages in a Storm Spout and hand them to downstream Bolt components for processing. In fact, the apache-storm-0.9.4 release already ships with an external Kafka integration module, storm-kafka, which can be used directly.

1. Create a Maven project in Eclipse with the following pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.ljh.demo</groupId>
  <artifactId>stormkafkademo</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>stormkafkademo</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-core</artifactId>
      <version>0.9.4</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-kafka</artifactId>
      <version>0.9.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.10</artifactId>
      <version>0.8.2.1</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.zookeeper</groupId>
          <artifactId>zookeeper</artifactId>
        </exclusion>
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</project>


2. Develop a simple WordCount example that reads the subscribed message lines from Kafka, splits them on whitespace into words, and then counts word frequencies. The Topology implementation is shown below:


package com.ljh.demo.stormkafkademo;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class MyKafkaTopology {

    public static class KafkaWordSplitter extends BaseRichBolt {

        private static final Log LOG = LogFactory.getLog(KafkaWordSplitter.class);
        private static final long serialVersionUID = 886149197481637894L;
        private OutputCollector collector;

        public void prepare(Map stormConf, TopologyContext context,
                OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple input) {
            String line = input.getString(0);
            LOG.info("RECV[kafka -> splitter] " + line);
            String[] words = line.split("\\s+");
            for (String word : words) {
                LOG.info("EMIT[splitter -> counter] " + word);
                collector.emit(input, new Values(word, 1));
            }
            collector.ack(input);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static class WordCounter extends BaseRichBolt {

        private static final Log LOG = LogFactory.getLog(WordCounter.class);
        private static final long serialVersionUID = 886149197481637894L;
        private OutputCollector collector;
        private Map<String, AtomicInteger> counterMap;

        public void prepare(Map stormConf, TopologyContext context,
                OutputCollector collector) {
            this.collector = collector;
            this.counterMap = new HashMap<String, AtomicInteger>();
        }

        public void execute(Tuple input) {
            String word = input.getString(0);
            int count = input.getInteger(1);
            LOG.info("RECV[splitter -> counter] " + word + " : " + count);
            AtomicInteger ai = this.counterMap.get(word);
            if (ai == null) {
                ai = new AtomicInteger();
                this.counterMap.put(word, ai);
            }
            ai.addAndGet(count);
            collector.ack(input);
            LOG.info("CHECK statistics map: " + this.counterMap);
        }

        @Override
        public void cleanup() {
            LOG.info("The final result:");
            Iterator<Entry<String, AtomicInteger>> iter = this.counterMap.entrySet().iterator();
            while (iter.hasNext()) {
                Entry<String, AtomicInteger> entry = iter.next();
                LOG.info(entry.getKey() + "\t:\t" + entry.getValue().get());
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException, InterruptedException {
        String zks = "192.168.169.91:2181,192.168.169.92:2181,192.168.169.93:2181";
        String topic = "my-replicated-topic5";
        String zkRoot = "/storm"; // default zookeeper root configuration for storm
        String id = "word";

        BrokerHosts brokerHosts = new ZkHosts(zks, "/kafka/brokers");
        SpoutConfig spoutConf = new SpoutConfig(brokerHosts, topic, zkRoot, id);
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
        spoutConf.forceFromStart = false;
        spoutConf.zkServers = Arrays.asList(new String[] {"192.168.169.91", "192.168.169.92", "192.168.169.93"});
        spoutConf.zkPort = 2181;

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-reader", new KafkaSpout(spoutConf), 5); // the Topic was created with 5 partitions, so the spout parallelism is set to 5
        builder.setBolt("word-splitter", new KafkaWordSplitter(), 2).shuffleGrouping("kafka-reader");
        builder.setBolt("word-counter", new WordCounter()).fieldsGrouping("word-splitter", new Fields("word"));

        Config conf = new Config();

        String name = MyKafkaTopology.class.getSimpleName();
        System.out.println("test");
        if (args != null && args.length > 0) {
            // Nimbus host name passed from command line
            conf.put(Config.NIMBUS_HOST, args[0]);
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(name, conf, builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology(name, conf, builder.createTopology());
            Thread.sleep(60000);
            cluster.shutdown();
        }
    }
}




3. Submit the Topology we developed to the Storm cluster:

storm jar stormkafkademo-0.0.1-SNAPSHOT.jar com.ljh.demo.stormkafkademo.MyKafkaTopology 192.168.169.91

(2) Checking the topology's output

You can monitor the Topology through the log files (under the logs/ directory) or through the Storm UI. If the program runs without errors, you can generate messages with the Kafka Producer used earlier and watch the Storm Topology receive and process them in real time.

In the Topology code above there is one key configuration object, SpoutConfig, with the following property:

spoutConf.forceFromStart = false;

This setting controls whether, after the Topology stops because of a failure and is later restarted, it reads the subscribed Kafka Topic from the very beginning. If forceFromStart=true, tuples that were already processed are processed again; otherwise processing resumes from the last recorded position, so the Topic data in Kafka is not reprocessed. The progress is recorded on the data-source side.
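For example, to deliberately reprocess the whole Topic on the next deployment, the same SpoutConfig field can simply be flipped (a minimal sketch using only the field already shown above):

// Read the subscribed Topic from the beginning on the next run;
// previously processed tuples will be handled again.
spoutConf.forceFromStart = true;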














IV. Integrating Kafka, Storm and HDFS

Main steps:
1. Create the topic and start the producer
2. WordSplitterBolt.java
3. WordCounterBolt.java
4. KafkaStormHdfsTopology.java
5. pom.xml
6. Build and submit the topology to the Storm cluster
7. Type content into kafka-console-producer and check the results

Note: this example uses the cluster set up in part II.

1. Create the topic and start the producer

bin/kafka-topics.sh --create --zookeeper 192.168.169.91:2181,192.168.169.92:2181,192.168.169.93:2181/kafka --replication-factor 3 --partitions 5 --topic test3

bin/kafka-console-producer.sh --broker-list 192.168.169.92:9092,192.168.169.93:9092,192.168.169.94:9092 --topic test3

2. WordSplitterBolt.java

package com.ljh.demo.stormkafkademo;

import java.util.Map;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

/*
 * Splits the data received from Kafka on whitespace.
 */
public class WordSplitterBolt extends BaseRichBolt {

    private static final Log LOG = LogFactory.getLog(WordSplitterBolt.class);
    private static final long serialVersionUID = 65437289133090921L;
    private OutputCollector collector;

    public void prepare(Map stormConf, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        String line = input.getString(0);
        LOG.info("RECV[kafka -> splitter] " + line);
        String[] words = line.split("\\s+");
        for (String word : words) {
            LOG.info("EMIT[splitter -> counter] " + word);
            // collector.emit(new Values(word, 1));
            collector.emit(input, new Values(word, 1));
        }
        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

3. WordCounterBolt.java

package com.ljh.demo.stormkafkademo.bolt;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordCounterBolt extends BaseRichBolt {

    private static final Log LOG = LogFactory.getLog(WordCounterBolt.class);
    private static final long serialVersionUID = 8557689093276234L;
    private OutputCollector collector;
    private Map<String, AtomicInteger> counterMap;

    public void prepare(Map stormConf, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
        this.counterMap = new HashMap<String, AtomicInteger>();
    }

    public void execute(Tuple input) {
        String word = input.getString(0);
        int count = input.getInteger(1);
        LOG.info("RECV[splitter -> counter] " + word + " : " + count);
        AtomicInteger ai = this.counterMap.get(word);
        if (ai == null) {
            ai = new AtomicInteger();
            this.counterMap.put(word, ai);
        }
        ai.addAndGet(count);
        // emit the running count downstream (to the hdfs-bolt), then ack the input
        collector.emit(input, new Values(word, ai));
        collector.ack(input);
        LOG.info("CHECK statistics map: " + this.counterMap);
    }

    // Called when the topology is killed; mainly useful for local debugging.
    @Override
    public void cleanup() {
        LOG.info("The final result:");
        Iterator<Entry<String, AtomicInteger>> iter = this.counterMap
                .entrySet().iterator();
        while (iter.hasNext()) {
            Entry<String, AtomicInteger> entry = iter.next();
            LOG.info(entry.getKey() + "\t:\t" + entry.getValue().get());
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}

4. KafkaStormHdfsTopology.java

package com.ljh.demo.stormkafkademo;

import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.storm.hdfs.bolt.HdfsBolt;
import org.apache.storm.hdfs.bolt.format.DefaultFileNameFormat;
import org.apache.storm.hdfs.bolt.format.DelimitedRecordFormat;
import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.format.RecordFormat;
import org.apache.storm.hdfs.bolt.rotation.FileRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.TimedRotationPolicy;
import org.apache.storm.hdfs.bolt.rotation.TimedRotationPolicy.TimeUnit;
import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
import org.apache.storm.hdfs.bolt.sync.SyncPolicy;

import com.ljh.demo.stormkafkademo.bolt.WordCounterBolt;

import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

/*
 * This class does the following:
 * 1. Read messages from Kafka via a KafkaSpout.
 * 2. Process the messages in Storm: (1) WordSplitterBolt splits them, (2) WordCounterBolt counts them.
 * 3. Write the results to HDFS via an HdfsBolt.
 */
public class KafkaStormHdfsTopology {

    private static final String h1 = "192.168.169.91";
    private static final String h2 = "192.168.169.92";
    private static final String h3 = "192.168.169.93";

    // args[0]: nimbus hostname; args[1]: topology name; args[2]: topic name. All three are optional.
    public static void main(String[] args) throws AlreadyAliveException,
            InvalidTopologyException, InterruptedException {

        String zks = h1 + ":2181," + h2 + ":2181," + h3 + ":2181";
        String topic = null;

        // topic name
        if (args.length <= 2) {
            topic = "kafka-storm-hdfs-demo";
        } else {
            topic = args[2];
        }
        // default location of Storm's state in ZooKeeper
        String zkRoot = "/storm";
        String id = "word";

        // 1. Configure the spout that reads messages from Kafka.
        // By default the brokers live under /brokers; for easier management, all
        // Kafka-related data has been placed under /kafka.
        BrokerHosts brokerHosts = new ZkHosts(zks, "/kafka/brokers");
        SpoutConfig spoutConf = new SpoutConfig(brokerHosts, topic, zkRoot, id);
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());
        spoutConf.forceFromStart = false;
        spoutConf.zkServers = Arrays.asList(new String[] { h1, h2, h3 });
        spoutConf.zkPort = 2181;

        // 2. Configure the hdfs-bolt that writes the results to HDFS.
        RecordFormat format = new DelimitedRecordFormat()
                .withFieldDelimiter("\t");
        // sync to HDFS after every 1000 tuples
        SyncPolicy syncPolicy = new CountSyncPolicy(1000);
        // rotate files every minute; either 1 minute or 1000 tuples triggers a flush
        FileRotationPolicy rotationPolicy = new TimedRotationPolicy(1.0f,
                TimeUnit.MINUTES);
        // file name format
        FileNameFormat fileNameFormat = new DefaultFileNameFormat()
                .withPath("/storm/").withPrefix("app_").withExtension(".log");
        HdfsBolt hdfsBolt = new HdfsBolt()
                .withFsUrl("hdfs://gdc-nn01-test:9000")
                .withFileNameFormat(fileNameFormat).withRecordFormat(format)
                .withRotationPolicy(rotationPolicy).withSyncPolicy(syncPolicy);

        // 3. Build the topology.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-reader", new KafkaSpout(spoutConf), 5); // the Topic was created with 5 partitions, so the spout parallelism is set to 5
        builder.setBolt("word-splitter", new WordSplitterBolt(), 2)
                .shuffleGrouping("kafka-reader");
        builder.setBolt("word-counter", new WordCounterBolt()).fieldsGrouping(
                "word-splitter", new Fields("word"));
        builder.setBolt("hdfs-bolt", hdfsBolt, 2).shuffleGrouping(
                "word-counter");

        // 4. Launch the topology.
        Config conf = new Config();
        String topologyName = null;
        if (args.length <= 1) {
            topologyName = KafkaStormHdfsTopology.class.getSimpleName();
        } else {
            topologyName = args[1];
        }
        String nimbusHost = null;
        if (args.length == 0) {
            nimbusHost = h1;
        } else {
            nimbusHost = args[0];
        }

        System.out.println("test");

        if (args != null && args.length > 0) {
            conf.put(Config.NIMBUS_HOST, nimbusHost);
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(topologyName, conf,
                    builder.createTopology());
        } else {
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology(topologyName, conf, builder.createTopology());
            Thread.sleep(60000);
            cluster.shutdown();
        }
    }
}

5. pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.ljh.demo</groupId>
  <artifactId>stormkafkademo</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>stormkafkademo</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-core</artifactId>
      <version>0.9.4</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-kafka</artifactId>
      <version>0.9.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-hdfs</artifactId>
      <version>0.9.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.10</artifactId>
      <version>0.8.2.1</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.zookeeper</groupId>
          <artifactId>zookeeper</artifactId>
        </exclusion>
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</project>

6. Build the jar and submit the topology to the Storm cluster:

storm jar stormkafkademo-0.0.1-SNAPSHOT.jar com.ljh.demo.stormkafkademo.KafkaStormHdfsTopology 192.168.169.91 ksh test3


7. Type content into kafka-console-producer and check the results

(1) In the Storm UI at 192.168.169.91:8080, check the emitted/transferred counts and whether any errors are reported.

(2) Check the contents written to HDFS:

hadoop fs -ls /storm, then cat the corresponding files

Sample output:

 as      292 

 possible.       64 

 reasons,        32 

 tend    32 

 skip    32 

 OS      63 

 packaged        32 

 it      64 

 a       63 

 to      224 

 try     98 

 put     32 

 in      32 

 OS      64 

 hierarchy,      32 

 can     32 


VI. Some issues

1. When configuring Kafka, if you created the node with "create /kafka" in ZooKeeper, then when integrating Kafka with Storm, new ZkHosts(zks) must be changed to new ZkHosts(zks, "/kafka/brokers"); otherwise you get java.lang.RuntimeException: java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /brokers/topics/my-replicated-topic5/partitions.

The storm-kafka plugin's default Kafka zk_path is:

public class ZkHosts implements BrokerHosts {
    private static final String DEFAULT_ZK_PATH = "/brokers";

2. If the following error appears, the stored offset is out of range; the simplest fix is to start over with a new topic:

ERROR [KafkaApi-3] Error when processing fetch request for partition [xxxxx,1] offset 112394 from consumer with correlation id 0 (kafka.server.KafkaApis)
kafka.common.OffsetOutOfRangeException: Request for offset 112394 but we only have log segments in the range 0 to 665.

3. When watching the state in the web UI after producing messages, the page lags a little behind; there is probably a buffer (it seems to commit once roughly 20 bytes have accumulated), so try submitting a large batch of words at once.

4. When a topic does not exist, or its node is not stored in the default location, the following exception appears:

java.lang.RuntimeException: java.lang.RuntimeException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /kafka/brokers/topics/mytest/partitions at storm.kafka.Dynam




5. About the logs

When running the program for the first time, all kinds of errors can appear, and most of them show up in the logs. In this example the logs to focus on are:

(1) The worker logs on the supervisors, under $STORM_HOME/logs. If the cluster itself is fine but a particular topology fails, the problem can usually be found in these worker logs. The most common errors are ClassNotFoundException and NoClassDefFoundError, both caused by missing jars; copy the missing jars into $STORM_HOME/lib.

(2) The logs on nimbus, under $STORM_HOME/logs, mainly for watching the state of the whole cluster. There are 4 files: access.log, metrics.log, nimbus.log and ui.log.

(3) The Kafka logs, under $KAFKA_HOME/logs, for checking whether Kafka is running normally.

6. About emit and transferred (adapted from http://www.reader8.cn/jiaocheng/20120801/2057699.html)

The difference between emitted and transferred in the Storm UI: at first the emitted and transferred numbers shown in the Storm UI were not obvious to me, so I searched the storm-user list and found others with the same confusion; nathan gave a detailed answer:

The number in the emitted column is the number of times OutputCollector's emit method was called.

The number in the transferred column is the number of tuples actually transferred to a next task.

If a bolt A emits tuples to a bolt B using an all grouping (every task must receive them) and bolt B runs 5 tasks, the transferred count will be 5 times the emitted count.

If a bolt A performs an emit but nothing subscribes to that stream, transferred will be 0.

There was also a discussion about the relationship between the emitted counts of spouts and bolts, which cleared up more of my confusion: some bolts do not emit any tuple in their execute method, yet the Storm UI still shows a non-zero emitted value. That is mainly because they call ack, and ack emits an ack tuple to the system's default acker bolt. So when a tuple is emitted anchored, emitted generally also includes the tuples sent to the acker bolt.

In addition, the two emit variants collector.emit(new Values(xxx)) and collector.emit(tuple, new Values(xxx)) also affect the emitted and transferred values of downstream bolts: with the former (unanchored, so the acker is no longer used for verification) those downstream values end up as 0.
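To make the anchored vs. unanchored distinction concrete, here is a minimal bolt sketch (the class name EmitDemoBolt is purely illustrative and not part of this project) showing the two emit forms discussed above:

package com.ljh.demo.stormkafkademo.bolt;

import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class EmitDemoBolt extends BaseRichBolt {

    private static final long serialVersionUID = 1L;
    private OutputCollector collector;

    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        String word = input.getString(0);

        // Anchored emit: the new tuple is anchored to "input", so it belongs to the
        // input's tuple tree; the acker tracks it (ack tuples also add to emitted)
        // and a downstream failure makes the spout replay the source tuple.
        collector.emit(input, new Values(word, 1));

        // Unanchored emit (shown for contrast): not tied to any input tuple, so the
        // acker is not involved and downstream failures are not replayed.
        // collector.emit(new Values(word, 1));

        collector.ack(input);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}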

7. The following exception appears in Kafka:

[2015-06-09 17:03:05,094] ERROR [KafkaApi-0] Error processing ProducerRequest with correlation id 616 from client kafka-client on partition [test3,0] (kafka.server.KafkaApis)
kafka.common.MessageSizeTooLargeException: Message size is 2211366 bytes which exceeds the maximum configured message size of 1000012.

The reason is that by default the cluster only accepts messages of roughly 1 MB; if a client sends a larger message in one go, this exception is thrown.

Add the following parameters to server.properties:

message.max.bytes=1000000000
replica.fetch.max.bytes=1073741824

and the following parameter to consumer.properties:

fetch.message.max.bytes=1073741824

Then restart the Kafka processes; with the values above, messages of up to roughly 1 GB can now be accepted.


8. Changing the default location of Kafka's data:

In server.properties, change

log.dir=/tmp/kafka-logs

to

log.dir=/home/data/kafka

9. Deleting a topic in Kafka:

bin/kafka-topics.sh --zookeeper 192.168.169.91:2181,192.168.169.92:2181,192.168.169.93:2181/kafka --delete --topic test2

10. delete.topic.enable=true

This defaults to false, meaning a delete topic request only marks the topic for deletion and does not actually delete it.