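I previously wrote "Compiling Phoenix against cdh570". The build tested fine in the development environment, but a problem showed up during release testing: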

Every attempt to create SYSTEM.CATALOG failed, and each time the region for SYSTEM.CATALOG ended up stuck in the "regions in transition" state.


After going through the logs carefully, I found the cause. The build produces a separate jar for each module, for example:

phoenix-server-4.7.1-HBase-1.2-cdh-SNAPSHOT.jar

phoenix-core-4.7.1-HBase-1.2-cdh-SNAPSHOT.jar

Besides these, there are also bundled jars under phoenix-assembly:
phoenix-4.7.1-HBase-1.2-cdh-SNAPSHOT-server.jar
phoenix-4.7.1-HBase-1.2-cdh-SNAPSHOT-client.jar

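The names differ only slightly, so when deploying I copied only the per-module jar phoenix-server-4.7.1-HBase-1.2-cdh-SNAPSHOT.jar. Many of the jars it depends on could then not be found, which caused the errors.
The detailed troubleshooting process follows.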
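Since the two kinds of jar are easy to mix up, a small name check can help before copying anything into HBase's lib directory. The following is a minimal sketch, not part of Phoenix: the function name classify_phoenix_jar and both patterns are assumptions derived only from the jar names listed above.

```python
import re

# Hypothetical helper, not part of Phoenix: classify a jar by file name only.
# Assumed patterns, inferred from the jar names in this post:
#   phoenix-<version>-server.jar / -client.jar -> bundled jar from phoenix-assembly
#   phoenix-<module>-<version>.jar             -> per-module jar without dependencies
BUNDLED = re.compile(r"^phoenix-\d.*-(server|client)\.jar$")
PER_MODULE = re.compile(r"^phoenix-[a-z]+-\d.*\.jar$")


def classify_phoenix_jar(file_name):
    """Return 'bundled', 'per-module', or 'unknown' for a jar file name."""
    if BUNDLED.match(file_name):
        return "bundled"
    if PER_MODULE.match(file_name):
        return "per-module"
    return "unknown"
```

The check looks only at the file name, so it says nothing about what is actually inside the jar.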

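1. First I checked the hbase-regionserver log. I was testing with a single region, so the relevant entries were easy to locate. I started from the point where the regionserver begins creating SYSTEM.CATALOG; the log shows several coprocessors being registered, and nothing is wrong at this stage.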

2017-08-03 19:12:49,356 INFO  [PriorityRpcServer.handler=3,queue=1,port=60020] regionserver.RSRpcServices: Open SYSTEM.CATALOG,,1501758768631.b2d6448c35c7ab874e8c653d0a1391bc.
2017-08-03 19:12:49,372 DEBUG [RS_OPEN_REGION-T-163:60020-0] zookeeper.ZKAssign: regionserver:60020-0x25d9cb3c31f0870, quorum=T-162:2181,T-163:2181,T-164:2181, baseZNode=/hbase Transitioning b2d6448c35c7ab874e8c653d0a1391bc from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2017-08-03 19:12:49,376 DEBUG [RS_OPEN_REGION-T-163:60020-0] zookeeper.ZKAssign: regionserver:60020-0x25d9cb3c31f0870, quorum=T-162:2181,T-163:2181,T-164:2181, baseZNode=/hbase Transitioned node b2d6448c35c7ab874e8c653d0a1391bc from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
2017-08-03 19:12:49,376 DEBUG [RS_OPEN_REGION-T-163:60020-0] regionserver.HRegion: Opening region: {ENCODED => b2d6448c35c7ab874e8c653d0a1391bc, NAME => 'SYSTEM.CATALOG,,1501758768631.b2d6448c35c7ab874e8c653d0a1391bc.', STARTKEY => '', ENDKEY => ''}
2017-08-03 19:12:49,377 DEBUG [RS_OPEN_REGION-T-163:60020-0] regionserver.HRegion: Registered coprocessor service: region=SYSTEM.CATALOG,,1501758768631.b2d6448c35c7ab874e8c653d0a1391bc. service=AuthenticationService
2017-08-03 19:12:49,377 INFO  [RS_OPEN_REGION-T-163:60020-0] coprocessor.CoprocessorHost: System coprocessor org.apache.hadoop.hbase.security.token.TokenProvider was loaded successfully with priority (536870911).
2017-08-03 19:12:49,377 DEBUG [RS_OPEN_REGION-T-163:60020-0] regionserver.HRegion: Registered coprocessor service: region=SYSTEM.CATALOG,,1501758768631.b2d6448c35c7ab874e8c653d0a1391bc. service=AccessControlService
2017-08-03 19:12:49,377 INFO  [RS_OPEN_REGION-T-163:60020-0] access.AccessController: A minimum HFile version of 3 is required to persist cell ACLs. Consider setting hfile.format.version accordingly.
2017-08-03 19:12:49,378 INFO  [RS_OPEN_REGION-T-163:60020-0] coprocessor.CoprocessorHost: System coprocessor org.apache.hadoop.hbase.security.access.AccessController was loaded successfully with priority (536870912).
2017-08-03 19:12:49,378 DEBUG [RS_OPEN_REGION-T-163:60020-0] coprocessor.CoprocessorHost: Loading coprocessor class org.apache.phoenix.coprocessor.MetaDataEndpointImpl with path null and priority 805306366
2017-08-03 19:12:49,472 DEBUG [RS_OPEN_REGION-T-163:60020-0] regionserver.HRegion: Registered coprocessor service: region=SYSTEM.CATALOG,,1501758768631.b2d6448c35c7ab874e8c653d0a1391bc. service=MetaDataService

2. Following the log further down, I found an error saying that a jar could not be found:
ABORTING region server t-163,60020,1501754933827: The coprocessor org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver threw java.io.IOException: No jar path specified for org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver
java.io.IOException: No jar path specified for org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver

Looking at the Phoenix source code, I found that this class lives in the phoenix-core module, so I added phoenix-core-4.7.1-HBase-1.2-cdh-SNAPSHOT.jar to HBase's lib directory.
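Reading the source is one way to find out which module a class belongs to; another is to look inside the jar itself, since a jar is a zip archive. A minimal sketch using Python's standard zipfile module follows. The function name jar_contains_class is hypothetical, and the check only tells you whether the class file is in that jar, not whether its dependencies are.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar holds the .class entry for a fully qualified class name."""
    # A class such as a.b.C is stored in the archive as a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

For example, calling it with a jar path and org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver shows whether that jar can supply the coprocessor class named in the error.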

2017-08-03 18:09:10,569 FATAL [RS_OPEN_REGION-T-163:60020-0] regionserver.HRegionServer: ABORTING region server t-163,60020,1501754933827: The coprocessor org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver threw java.io.IOException: No jar path specified for org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver
java.io.IOException: No jar path specified for org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver
        at org.apache.hadoop.hbase.coprocessor.CoprocessorHost.load(CoprocessorHost.java:190)
        at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.loadTableCoprocessors(RegionCoprocessorHost.java:364)
        at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(RegionCoprocessorHost.java:226)
        at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:720)
        at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:628)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:6128)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6432)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6404)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6360)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6311)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

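3. After adding that jar, it still failed, this time with java.lang.ClassNotFoundException: co.cask.tephra.Transaction. This class belongs to the tephra project that Phoenix depends on; because of code changes, I had compiled tephra separately as well.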

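These two errors made it clear that the jar I had supplied did not include its dependencies, which is why errors kept appearing at runtime. Rereading the official deployment documentation, I found that the jar it says to deploy is phoenix-[version]-server.jar, not phoenix-server-[version].jar.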

So the jar that actually needs to be deployed is phoenix-4.7.1-HBase-1.2-cdh-SNAPSHOT-server.jar.

2017-08-03 19:12:51,537 ERROR [RS_OPEN_REGION-T-163:60020-0] handler.OpenRegionHandler: Failed open of region=SYSTEM.CATALOG,,1501758768631.b2d6448c35c7ab874e8c653d0a1391bc., starting to roll back the global memstore size.
java.lang.IllegalStateException: Could not instantiate a region instance.
        at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:6131)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6432)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6404)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6360)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6311)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
        at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hbase.regionserver.HRegion.newHRegion(HRegion.java:6128)
        ... 10 more
Caused by: java.lang.NoClassDefFoundError: co/cask/tephra/Transaction
        at java.lang.Class.getDeclaredMethods0(Native Method)
        at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
        at java.lang.Class.getDeclaredMethod(Class.java:2128)
        at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.<init>(RegionCoprocessorHost.java:244)
        at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:720)
        at org.apache.hadoop.hbase.regionserver.HRegion.<init>(HRegion.java:628)
        ... 15 more
Caused by: java.lang.ClassNotFoundException: co.cask.tephra.Transaction
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 21 more
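Both failures in this post name the missing class directly in the log, so the class names can be pulled out mechanically instead of reading each stack trace. The sketch below is only an illustration: the function name missing_classes is hypothetical, and the pattern assumes the two message formats seen in these logs ("No jar path specified for ..." and "ClassNotFoundException: ...").

```python
import re

# Matches the two message formats that appear in the logs quoted in this post.
MISSING = re.compile(
    r"(?:ClassNotFoundException: |No jar path specified for )([\w.$]+)"
)


def missing_classes(log_text):
    """Return the distinct class names reported missing, in first-seen order."""
    found = []
    for match in MISSING.finditer(log_text):
        name = match.group(1)
        if name not in found:
            found.append(name)
    return found
```

Each name returned is a class that some jar on the regionserver classpath still has to provide.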

4.
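Since I had both the official cdh570 build and my own snapshot build, I ran compatibility tests on both.
The results matched my earlier reasoning.

Both the official jars and my own snapshot jars pass the tests, and they can be mixed with each other (one mixed combination is still to be tested). Detailed results: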

server | client | compatibility test
phoenix-4.7.0-clabs-phoenix1.3.0-server.jar | phoenix-4.7.0-clabs-phoenix1.3.0-client.jar | pass
phoenix-4.7.1-HBase-1.2-cdh-SNAPSHOT-server.jar | phoenix-4.7.1-HBase-1.2-cdh-SNAPSHOT-client.jar | pass
phoenix-4.7.1-HBase-1.2-cdh-SNAPSHOT-server.jar | phoenix-4.7.0-clabs-phoenix1.3.0-client.jar | todo
phoenix-4.7.0-clabs-phoenix1.3.0-server.jar | phoenix-4.7.1-HBase-1.2-cdh-SNAPSHOT-client.jar | pass

5.
Oddly, the original development test environment, 171, had the following three jars deployed:
phoenix-4.8.2-HBase-1.2-server.jar
phoenix-server-4.7.1-HBase-1.2-cdh-SNAPSHOT.jar
phoenix-core-4.7.1-HBase-1.2-cdh-SNAPSHOT.jar

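This combination of three jars did not pass in my later tests, yet the 171 cluster kept working, even across restarts.
My initial guesses at the cause:
1. Phoenix tables had already been created earlier using other jars.
2. The bundled 4.8.2 jar resolved the dependency problem. When I removed the bundled jar and tested again, 171 stopped working as well.

Another odd thing about the 171 cluster is that I deployed all three jars only on node 171; the other nodes had only phoenix-server-4.7.1-HBase-1.2-cdh-SNAPSHOT.jar and phoenix-core-4.7.1-HBase-1.2-cdh-SNAPSHOT.jar.
This setup also tested fine. My initial guess is that HBase's dynamic balancing placed everything that needs Phoenix on 171.
I checked the HBase web UI on port 60010 to verify this guess.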