HBase single-node capacity limit

Reposted

墨染心语 2024-12-02 17:02:56


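Nutch originated in the Apache Lucene project. It is an extensible, scalable open-source web crawler that is maintained as two code bases:

1. Nutch 1.x: a mature, production-ready crawler. It relies on Apache Hadoop data structures and offers fine-grained configuration. Hadoop is well suited to batch processing.
2. Nutch 2.x: a newer alternative directly inspired by 1.x. It differs from 1.x in one key area, storage: it uses Apache Gora™ to handle object-to-datastore persistence mapping, which decouples storage from any specific underlying data store. This allows a very flexible model for storing everything (fetch time, status, content, parsed text, outlinks, inlinks, and so on) and for integrating with many NoSQL storage solutions.

The main difference between the two versions is therefore the underlying storage. 1.x is built on Hadoop and stores its data in HDFS, while 2.x uses Apache Gora to access data stores such as HBase, Accumulo, Cassandra, MySQL, DataFileAvroStore, and AvroStore.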

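I. Installation environment

Hardware: virtual machine
Operating system: CentOS 6.4, 64-bit
IP: 10.51.121.10
Hostname: datanode-4
Install user: root
JDK: JDK 1.7 or later is required. Here jdk1.7.0_75 is installed, with its environment variables configured.
HBase: the official Nutch 2.3 documentation lists HBase 0.94.14 as the matching version. Installing HBase 0.94.14 is described in a separate document.
Solr: solr-4.10.3 is integrated here.
Ant: Ant must be installed, with its environment variables configured, before installing Nutch 2.3. Here apache-ant-1.9.4 is used.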
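Before going further, it can help to confirm that the required tools are actually on the PATH. The following is a minimal sketch, not part of Nutch or Ant. The function name missing_tools is hypothetical, and it only checks that a command exists, not which version it is.

```shell
#!/bin/sh
# Print the name of every listed command that is not on PATH, one per line.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example use before starting the build (commented out so that sourcing
# this file has no side effects):
# missing=$(missing_tools java ant tar)
# [ -z "$missing" ] || echo "Missing tools: $missing" >&2
```

Versions still need to be checked by hand, for example with java -version and ant -version.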

II. Installing Nutch 2.3

1. Download apache-nutch-2.3-src.tar.gz from http://nutch.apache.org/downloads.html.
2. Extract it with #tar -zxvf apache-nutch-2.3-src.tar.gz. Here the archive is extracted under /root/nutch, so the Nutch installation directory is /root/nutch/apache-nutch-2.3. $NUTCH_HOME below refers to that directory.
3. Add the following configuration to $NUTCH_HOME/conf/nutch-site.xml:

<property>
 <name>storage.data.store.class</name>
 <value>org.apache.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>

4. In $NUTCH_HOME/ivy/ivy.xml, find the following dependency:

<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />

Make sure this dependency is active: if it is commented out, remove the comment markers. Note: rev=0.5 corresponds to HBase 0.94.14, and rev=0.3 corresponds to HBase 0.90.4.

5. Add the following line to $NUTCH_HOME/conf/gora.properties:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
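If this step is scripted, the line should be appended only when it is not already there, so that repeated runs do not duplicate it. A minimal sketch follows, assuming a POSIX shell. The helper name ensure_line is hypothetical.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# Creates the file if it does not exist yet.
ensure_line() {
    file=$1
    line=$2
    touch "$file" || return 1
    # If the file is non-empty and does not end in a newline, add one first
    # so the new line is not glued onto the previous one.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        printf '\n' >> "$file"
    fi
    # -x: match whole lines only, -F: treat the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example use (commented out so that sourcing this file changes nothing):
# ensure_line "$NUTCH_HOME/conf/gora.properties" \
#     "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore"
```

Because the match is on the whole line, a commented-out copy of the property does not count as present.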

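6. In $NUTCH_HOME, run #ant runtime to build. If the output ends with BUILD SUCCESSFUL, the build succeeded. If it ends with BUILD FAILED, the build failed and the cause has to be found in the build output. After a successful build, two new directories exist: build and runtime.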

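III. Installing Solr

1. Download the matching Solr version from http://archive.apache.org/dist/lucene/solr/. Here solr-4.10.3.tgz is used. Note: use a Solr 4 release. The author did not get the integration with Solr 5 to work.
2. Extract it into /root/nutch with #tar -zxvf solr-4.10.3.tgz. The Solr installation directory is then /root/nutch/solr-4.10.3. ${SOLR_HOME} below refers to that directory.
3. Change to /root/nutch/solr-4.10.3/example and run #java -jar start.jar to start Solr.
4. Open http://localhost:8983/solr/ in a browser.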

IV. Integrating Solr

1. Back up the schema.xml of the Solr example first.

#mv ${SOLR_HOME}/example/solr/collection1/conf/schema.xml ${SOLR_HOME}/example/solr/collection1/conf/schema.xml.bak

2. Copy the schema.xml from the Nutch runtime directory into the Solr example directory. Here ${NUTCH_RUNTIME_HOME} refers to /root/nutch/apache-nutch-2.3/runtime/local.

#cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml  ${SOLR_HOME}/example/solr/collection1/conf/
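The backup and copy can be wrapped in one helper that refuses to overwrite an existing backup, which matters if the steps are run a second time. A minimal sketch follows. The function name install_schema is hypothetical, and the paths in the example are the ones used in this article.

```shell
#!/bin/sh
# Back up the destination schema (if present) to <dest>.bak, then copy the
# new schema into place. Fails rather than overwriting an existing backup.
install_schema() {
    src=$1
    dest=$2
    [ -f "$src" ] || { echo "source schema not found: $src" >&2; return 1; }
    if [ -f "$dest" ]; then
        if [ -e "$dest.bak" ]; then
            echo "backup already exists: $dest.bak" >&2
            return 1
        fi
        mv "$dest" "$dest.bak" || return 1
    fi
    cp "$src" "$dest"
}

# Example use (commented out so that sourcing this file changes nothing):
# install_schema "${NUTCH_RUNTIME_HOME}/conf/schema.xml" \
#     "${SOLR_HOME}/example/solr/collection1/conf/schema.xml"
```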

3. With Solr 4.10.3, the author did not modify schema.xml at all. Integrating an older Solr version may require some changes. For details, see http://wiki.apache.org/nutch/NutchTutorial

4. Stop the running Solr instance, then change to /root/nutch/solr-4.10.3/example and run #java -jar start.jar to restart Solr.

V. Configuring the crawler

1. Configure the agent. Add the agent name to /root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml:

<property>
 <name>http.agent.name</name>
 <value>JustinNutchAgent</value>
</property>

2. Configure the indexing plugins. Add the following to /root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>

3. Configure the URLs to crawl. Create a myUrls directory under /root/nutch/apache-nutch-2.3/runtime/local, add a seed.txt file to it, and put the following lines in that file:

http://www.10jqka.com.cn/
http://www.cnblogs.com/

These seeds crawl the 同花顺 (10jqka) and 博客园 (cnblogs) sites.
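Creating the seed directory and file can be sketched as a small shell helper. The function name write_seeds is hypothetical. It overwrites any existing seed.txt in the target directory.

```shell
#!/bin/sh
# Create a seed directory and write one URL per line into seed.txt.
# Usage: write_seeds <dir> <url> [<url> ...]
write_seeds() {
    dir=$1
    [ $# -ge 2 ] || { echo "usage: write_seeds <dir> <url>..." >&2; return 1; }
    shift
    mkdir -p "$dir" || return 1
    # printf repeats the format for each remaining argument.
    printf '%s\n' "$@" > "$dir/seed.txt"
}

# Example use from /root/nutch/apache-nutch-2.3/runtime/local (commented
# out so that sourcing this file changes nothing):
# write_seeds ./myUrls http://www.10jqka.com.cn/ http://www.cnblogs.com/
```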

VI. Starting the crawl

1. Start HBase first: change to /root/hadoop/hbase-0.94.14/ and run the #./bin/start-hbase.sh script.
2. Start Solr: change to /root/nutch/solr-4.10.3/example and run #java -jar start.jar.
3. Start Nutch and begin crawling: change to /root/nutch/apache-nutch-2.3/runtime/local and run the ./bin/crawl command.

Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
<seedDir>: directory containing the seed files
<crawlID>: ID of the crawl job
<solrUrl>: Solr URL used for indexing and searching
<numberOfRounds>: number of iterations, i.e. the crawl depth

./bin/crawl ./myUrls/ mycrawl1 http://localhost:8983/solr/ 2
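A crawl can run for a long time, so it may be worth validating the arguments before launching it. The sketch below assumes the four-argument form used in this article (seed directory, crawl ID, Solr URL, number of rounds). The function name check_crawl_args is hypothetical and is not part of the crawl script.

```shell
#!/bin/sh
# Validate the four arguments that would be passed to ./bin/crawl:
# <seedDir> <crawlID> <solrUrl> <numberOfRounds>
check_crawl_args() {
    [ $# -eq 4 ] || { echo "expected 4 arguments, got $#" >&2; return 1; }
    seed_dir=$1
    rounds=$4
    [ -d "$seed_dir" ] || { echo "seed directory not found: $seed_dir" >&2; return 1; }
    # Reject empty values and anything containing a non-digit character.
    case $rounds in
        ''|*[!0-9]*) echo "rounds must be a positive integer: $rounds" >&2; return 1 ;;
    esac
    [ "$rounds" -gt 0 ] || { echo "rounds must be greater than 0" >&2; return 1; }
}

# Example use (commented out so that sourcing this file starts nothing):
# check_crawl_args ./myUrls/ mycrawl1 http://localhost:8983/solr/ 2 &&
#     ./bin/crawl ./myUrls/ mycrawl1 http://localhost:8983/solr/ 2
```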

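When the crawl finishes, open the Solr query page at http://10.51.121.10:8983/solr/#/collection1/query and run a query. If the crawled pages are returned, the crawl succeeded, Solr has indexed the content, and the crawled content can be searched from Solr.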

VII. Common errors

1. The Fetch job fails with the following error:

# ./bin/nutch fetch -all -crawlId mycrawl1 -threads 5

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/root/nutch/apache-nutch-2.3/runtime/local/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/root/nutch/apache-nutch-2.3/runtime/local/lib/slf4j-log4j12-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
FetcherJob: starting at 2015-03-09 17:05:53
FetcherJob: fetching all
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.FetcherJob.checkConfiguration(FetcherJob.java:273)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:159)
    at org.apache.nutch.fetcher.FetcherJob.fetch(FetcherJob.java:254)
    at org.apache.nutch.fetcher.FetcherJob.run(FetcherJob.java:317)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.FetcherJob.main(FetcherJob.java:324)

Cause: http.agent.name is not configured in /root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml. Add the following to that file:

<property>
 <name>http.agent.name</name>
 <value>JustinNutchAgent</value>
</property>
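A quick pre-flight check can catch this before a fetch is started. The sketch below only looks for the literal <name> element in the file. It is a plain text match, not an XML parser, so it does not notice a property that is commented out or has an empty value. The function name has_property is hypothetical.

```shell
#!/bin/sh
# Succeed if the given Hadoop-style configuration file contains a
# <name>...</name> element for the given property name.
has_property() {
    file=$1
    name=$2
    # -F: fixed-string match, so dots in the property name are literal.
    grep -qF -- "<name>$name</name>" "$file"
}

# Example use (commented out so that sourcing this file prints nothing):
# conf=/root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml
# has_property "$conf" http.agent.name || echo "http.agent.name is not set" >&2
```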

2. Running # ./bin/nutch index solr.server.url=http://localhost:8983/solr -all -crawlId mycrawl1 produces the following output:

IndexingJob: starting
No IndexWriters activated - check your configuration

IndexingJob: done.

Cause: the indexing plugin has to be configured in nutch-site.xml. Add the following to /root/nutch/apache-nutch-2.3/runtime/local/conf/nutch-site.xml:

<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
</property>
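Since the value of plugin.includes is one long list separated by | characters, it is easy to leave out indexer-solr without noticing. The sketch below checks whether a plain plugin name appears as a top-level entry in such a value. It splits on every | character, so it only works for names that are not written with a parenthesised group such as index-(basic|more). The function name lists_plugin is hypothetical.

```shell
#!/bin/sh
# Succeed if <plugin> appears as a whole, top-level entry in a
# '|'-separated plugin.includes value.
lists_plugin() {
    value=$1
    plugin=$2
    # Put each '|'-separated entry on its own line, then require an exact
    # whole-line match.
    printf '%s\n' "$value" | tr '|' '\n' | grep -qxF -- "$plugin"
}

# Example use (commented out so that sourcing this file prints nothing):
# v='protocol-httpclient|urlfilter-regex|index-(basic|more)|indexer-solr'
# lists_plugin "$v" indexer-solr || echo "indexer-solr is not enabled" >&2
```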
