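1. Distributing a compressed HDFS archive (-cacheArchive)
Requirement: a wordcount that counts only a specified set of words (the, and, had, ...). The word list is stored on HDFS as a compressed archive that may contain several files, and it is distributed to the tasks with -cacheArchive.
-cacheArchive hdfs://host:port/path/to/file.tar.gz#linkname.tar.gz # This option caches the archive on the compute nodes and unpacks it there. The streaming program reaches the unpacked contents through ./linkname.tar.gz, which is a directory despite its name.
Approach: the reducer needs no changes. The mapper needs an extra function (or module) that reads the files inside the archive, and the streaming job must be launched with -cacheArchive pointing at the archive on HDFS.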
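To make the hdfs://...#linkname syntax concrete, here is a minimal Python 3 sketch that splits such a URI into the HDFS path and the link name the task will see. The helper name split_cache_uri is hypothetical, and the fallback to the file's base name when no fragment is given is an assumption of this sketch, not something this article verifies.

```python
# Python 3 sketch: split a -cacheArchive URI into the HDFS path and the link name.
from urllib.parse import urlsplit

def split_cache_uri(uri):
    """Return (hdfs_path, link_name) for a URI of the form hdfs://host:port/path#link."""
    parts = urlsplit(uri)
    # Assumption of this sketch: without a "#link" fragment, use the file's base name.
    link_name = parts.fragment or parts.path.rsplit("/", 1)[-1]
    return parts.path, link_name

hdfs_path, link_name = split_cache_uri(
    "hdfs://master:9000/cache_file/wordwhite.tar.gz#whc.tar.gz")
print(hdfs_path)   # /cache_file/wordwhite.tar.gz
print(link_name)   # whc.tar.gz
```

The link name is what the mapper receives as its command-line argument in the sections that follow.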
1.1 Streaming command format (-cacheArchive)
$HADOOP_HOME/bin/hadoop jar hadoop-streaming.jar \
-jobconf mapred.job.name="streaming_cacheArchive_demo" \
-jobconf mapred.job.priority=3 \
-jobconf mapred.compress.map.output=true \
-jobconf mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-jobconf mapred.output.compress=true \
-jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-input /input/ \
-output /output/ \
-mapper "python mapper.py whc.tar.gz" \
-reducer "python reducer.py" \
-cacheArchive "hdfs://master:9000/cache_file/wordwhite.tar.gz#whc.tar.gz" \
-file ./mapper.py \
-file ./reducer.py
1.2 Mapper program
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import os
import os.path
import sys

def getCachefile(filename):
    # Collect every file under the unpacked archive directory.
    filelist = []
    if os.path.isdir(filename):
        for root, dirs, files in os.walk(filename):
            for name in files:
                filelist.append(os.path.join(root, name))
    return filelist

def readWordwhite(filename):
    # Load the whitelist words, one per line, from all files in the archive.
    wordset = set()
    for cachefile in getCachefile(filename):
        with open(cachefile, 'r') as fd:
            for line in fd:
                word = line.strip()
                if word:
                    wordset.add(word)
    return wordset

def mapper(filename):
    wordset = readWordwhite(filename)
    for line in sys.stdin:
        for word in line.split():
            if word in wordset:
                # A single parenthesized argument prints the same on Python 2 and 3.
                print("%s\t%s" % (word, 1))

if __name__ == "__main__":
    if len(sys.argv) > 1:
        mapper(sys.argv[1])
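The filtering step of the mapper can be checked without a cluster. The sketch below isolates that logic in a generator; the function name filter_words and the sample input and whitelist are made up for illustration and are not part of the original job.

```python
def filter_words(lines, wordset):
    """Yield one "word<TAB>1" record for every whitelisted word in the input lines."""
    for line in lines:
        for word in line.split():
            if word in wordset:
                yield "%s\t%s" % (word, 1)

# Hypothetical sample data, only for this local check.
sample_lines = ["the man had a house", "and this is the end"]
sample_whitelist = {"the", "and", "had", "this"}

records = list(filter_words(sample_lines, sample_whitelist))
print("\n".join(records))
```

Words outside the whitelist (man, a, house, is, end) produce no output, so the reducer only ever sees whitelisted keys.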
1.3 Reducer program
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import sys

def reducer():
    currentword = None
    wordsum = 0
    for line in sys.stdin:
        wordlist = line.strip().split('\t')
        if len(wordlist) < 2:
            continue
        word = wordlist[0].strip()
        wordvalue = wordlist[1].strip()
        if currentword is None:
            currentword = word
        if currentword != word:
            # Key changed: emit the finished word and reset the counter.
            print("%s\t%s" % (currentword, wordsum))
            currentword = word
            wordsum = 0
        wordsum += int(wordvalue)
    # Emit the last word, unless there was no input at all.
    if currentword is not None:
        print("%s\t%s" % (currentword, wordsum))

if __name__ == "__main__":
    reducer()
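The reducer relies on its input arriving sorted by key, so that all records for one word are adjacent. As an alternative sketch (not the code this job actually ran), the same aggregation can be written with itertools.groupby; the function name reduce_sorted and the sample records are hypothetical.

```python
from itertools import groupby

def reduce_sorted(lines):
    """Sum the counts of consecutive records that share the same word."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    valid = (p for p in pairs if len(p) >= 2)   # skip malformed records
    for word, group in groupby(valid, key=lambda p: p[0]):
        yield word, sum(int(p[1]) for p in group)

# Hypothetical, already sorted mapper output.
sorted_records = ["and\t1", "and\t1", "had\t1", "the\t1", "the\t1", "the\t1"]
totals = list(reduce_sorted(sorted_records))
print(totals)   # [('and', 2), ('had', 1), ('the', 3)]
```

Like the hand-written loop, this only works because groupby merges adjacent equal keys; unsorted input would yield the same word more than once.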
1.4 Uploading wordwhite.tar.gz
$ ls -R wordwhite
wordwhite:
wordwhite01 wordwhite02 wordwhite03
$ cat wordwhite/wordwhite0*
have
and
had
the
in
this
or
this
to
$ tar zcf wordwhite.tar.gz wordwhite
$ hadoop fs -put wordwhite.tar.gz hdfs://localhost:9000/input/cachefile/
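For readers who prefer to build the archive from a script, the following sketch does the equivalent of the tar zcf step with Python's tarfile module and lists what ended up inside. How the nine words are split across the three files is a guess made for this example, and the upload to HDFS is not covered here.

```python
import os
import tarfile
import tempfile

def make_archive(src_dir, archive_path):
    """Pack src_dir into a gzip-compressed tar, keeping the top-level directory name."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))

def list_files(archive_path):
    """Return the sorted names of the regular files stored in the archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())

# Hypothetical split of the whitelist across three files, built in a temp directory.
work = tempfile.mkdtemp()
src = os.path.join(work, "wordwhite")
os.mkdir(src)
for name, words in [("wordwhite01", "have\nand\nhad\n"),
                    ("wordwhite02", "the\nin\nthis\n"),
                    ("wordwhite03", "or\nthis\nto\n")]:
    with open(os.path.join(src, name), "w") as fd:
        fd.write(words)

archive = os.path.join(work, "wordwhite.tar.gz")
make_archive(src, archive)
members = list_files(archive)
print(members)
```

Because the files sit under a wordwhite/ directory inside the archive, the mapper has to walk the unpacked tree recursively, which is what getCachefile does with os.walk.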
1.5 The run_streaming script
#!/bin/bash
HADOOP_CMD="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.13.0.jar"
INPUT_FILE_PATH="/input/The_Man_of_Property"
OUTPUT_FILE_PATH="/output/wordcount/WordwhiteCacheArchiveFiletest"
$HADOOP_CMD fs -rmr -skipTrash $OUTPUT_FILE_PATH
$HADOOP_CMD jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH \
-output $OUTPUT_FILE_PATH \
-jobconf "mapred.job.name=wordcount_wordwhite_cacheArchivefile_demo" \
-mapper "python mapper.py WHF.gz" \
-reducer "python reducer.py" \
-cacheArchive "hdfs://localhost:9000/input/cachefile/wordwhite.tar.gz#WHF.gz" \
-file "./mapper.py" \
-file "./reducer.py"
1.6 Running the program
$ chmod +x run_streaming.sh
$ ./run_streaming.sh
rmr: DEPRECATED: Please use 'rm -r' instead.
Deleted /output/wordcount/WordwhiteCacheArchiveFiletest
18/02/01 17:57:00 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
18/02/01 17:57:00 WARN streaming.StreamJob: -cacheArchive option is deprecated, please use -archives instead.
18/02/01 17:57:00 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
18/02/01 17:57:00 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop-unjar211766205758273068/] [] /tmp/streamjob9043244899616176268.jar tmpDir=null
18/02/01 17:57:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/01 17:57:01 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/02/01 17:57:03 INFO mapred.FileInputFormat: Total input paths to process : 1
18/02/01 17:57:03 INFO mapreduce.JobSubmitter: number of splits:2
18/02/01 17:57:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516345010544_0030
18/02/01 17:57:04 INFO impl.YarnClientImpl: Submitted application application_1516345010544_0030
18/02/01 17:57:04 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1516345010544_0030/
18/02/01 17:57:04 INFO mapreduce.Job: Running job: job_1516345010544_0030
18/02/01 17:57:11 INFO mapreduce.Job: Job job_1516345010544_0030 running in uber mode : false
18/02/01 17:57:11 INFO mapreduce.Job: map 0% reduce 0%
18/02/01 17:57:20 INFO mapreduce.Job: map 50% reduce 0%
18/02/01 17:57:21 INFO mapreduce.Job: map 100% reduce 0%
18/02/01 17:57:27 INFO mapreduce.Job: map 100% reduce 100%
18/02/01 17:57:28 INFO mapreduce.Job: Job job_1516345010544_0030 completed successfully
18/02/01 17:57:28 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=113911
FILE: Number of bytes written=664972
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=636501
HDFS: Number of bytes written=68
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=12584
Total time spent by all reduces in occupied slots (ms)=4425
Total time spent by all map tasks (ms)=12584
Total time spent by all reduce tasks (ms)=4425
Total vcore-milliseconds taken by all map tasks=12584
Total vcore-milliseconds taken by all reduce tasks=4425
Total megabyte-milliseconds taken by all map tasks=12886016
Total megabyte-milliseconds taken by all reduce tasks=4531200
Map-Reduce Framework
Map input records=2866
Map output records=14734
Map output bytes=84437
Map output materialized bytes=113917
Input split bytes=198
Combine input records=0
Combine output records=0
Reduce input groups=8
Reduce shuffle bytes=113917
Reduce input records=14734
Reduce output records=8
Spilled Records=29468
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=390
CPU time spent (ms)=3660
Physical memory (bytes) snapshot=713809920
Virtual memory (bytes) snapshot=8331399168
Total committed heap usage (bytes)=594018304
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=636303
File Output Format Counters
Bytes Written=68
18/02/01 17:57:28 INFO streaming.StreamJob: Output directory: /output/wordcount/WordwhiteCacheArchiveFiletest
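The warnings at the top of the log above say that -file, -cacheArchive and -jobconf are deprecated in favor of -files, -archives and -D, and that mapred.job.name has become mapreduce.job.name. The sketch below assembles an equivalent argument list with the newer options. It was not run against the cluster in this article, the helper name build_streaming_cmd is hypothetical, and the short "hadoop" and "hadoop-streaming.jar" values stand in for the full paths used in run_streaming.sh. Generic options have to come before the streaming-specific ones, which is why they are listed first.

```python
def build_streaming_cmd(hadoop_cmd, jar, job_name, archive_uri, files,
                        mapper, reducer, input_path, output_path):
    """Build the argv list; generic options (-D, -files, -archives) go first."""
    return [
        hadoop_cmd, "jar", jar,
        "-D", "mapreduce.job.name=" + job_name,
        "-files", ",".join(files),
        "-archives", archive_uri,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]

cmd = build_streaming_cmd(
    hadoop_cmd="hadoop",                       # placeholder for the full path
    jar="hadoop-streaming.jar",                # placeholder for the full path
    job_name="wordcount_wordwhite_cacheArchivefile_demo",
    archive_uri="hdfs://localhost:9000/input/cachefile/wordwhite.tar.gz#WHF.gz",
    files=["mapper.py", "reducer.py"],
    mapper="python mapper.py WHF.gz",
    reducer="python reducer.py",
    input_path="/input/The_Man_of_Property",
    output_path="/output/wordcount/WordwhiteCacheArchiveFiletest",
)
print(" ".join(cmd))
```

The list form can be handed to subprocess.call as is, which avoids shell quoting problems with the mapper and reducer strings.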
1.7 Viewing the results
$ hadoop fs -ls /output/wordcount/WordwhiteCacheArchiveFiletest
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-02-01 17:57 /output/wordcount/WordwhiteCacheArchiveFiletest/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 68 2018-02-01 17:57 /output/wordcount/WordwhiteCacheArchiveFiletest/part-00000
$ hadoop fs -text /output/wordcount/WordwhiteCacheArchiveFiletest/part-00000
and 2573
had 1526
have 350
in 1694
or 253
the 5144
this 412
to 2782
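The result can be cross-checked against the job counters from section 1.6. The sketch below re-enters the eight output lines by hand, assuming the tab separator that the reducer emits, and compares them with three counters from the log: Reduce output records=8, Map output records=14734 and HDFS bytes written=68.

```python
# Output of part-00000, re-entered by hand with the tab separator the reducer writes.
output_text = (
    "and\t2573\nhad\t1526\nhave\t350\nin\t1694\n"
    "or\t253\nthe\t5144\nthis\t412\nto\t2782\n"
)

counts = {}
for line in output_text.splitlines():
    word, value = line.split("\t")
    counts[word] = int(value)

total = sum(counts.values())                # should equal "Map output records"
size = len(output_text.encode("utf-8"))     # should equal "Bytes Written"
print(len(counts), total, size)   # 8 14734 68
```

Each mapper record carries a count of 1, so the sum of all word counts has to equal the number of map output records, and it does.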
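This completes the wordcount restricted to a specified word list, with the list distributed as a compressed archive from HDFS.
2. Hadoop streaming syntax reference
This article is reposted from 巴利奇's blog on 51CTO. Original link: http://blog.51cto.com/balich/2067858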