Example 1: time-series data (stock prices)
Given the following series of closing prices:
Time series   Date         Closing price
-----------------------------------------
1             2013-10-01   10
2             2013-10-02   18
3             2013-10-03   20
4             2013-10-04   30
5             2013-10-07   24
6             2013-10-08   33
7             2013-10-09   27
Compute the 3-day moving average:
Time series   Date         Moving average   How it is calculated
-----------------------------------------------------------------
1             2013-10-01   10.00            = (10)/(1)
2             2013-10-02   14.00            = (10+18)/(2)
3             2013-10-03   16.00            = (10+18+20)/(3)
4             2013-10-04   22.67            = (18+20+30)/(3)
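Continuing the same calculation for the remaining days of the series (plain arithmetic on the data above):
5             2013-10-07   24.67            = (20+30+24)/(3)
6             2013-10-08   29.00            = (30+24+33)/(3)
7             2013-10-09   28.00            = (24+33+27)/(3)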
Example 2: time-series data (URL visits)
Compute, over a given time window, the moving average of the number of unique visitors per day for each URL.
URL    Date         Unique visitors
------------------------------------
URL1   2013-10-01   400
URL1   2013-10-02   200
URL1   2013-10-03   300
URL1   2013-10-04   700
URL1   2013-10-05   800
URL2   2013-10-01   10
3-day moving average of the URL visits:
URL    Date         Moving average
------------------------------------
URL1   2013-10-01   400
URL1   2013-10-02   300
URL1   2013-10-03   300
URL1   2013-10-04   400
URL1   2013-10-05   600
URL2   2013-10-01   10
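Each moving-average value is simply the mean of the (up to) three most recent daily counts for the same URL, for example:
URL1 2013-10-02: (400+200)/2 = 300
URL1 2013-10-04: (200+300+700)/3 = 400
URL1 2013-10-05: (300+700+800)/3 = 600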
I. POJO moving-average solutions
Solution 1: using a queue
Maintain a queue whose size is bounded by the window size, together with a running sum.
For each incoming element, add its value to the sum and append it to the tail of the queue.
If the queue does not exceed the window size after the insertion, continue with the next element.
If the queue exceeds the window size after the insertion, remove the head element and subtract its value from the sum, so that the running sum slides along with the window.
...
To compute the moving average, as long as the queue is not empty: moving average = sum / queue size.
package yidongpingjun.pojo;
import java.util.Queue;
import java.util.LinkedList;
/**
* Simple moving average by using a queue data structure.
*
* @author Mahmoud Parsian
*
*/
public class SimpleMovingAverage {
private double sum = 0.0;
private final int period;
private final Queue<Double> window = new LinkedList<Double>();
public SimpleMovingAverage(int period) {
if (period < 1) {
throw new IllegalArgumentException("period must be > 0");
}
this.period = period;
}
public void addNewNumber(double number) {
sum += number;
window.add(number);
if (window.size() > period) {
sum -= window.remove();
}
}
public double getMovingAverage() {
if (window.isEmpty()) {
throw new IllegalArgumentException("average is undefined");
}
return sum / window.size();
}
}
Solution 2: using an array
Simulate the enqueue and dequeue operations with a plain array. Java's Queue used above is backed by a linked list, which is less efficient than direct array access.
A pointer-like variable is needed to keep track of the position of the head of the window.
package yidongpingjun.pojo;
/**
* Simple moving average by using an array data structure.
*
* @author Mahmoud Parsian
*
*/
public class SimpleMovingAverageUsingArray {
private double sum = 0.0;
private final int period;
private double[] window = null;
private int pointer = 0;
private int size = 0;
public SimpleMovingAverageUsingArray(int period) {
if (period < 1) {
throw new IllegalArgumentException("period must be > 0");
}
this.period = period;
window = new double[period];
}
public void addNewNumber(double number) {
sum += number;
if (size < period) {
window[pointer++] = number;
size++;
}
else {
// size = period (size cannot be > period)
pointer = pointer % period;
sum -= window[pointer];
window[pointer++] = number;
}
}
public double getMovingAverage() {
if (size == 0) {
throw new IllegalArgumentException("average is undefined");
}
return sum / size;
}
}
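The array-backed class is a drop-in replacement for the queue-backed one. As a hypothetical sanity check (the class name CompareImplementations below is mine, not part of the original post), feeding the same series to both implementations should print identical averages:
package yidongpingjun.pojo;
/**
 * Hypothetical sanity check: the queue-backed and array-backed
 * implementations maintain the same sliding window, so their
 * moving averages must match for every input.
 */
public class CompareImplementations {
    public static void main(String[] args) {
        double[] series = {10, 18, 20, 30, 24, 33, 27};
        SimpleMovingAverage queueSma = new SimpleMovingAverage(3);
        SimpleMovingAverageUsingArray arraySma = new SimpleMovingAverageUsingArray(3);
        for (double x : series) {
            queueSma.addNewNumber(x);
            arraySma.addNewNumber(x);
            // prints the same value twice for each element
            System.out.println(queueSma.getMovingAverage() + " == " + arraySma.getMovingAverage());
        }
    }
}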
Test program:
package yidongpingjun.pojo;
import org.apache.log4j.Logger;
import org.apache.log4j.BasicConfigurator;
/**
* Basic testing of Simple moving average.
*
* @author Mahmoud Parsian
*
*/
public class TestSimpleMovingAverage {
private static final Logger THE_LOGGER = Logger.getLogger(TestSimpleMovingAverage.class);
public static void main(String[] args) {
// The invocation of the BasicConfigurator.configure method
// creates a rather simple log4j setup. This method is hardwired
// to add to the root logger a ConsoleAppender.
BasicConfigurator.configure();
// time series 1 2 3 4 5 6 7
double[] testData = {10, 18, 20, 30, 24, 33, 27};
int[] allWindowSizes = {3, 4};
for (int windowSize : allWindowSizes) {
SimpleMovingAverage sma = new SimpleMovingAverage(windowSize);
THE_LOGGER.info("windowSize = " + windowSize);
for (double x : testData) {
sma.addNewNumber(x);
THE_LOGGER.info("Next number = " + x + ", SMA = " + sma.getMovingAverage());
}
THE_LOGGER.info("---");
}
}
}
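With the test data above, the logged moving averages work out to (rounded to two decimals where needed):
windowSize = 3: 10.0, 14.0, 16.0, 22.67, 24.67, 29.0, 28.0
windowSize = 4: 10.0, 14.0, 16.0, 19.5, 23.0, 26.75, 28.5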
II. MapReduce/Hadoop moving-average solutions
Input format: <name-as-string><,><date-as-timestamp><,><value-as-double>
GOOG,2004-11-04,184.70
GOOG,2004-11-03,191.67
GOOG,2004-11-02,194.87
AAPL,2013-10-09,486.59
AAPL,2013-10-08,480.94
AAPL,2013-10-07,487.75
AAPL,2013-10-04,483.03
AAPL,2013-10-03,483.41
IBM,2013-09-30,185.18
IBM,2013-09-27,186.92
IBM,2013-09-26,190.22
IBM,2013-09-25,189.47
GOOG,2013-07-19,896.60
GOOG,2013-07-18,910.68
GOOG,2013-07-17,918.55
Output format: <name-as-string><,><date-as-timestamp><,><moving-average-as-double>
All we need to do is group the records by stock symbol, sort each group by timestamp, and then apply the moving-average algorithm.
There are at least two ways to sort the time-series data:
Solution 1: sorting in memory
Define a data structure TimeSeriesData that pairs the date (timestamp) with the closing price (value).
The map phase turns each input line into a (name, TimeSeriesData) key-value pair.
In the reduce phase, all pairs with the same name reach the same reducer; the key is the name and the value is an unordered collection of TimeSeriesData, which is sorted by timestamp in memory.
The moving average is then computed over the sorted collection, producing key = stock symbol and value = (date, moving average) pairs that are written to the output file.
package yidongpingjun;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.text.SimpleDateFormat;
import org.apache.hadoop.io.Writable;
/**
*
* TimeSeriesData represents a pair of
* (time-series-timestamp, time-series-value).
*
* @author Mahmoud Parsian
*
*/
public class TimeSeriesData
implements Writable, Comparable<TimeSeriesData> {
private long timestamp;
private double value;
public static TimeSeriesData copy(TimeSeriesData tsd) {
return new TimeSeriesData(tsd.timestamp, tsd.value);
}
public TimeSeriesData(long timestamp, double value) {
set(timestamp, value);
}
public TimeSeriesData() {
}
public void set(long timestamp, double value) {
this.timestamp = timestamp;
this.value = value;
}
public long getTimestamp() {
return this.timestamp;
}
public double getValue() {
return this.value;
}
/**
* Deserializes the point from the underlying data.
* @param in a DataInput object to read the point from.
*/
public void readFields(DataInput in) throws IOException {
this.timestamp = in.readLong();
this.value = in.readDouble();
}
/**
* Convert a binary data into TimeSeriesData
*
* @param in A DataInput object to read from.
* @return A TimeSeriesData object
* @throws IOException
*/
public static TimeSeriesData read(DataInput in) throws IOException {
TimeSeriesData tsData = new TimeSeriesData();
tsData.readFields(in);
return tsData;
}
public String getDate() {
return DateUtil.getDateAsString(this.timestamp);
}
/**
* Creates a clone of this object
*/
public TimeSeriesData clone() {
return new TimeSeriesData(timestamp, value);
}
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(this.timestamp );
out.writeDouble(this.value );
}
/**
* Used in sorting the data in the reducer
*/
@Override
public int compareTo(TimeSeriesData data) {
if (this.timestamp < data.timestamp ) {
return -1;
}
else if (this.timestamp > data.timestamp ) {
return 1;
}
else {
return 0;
}
}
public String toString() {
return "("+timestamp+","+value+")";
}
}
package yidongpingjun.memorysort;
import java.util.Date;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.commons.lang.StringUtils;
import yidongpingjun.DateUtil;
import yidongpingjun.TimeSeriesData;
/***
*
* @author chenjie
* Input:
*GOOG,2004-11-04,184.70
GOOG,2004-11-03,191.67
GOOG,2004-11-02,194.87
AAPL,2013-10-09,486.59
AAPL,2013-10-08,480.94
AAPL,2013-10-07,487.75
AAPL,2013-10-04,483.03
AAPL,2013-10-03,483.41
IBM,2013-09-30,185.18
IBM,2013-09-27,186.92
IBM,2013-09-26,190.22
IBM,2013-09-25,189.47
GOOG,2013-07-19,896.60
GOOG,2013-07-18,910.68
GOOG,2013-07-17,918.55
*
*
*/
public class SortInMemory_MovingAverageMapper
extends Mapper<LongWritable, Text, Text, TimeSeriesData> {
private final Text reducerKey = new Text();
private final TimeSeriesData reducerValue = new TimeSeriesData();
/**
* value:GOOG,2004-11-04,184.70
*/
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String record = value.toString();
if ((record == null) || (record.length() == 0)) {
return;
}
String[] tokens = StringUtils.split(record.trim(), ",");
if (tokens.length == 3) {
Date date = DateUtil.getDate(tokens[1]);//2004-11-04,
if (date == null) {
return;
}
reducerKey.set(tokens[0]); // GOOG
reducerValue.set(date.getTime(), Double.parseDouble(tokens[2]));
context.write(reducerKey, reducerValue);
}
else {
// log as error, not enough tokens
}
}
}
package yidongpingjun.memorysort;
import java.io.IOException;
import java.util.List;
import java.util.ArrayList;
import java.util.Collections;
//
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Reducer.Context;
//
import yidongpingjun.DateUtil;
import yidongpingjun.TimeSeriesData;
public class SortInMemory_MovingAverageReducer
extends Reducer<Text, TimeSeriesData, Text, Text> {
int windowSize = 5; // default window size
/**
* will be run only once
* get parameters from Hadoop's configuration
*/
public void setup(Context context)
throws IOException, InterruptedException {
this.windowSize = context.getConfiguration().getInt("moving.average.window.size", 5);
System.out.println("setup(): key="+windowSize);
}
public void reduce(Text key, Iterable<TimeSeriesData> values, Context context)
throws IOException, InterruptedException {
System.out.println("reduce(): key="+key.toString());
// build the unsorted list of timeseries
List<TimeSeriesData> timeseries = new ArrayList<TimeSeriesData>();
for (TimeSeriesData tsData : values) {
TimeSeriesData copy = TimeSeriesData.copy(tsData);
timeseries.add(copy);
}
// sort the timeseries data in memory and
// apply moving average algorithm to sorted timeseries
Collections.sort(timeseries);
System.out.println("reduce(): timeseries="+timeseries.toString());
// calculate prefix sum
double sum = 0.0;
for (int i=0; i < windowSize-1; i++) {
sum += timeseries.get(i).getValue();
}
// now we have enough timeseries data to calculate moving average
Text outputValue = new Text(); // reuse object
for (int i = windowSize-1; i < timeseries.size(); i++) {
System.out.println("reduce(): key="+key.toString() + " i="+i);
sum += timeseries.get(i).getValue();
double movingAverage = sum / windowSize;
long timestamp = timeseries.get(i).getTimestamp();
outputValue.set(DateUtil.getDateAsString(timestamp) + "," + movingAverage);
// send output to HDFS
context.write(key, outputValue);
// prepare for next iteration
sum -= timeseries.get(i-windowSize+1).getValue();
}
} // reduce
}
Driver class:
package yidongpingjun.memorysort;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
//
import yidongpingjun.HadoopUtil;
import yidongpingjun.TimeSeriesData;
/**
* MapReduce job for moving averages of time series data
* by using in memory sort (without secondary sort).
*
* @author Mahmoud Parsian
*
*/
public class SortInMemory_MovingAverageDriver {
private static final String INPATH = "input/gupiao1.txt"; // input file path
private static final String OUTPATH = "output/gupiao1"; // output file path
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new String[3];
otherArgs[0] = "2";
otherArgs[1] = INPATH;
otherArgs[2] = OUTPATH;
if (otherArgs.length != 3) {
System.err.println("Usage: SortInMemory_MovingAverageDriver <window_size> <input> <output>");
System.exit(1);
}
System.out.println("args[0]: <window_size>="+otherArgs[0]);
System.out.println("args[1]: <input>="+otherArgs[1]);
System.out.println("args[2]: <output>="+otherArgs[2]);
Job job = new Job(conf, "SortInMemory_MovingAverageDriver");
// add jars to distributed cache
// HadoopUtil.addJarsToDistributedCache(job, "/lib/");
// set mapper/reducer
job.setMapperClass(SortInMemory_MovingAverageMapper.class);
job.setReducerClass(SortInMemory_MovingAverageReducer.class);
// define mapper's output key-value
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(TimeSeriesData.class);
// define reducer's output key-value
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// set window size for moving average calculation
int windowSize = Integer.parseInt(otherArgs[0]);
job.getConfiguration().setInt("moving.average.window.size", windowSize);
// define I/O
FileInputFormat.addInputPath(job, new Path(otherArgs[1]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Output:
AAPL   2013-10-04,483.22
AAPL   2013-10-07,485.39
AAPL   2013-10-08,484.345
AAPL   2013-10-09,483.765
GOOG   2004-11-03,193.26999999999998
GOOG   2004-11-04,188.18499999999997
GOOG   2013-07-17,551.625
GOOG   2013-07-18,914.615
GOOG   2013-07-19,903.6400000000001
IBM    2013-09-26,189.845
IBM    2013-09-27,188.57
IBM    2013-09-30,186.05
(The window size here is 2. Note that the value for GOOG on 2013-07-17 averages the last 2004 price with the first 2013 price, because the records are grouped only by stock symbol, not by year.)
Solution 2: sort with the MapReduce framework (secondary sort). The stock symbol and the timestamp form a composite key; records are grouped by stock symbol and sorted by stock symbol and timestamp.
Define a data structure TimeSeriesData that pairs the date (timestamp) with the closing price (value).
Define a data structure CompositeKey that combines the stock symbol and the timestamp into a composite key.
The mapper class SortByMRF_MovingAverageMapper maps each input record (symbol, date, closing price) to a key-value pair whose key is a CompositeKey and whose value is a TimeSeriesData.
Since both the key and the value are now custom types, how keys are partitioned, how they are sorted, and how they are grouped at the reducer must all be defined explicitly:
CompositeKeyComparator defines how keys are sorted: first by the stock symbol in the CompositeKey, then by the timestamp.
NaturalKeyPartitioner defines how keys are partitioned: by the stock symbol in the CompositeKey, so that all records with the same symbol reach the same reducer.
NaturalKeyGroupingComparator defines how keys are grouped: by the stock symbol in the CompositeKey.
SortByMRF_MovingAverageReducer defines the reduce step: for each CompositeKey, the values arrive as a collection of TimeSeriesData already sorted by timestamp, over which the moving average is computed.
(TimeSeriesData is identical to the class listed in Solution 1 above, so it is not repeated here.)
package yidongpingjun.secondarysort;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
//
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class CompositeKey implements WritableComparable<CompositeKey> {
// natural key is (name)
// composite key is a pair (name, timestamp)
private String name;
private long timestamp;
public CompositeKey(String name, long timestamp) {
set(name, timestamp);
}
public CompositeKey() {
}
public void set(String name, long timestamp) {
this.name = name;
this.timestamp = timestamp;
}
public String getName() {
return this.name;
}
public long getTimestamp() {
return this.timestamp;
}
@Override
public void readFields(DataInput in) throws IOException {
this.name = in.readUTF();
this.timestamp = in.readLong();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(this.name);
out.writeLong(this.timestamp);
}
@Override
public int compareTo(CompositeKey other) {
if (this.name.compareTo(other.name) != 0) {
return this.name.compareTo(other.name);
}
else if (this.timestamp != other.timestamp) {
return timestamp < other.timestamp ? -1 : 1;
}
else {
return 0;
}
}
public static class CompositeKeyComparator extends WritableComparator {
public CompositeKeyComparator() {
super(CompositeKey.class);
}
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
return compareBytes(b1, s1, l1, b2, s2, l2);
}
}
static { // register this comparator
WritableComparator.define(CompositeKey.class,
new CompositeKeyComparator());
}
}
package yidongpingjun.secondarysort;
import java.util.Date;
import java.io.IOException;
//
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.commons.lang.StringUtils;
//
import yidongpingjun.DateUtil;
import yidongpingjun.TimeSeriesData;
public class SortByMRF_MovingAverageMapper extends MapReduceBase
implements Mapper<LongWritable, Text, CompositeKey, TimeSeriesData> {
// reuse Hadoop's Writable objects
private final CompositeKey reducerKey = new CompositeKey();
private final TimeSeriesData reducerValue = new TimeSeriesData();
@Override
public void map(LongWritable inkey, Text value,
OutputCollector<CompositeKey, TimeSeriesData> output,
Reporter reporter) throws IOException {
String record = value.toString();
if ((record == null) || (record.length() == 0)) {
return;
}
String[] tokens = StringUtils.split(record, ",");
if (tokens.length == 3) {
// tokens[0] = name of timeseries as string
// tokens[1] = timestamp
// tokens[2] = value of timeseries as double
Date date = DateUtil.getDate(tokens[1]);
if (date == null) {
return;
}
long timestamp = date.getTime();
reducerKey.set(tokens[0], timestamp);
reducerValue.set(timestamp, Double.parseDouble(tokens[2]));
// emit key-value pair
output.collect(reducerKey, reducerValue);
}
else {
// log as error, not enough tokens
}
}
}
package yidongpingjun.secondarysort;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class CompositeKeyComparator extends WritableComparator {
protected CompositeKeyComparator() {
super(CompositeKey.class, true);
}
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
CompositeKey key1 = (CompositeKey) w1;
CompositeKey key2 = (CompositeKey) w2;
int comparison = key1.getName().compareTo(key2.getName());
if (comparison == 0) {
// names are equal here
if (key1.getTimestamp() == key2.getTimestamp()) {
return 0;
} else if (key1.getTimestamp() < key2.getTimestamp()) {
return -1;
} else {
return 1;
}
}
else {
return comparison;
}
}
}
package yidongpingjun.secondarysort;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;
import yidongpingjun.TimeSeriesData;
public class NaturalKeyPartitioner implements
Partitioner<CompositeKey, TimeSeriesData> {
@Override
public int getPartition(CompositeKey key,
TimeSeriesData value,
int numberOfPartitions) {
return Math.abs((int) (hash(key.getName()) % numberOfPartitions));
}
@Override
public void configure(JobConf jobconf) {
}
/**
* adapted from String.hashCode()
*/
static long hash(String str) {
long h = 1125899906842597L; // prime
int length = str.length();
for (int i = 0; i < length; i++) {
h = 31 * h + str.charAt(i);
}
return h;
}
}
package yidongpingjun.secondarysort;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
public class NaturalKeyGroupingComparator extends WritableComparator {
protected NaturalKeyGroupingComparator() {
super(CompositeKey.class, true);
}
@Override
public int compare(WritableComparable w1, WritableComparable w2) {
CompositeKey key1 = (CompositeKey) w1;
CompositeKey key2 = (CompositeKey) w2;
return key1.getName().compareTo(key2.getName());
}
}
package yidongpingjun.secondarysort;
import java.util.Iterator;
import java.io.IOException;
//
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.JobConf;
//
import yidongpingjun.DateUtil;
import yidongpingjun.TimeSeriesData;
public class SortByMRF_MovingAverageReducer extends MapReduceBase
implements Reducer<CompositeKey, TimeSeriesData, Text, Text> {
int windowSize = 5; // default window size
/**
* will be run only once get parameters from Hadoop's configuration
*/
@Override
public void configure(JobConf jobconf) {
this.windowSize = jobconf.getInt("moving.average.window.size", 5);
}
@Override
public void reduce(CompositeKey key,
Iterator<TimeSeriesData> values,
OutputCollector<Text, Text> output,
Reporter reporter)
throws IOException {
// note that values are sorted.
// apply moving average algorithm to sorted timeseries
Text outputKey = new Text();
Text outputValue = new Text();
MovingAverage ma = new MovingAverage(this.windowSize);
while (values.hasNext()) {
TimeSeriesData data = values.next();
ma.addNewNumber(data.getValue());
double movingAverage = ma.getMovingAverage();
long timestamp = data.getTimestamp();
String dateAsString = DateUtil.getDateAsString(timestamp);
//THE_LOGGER.info("Next number = " + x + ", SMA = " + sma.getMovingAverage());
outputValue.set(dateAsString + "," + movingAverage);
outputKey.set(key.getName());
output.collect(outputKey, outputValue);
}
//
}
}
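The reducer above relies on a MovingAverage helper class that is not listed in this post. A minimal sketch, assuming it behaves like the queue-based SimpleMovingAverage from the POJO section (i.e. it emits an average as soon as the first value arrives):
package yidongpingjun.secondarysort;
import java.util.LinkedList;
import java.util.Queue;
/**
 * Sketch of the MovingAverage helper used by SortByMRF_MovingAverageReducer;
 * assumed to mirror the queue-based SimpleMovingAverage shown earlier.
 */
public class MovingAverage {
    private double sum = 0.0;
    private final int period;
    private final Queue<Double> window = new LinkedList<Double>();
    public MovingAverage(int period) {
        if (period < 1) {
            throw new IllegalArgumentException("period must be > 0");
        }
        this.period = period;
    }
    public void addNewNumber(double number) {
        sum += number;
        window.add(number);
        if (window.size() > period) {
            // drop the oldest value so the sum slides with the window
            sum -= window.remove();
        }
    }
    public double getMovingAverage() {
        if (window.isEmpty()) {
            throw new IllegalArgumentException("average is undefined");
        }
        return sum / window.size();
    }
}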
package yidongpingjun.secondarysort;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobClient;
//
import yidongpingjun.HadoopUtil;
import yidongpingjun.TimeSeriesData;
public class SortByMRF_MovingAverageDriver {
private static final String INPATH = "input/gupiao1.txt"; // input file path
private static final String OUTPATH = "output/gupiao2"; // output file path
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
JobConf jobconf = new JobConf(conf, SortByMRF_MovingAverageDriver.class);
jobconf.setJobName("SortByMRF_MovingAverageDriver");
String[] otherArgs = new String[3];
otherArgs[0] = "2";
otherArgs[1] = INPATH;
otherArgs[2] = OUTPATH;
if (otherArgs.length != 3) {
System.err.println("Usage: SortByMRF_MovingAverageDriver <window_size> <input> <output>");
System.exit(1);
}
// add jars to distributed cache
// HadoopUtil.addJarsToDistributedCache(conf, "/lib/");
// set mapper/reducer
jobconf.setMapperClass(SortByMRF_MovingAverageMapper.class);
jobconf.setReducerClass(SortByMRF_MovingAverageReducer.class);
// define mapper's output key-value
jobconf.setMapOutputKeyClass(CompositeKey.class);
jobconf.setMapOutputValueClass(TimeSeriesData.class);
// define reducer's output key-value
jobconf.setOutputKeyClass(Text.class);
jobconf.setOutputValueClass(Text.class);
// set window size for moving average calculation
int windowSize = Integer.parseInt(otherArgs[0]);
jobconf.setInt("moving.average.window.size", windowSize);
// define I/O
FileInputFormat.setInputPaths(jobconf, new Path(otherArgs[1]));
FileOutputFormat.setOutputPath(jobconf, new Path(otherArgs[2]));
jobconf.setInputFormat(TextInputFormat.class);
jobconf.setOutputFormat(TextOutputFormat.class);
jobconf.setCompressMapOutput(true);
// the following 3 setting are needed for "secondary sorting"
// Partitioner decides which mapper output goes to which reducer
// based on mapper output key. In general, different key is in
// different group (Iterator at the reducer side). But sometimes,
// we want different key in the same group. This is the time for
// Output Value Grouping Comparator, which is used to group mapper
// output (similar to group by condition in SQL). The Output Key
// Comparator is used during sort stage for the mapper output key.
jobconf.setPartitionerClass(NaturalKeyPartitioner.class);
jobconf.setOutputKeyComparatorClass(CompositeKeyComparator.class);
jobconf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
JobClient.runJob(jobconf);
}
}
package yidongpingjun;
import java.text.SimpleDateFormat;
import java.util.Date;
public class DateUtil {
static final String DATE_FORMAT = "yyyy-MM-dd";
static final SimpleDateFormat SIMPLE_DATE_FORMAT =
new SimpleDateFormat(DATE_FORMAT);
/**
* Returns the Date from a given dateAsString
*/
public static Date getDate(String dateAsString) {
try {
return SIMPLE_DATE_FORMAT.parse(dateAsString);
}
catch(Exception e) {
return null;
}
}
/**
* Returns the number of milliseconds since January 1, 1970,
* 00:00:00 GMT represented by this Date object.
*/
public static long getDateAsMilliSeconds(Date date) throws Exception {
return date.getTime();
}
/**
* Returns the number of milliseconds since January 1, 1970,
* 00:00:00 GMT represented by this Date object.
*/
public static long getDateAsMilliSeconds(String dateAsString) throws Exception {
Date date = getDate(dateAsString);
return date.getTime();
}
public static String getDateAsString(long timestamp) {
return SIMPLE_DATE_FORMAT.format(timestamp);
}
}
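For reference, a tiny (hypothetical) round trip through DateUtil:
long ts = DateUtil.getDate("2013-10-01").getTime(); // parses "yyyy-MM-dd" into epoch milliseconds
String s = DateUtil.getDateAsString(ts);            // formats back to "2013-10-01"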
package yidongpingjun;
import java.util.List;
import java.util.ArrayList;
import java.util.Arrays;
import java.io.IOException;
//
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.filecache.DistributedCache;
public class HadoopUtil {
/**
* Add all jar files to HDFS's distributed cache
*
* @param job job which will be run
* @param hdfsJarDirectory a directory which has all required jar files
*/
public static void addJarsToDistributedCache(Job job,
String hdfsJarDirectory)
throws IOException {
if (job == null) {
return;
}
addJarsToDistributedCache(job.getConfiguration(), hdfsJarDirectory);
}
/**
* Add all jar files to HDFS's distributed cache
*
* @param conf the job configuration
* @param hdfsJarDirectory a directory which has all required jar files
*/
public static void addJarsToDistributedCache(Configuration conf,
String hdfsJarDirectory)
throws IOException {
if (conf == null) {
return;
}
FileSystem fs = FileSystem.get(conf);
List<FileStatus> jars = getDirectoryListing(hdfsJarDirectory, fs);
for (FileStatus jar : jars) {
Path jarPath = jar.getPath();
DistributedCache.addFileToClassPath(jarPath, conf, fs);
}
}
/**
* Get list of files from a given HDFS directory
* @param directory an HDFS directory name
* @param fs an HDFS FileSystem
*/
public static List<FileStatus> getDirectoryListing(String directory,
FileSystem fs)
throws IOException {
Path dir = new Path(directory);
FileStatus[] fstatus = fs.listStatus(dir);
return Arrays.asList(fstatus);
}
public static List<String> listDirectoryAsListOfString(String directory,
FileSystem fs)
throws IOException {
Path path = new Path(directory);
FileStatus fstatus[] = fs.listStatus(path);
List<String> listing = new ArrayList<String>();
for (FileStatus f: fstatus) {
listing.add(f.getPath().toUri().getPath());
}
return listing;
}
/**
* Return true if the HDFS path exists; otherwise return false.
*
*/
public static boolean pathExists(Path path, FileSystem fs) {
if (path == null) {
return false;
}
try {
return fs.exists(path);
}
catch(Exception e) {
return false;
}
}
}