The simplest wordcount program:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
With in-mapper combining added:
https://jinwooooo.github.io/jinwooooo-blog/hadoop-in-mapper-combiner/
Hadoop pitfall:
Text cannot be dropped in for String as the HashMap key in the pattern below. The problem is not how Text compares (its equals/hashCode are content-based), but that the mapper reuses one mutable Text instance: every word.set(...) also mutates the key object already stored inside the HashMap, so the stored bucket hashes go stale and lookups become unreliable. A fresh new Text(token) per token would avoid the corruption, but plain String keys are simpler and sidestep the trap entirely.
Map<Text, Integer> count = new HashMap<Text, Integer>();
private Text word = new Text();

// local aggregation -- BROKEN: reuses one mutable Text as the map key
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());          // also mutates every key already stored in count
        if (count.containsKey(word)) {      // unreliable: stored bucket hashes are stale
            count.put(word, count.get(word) + 1);
        } else {
            count.put(word, 1);
        }
    }
}
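The failure mode can be reproduced without Hadoop using any mutable key whose hashCode depends on its contents, which is how Text behaves. A minimal sketch with a made-up MutableKey class standing in for Text:

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Text: mutable, with content-based equals/hashCode.
class MutableKey {
    private String value = "";
    void set(String v) { value = v; }
    @Override public boolean equals(Object o) {
        return o instanceof MutableKey && ((MutableKey) o).value.equals(value);
    }
    @Override public int hashCode() { return value.hashCode(); }
}

public class MutableKeyDemo {
    public static void main(String[] args) {
        Map<MutableKey, Integer> count = new HashMap<>();
        MutableKey key = new MutableKey();

        key.set("hello");
        count.put(key, 1);   // stored in the bucket for hash("hello")

        key.set("world");    // mutates the key object already inside the map
        count.put(key, 1);   // second entry, same object, bucket for hash("world")

        key.set("hello");
        // Two entries now share this single key object, so the map is corrupted:
        System.out.println(count.size());            // 2
        System.out.println(count.containsKey(key));  // true (stale "hello" entry found)

        MutableKey world = new MutableKey();
        world.set("world");
        // The "world" entry's key object now reads "hello", so equals fails:
        System.out.println(count.containsKey(world)); // false, despite the put above
    }
}
```

The same corruption happens inside the Text-keyed mapper: each word.set(...) rewrites the keys already stored in the map.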
This version works:
Map<String, Integer> count = new HashMap<String, Integer>();

// local aggregation with immutable String keys
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        String wd = itr.nextToken();
        if (count.containsKey(wd)) {
            count.put(wd, count.get(wd) + 1);
        } else {
            count.put(wd, 1);
        }
    }
}
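One detail the snippet above leaves out: with in-mapper combining, the accumulated map must still be emitted, conventionally by overriding Mapper#cleanup(Context) and calling context.write for each entry after the task's input is exhausted. The aggregation logic itself can be exercised outside Hadoop; a minimal plain-Java sketch (the sample input line is made up), using Map.merge to replace the containsKey/put dance:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class InMapperCombineDemo {
    // Same local-aggregation logic as the mapper above, applied to one line.
    static void aggregate(String line, Map<String, Integer> count) {
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            // merge() inserts 1 for a new word, otherwise adds 1 to the old count
            count.merge(itr.nextToken(), 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> count = new HashMap<>();
        aggregate("to be or not to be", count);
        // In a real Mapper, cleanup(Context) would now flush the map:
        //   for (Map.Entry<String, Integer> e : count.entrySet())
        //       context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        System.out.println(count.get("to"));  // 2
        System.out.println(count.get("or"));  // 1
    }
}
```

Flushing in cleanup() rather than per-record is what makes this an in-mapper combiner: each distinct word is emitted once per map task instead of once per occurrence.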