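1 System, software, and prerequisites
- CentOS 7, 64-bit
- hadoop-2.5.2 is installed: https://www.jianshu.com/p/5707c5ccd85b
- Python 3.7.3 is installed on CentOS 7 (a stock CentOS 7 provides Python 2.7 as `python`, so confirm with `python --version`; the scripts below run under either version)
To keep Linux permissions from getting in a beginner's way, all commands are run as root.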
2 Steps
- Create the file mapper.py:
#!/usr/bin/python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
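To verify, run:
echo aa bb cc dd aa cc|python mapper.py
This prints one tab-separated word/count pair per input word:
aa	1
bb	1
cc	1
dd	1
aa	1
cc	1
- Create the file reducer.py: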
#!/usr/bin/python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
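To verify, run the following; `sort` stands in for the sort-by-key that Hadoop performs between the map and reduce phases:
echo aa bb cc dd aa cc|python mapper.py|sort|python reducer.py
This prints the total for each word:
aa	2
bb	1
cc	2
dd	1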
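The reducer relies on equal keys arriving next to each other, which is why the local check pipes through `sort`. As a rough in-process sketch of the same map, sort, reduce flow (the function names `map_words`, `reduce_counts`, and `word_count` are made up for this illustration, and `itertools.groupby` replaces the hand-written `current_word` bookkeeping of reducer.py):

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Emit (word, 1) for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(pairs):
    """Sum the counts of ADJACENT pairs that share a key, like reducer.py."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Map, sort by key (a stand-in for Hadoop's shuffle), then reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, total in word_count(["aa bb cc dd aa cc"]):
        print("%s\t%s" % (word, total))
```

Calling `reduce_counts` directly on the unsorted mapper output would report `aa` and `cc` twice with a count of 1 each, because `groupby`, like the `if current_word == word` check, only merges neighbouring keys.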
- Create a file named info.txt with the following content:
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
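To have something to compare the job's output against, the expected totals for this file can be worked out locally. The snippet below is only a cross-check sketch and not part of the Hadoop job; `INFO_TXT` and `expected_counts` are names introduced here for illustration, and `collections.Counter` does the counting in place of the mapper and reducer.

```python
from collections import Counter

# The content of info.txt, copied from the step above.
INFO_TXT = """\
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
"""


def expected_counts(text):
    """Count whitespace-separated words, the same split mapper.py uses."""
    return Counter(text.split())


if __name__ == "__main__":
    for word, total in sorted(expected_counts(INFO_TXT).items()):
        print("%s\t%s" % (word, total))
```

For this input the totals are aa 10, bb 5, cc 11, and dd 6, so the streaming job should report the same numbers.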
- Upload the file to HDFS as /data/info:
hdfs dfs -mkdir /data
hdfs dfs -put info.txt /data/info
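- Run the following command, after making sure that /out99 does not already exist in HDFS (the trailing backslashes join the lines into a single command):
$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar \
    -input "/data/*" \
    -output "/out99" \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file "/root/mapper.py" \
    -file "/root/reducer.py"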
Note: $HADOOP_HOME is the Hadoop installation (home) directory.
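Since the streaming invocation is one long command, it can help to see it as a single argument list. The following is a minimal sketch that assembles the same arguments in Python; `build_streaming_command` and its `script_dir` parameter are hypothetical helpers, `/opt/hadoop` is only a placeholder fallback, and every flag and path is taken from the command in this step.

```python
import os
import shlex


def build_streaming_command(hadoop_home, script_dir="/root"):
    """Return the hadoop-streaming invocation from this step as an argv list."""
    return [
        os.path.join(hadoop_home, "bin", "hadoop"),
        "jar",
        os.path.join(hadoop_home, "share", "hadoop", "tools", "lib",
                     "hadoop-streaming-2.5.2.jar"),
        "-input", "/data/*",
        "-output", "/out99",
        "-mapper", "python mapper.py",
        "-reducer", "python reducer.py",
        "-file", os.path.join(script_dir, "mapper.py"),
        "-file", os.path.join(script_dir, "reducer.py"),
    ]


if __name__ == "__main__":
    # "/opt/hadoop" is a placeholder used only when HADOOP_HOME is unset.
    argv = build_streaming_command(os.environ.get("HADOOP_HOME", "/opt/hadoop"))
    # Print a shell-quoted version for inspection; nothing is executed here.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The sketch only prints the command. Passing the list to `subprocess.run(argv, check=True)` on a machine with Hadoop installed would be one way to launch it from Python.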
That completes the word-frequency count implemented in Python.