Setting up mahout-0.9 integrated with hadoop-2.2.0

Download mahout-distribution-0.9.tar.gz

wget http://mirror.bit.edu.cn/apache/mahout/0.9/mahout-distribution-0.9.tar.gz

Configure the environment variables

export MAHOUT_HOME=/opt/mahout-0.9
export MAHOUT_CONF_DIR=/opt/mahout-0.9/conf
export PATH=$MAHOUT_HOME/bin:$PATH
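
These exports typically go into ~/.bashrc (or /etc/profile) so they survive new shells. A quick way to verify the setup is to run the mahout driver with no arguments, which should print the list of available program names:

echo $MAHOUT_HOME    # should print /opt/mahout-0.9
mahout               # with no arguments the driver lists the valid program names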

Mahout has two run modes: local mode and hadoop mode.

To run Mahout in local mode, set the environment variable MAHOUT_LOCAL=true; otherwise it runs in hadoop mode, which requires the hadoop cluster to be started first.
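
A minimal sketch of switching between the two modes (the seqdirectory paths below are placeholders, not part of this walkthrough):

export MAHOUT_LOCAL=true   # subsequent mahout commands run against the local filesystem, no cluster needed
mahout seqdirectory -i /tmp/docs -o /tmp/docs-seq -ow   # placeholder input/output directories
unset MAHOUT_LOCAL         # back to hadoop mode; requires HADOOP_CONF_DIR and a running cluster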

Download the test data

wget https://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
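
The file is the UCI synthetic control chart dataset: 600 time series, one per line, each with 60 space-separated values. A quick sanity check before uploading:

wc -l synthetic_control.data     # expect 600 lines
head -1 synthetic_control.data   # one series of 60 values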

Start the hadoop cluster

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
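
Before submitting jobs it is worth confirming the daemons came up. jps (shipped with the JDK) should list NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer on the master, and DataNode/NodeManager on the worker nodes (the exact set depends on your cluster layout):

jps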

Upload the test data to HDFS

scott@master:/var/tmp$ hdfs dfs -mkdir testdata
scott@master:/var/tmp$ hdfs dfs -put synthetic_control.data testdata
scott@master:/var/tmp$ hdfs dfs -ls /user/scott/testdata
Found 1 items
-rw-r--r-- 2 scott scott 288374 2014-04-28 20:57 /user/scott/testdata/synthetic_control.data

Run each of the following jobs as a test

mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
mahout org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job (fails: the class is not found in mahout-0.9, presumably removed)
mahout org.apache.mahout.clustering.syntheticcontrol.meanshift.Job (fails: the class is not found in mahout-0.9, presumably removed; see the check below)
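
To see which syntheticcontrol Job classes actually ship with your release before running them, list the contents of the examples job jar (the same jar the driver reports as MAHOUT-JOB in the output below):

jar tf $MAHOUT_HOME/mahout-examples-0.9-job.jar | grep 'syntheticcontrol.*/Job.class'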

Notes:

Each of the example jobs above submits MapReduce jobs to hadoop; the computation runs on the cluster and the final result is printed to the console.

Make sure MAHOUT_HOME is set and $MAHOUT_HOME/bin has been added to PATH.

Make sure testdata/synthetic_control.data ends up at /user/scott/testdata/synthetic_control.data on HDFS (scott is the user name here). In other words, org.apache.mahout.clustering.syntheticcontrol.kmeans.Job by default reads its input from the testdata directory under the current user's HDFS home directory.
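
The defaults seen in the log below (k = 6, convergence delta 0.5, 10 iterations, results under output) can also be set explicitly. This is only a sketch using the common Mahout option names; the exact flags are an assumption, so confirm them with --help against your build:

mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job --help
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job \
  -i testdata -o output \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -k 6 -cd 0.5 -x 10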

The console output is:

scott@master:/var/tmp$ mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/hadoop-2.2.0/bin/hadoop and HADOOP_CONF_DIR=/opt/hadoop-2.2.0/etc/hadoop
MAHOUT-JOB: /opt/mahout-0.9/mahout-examples-0.9-job.jar
14/04/28 21:23:54 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.props found on classpath, will use command-line arguments only
14/04/28 21:23:54 INFO kmeans.Job: Running with default arguments
14/04/28 21:23:56 INFO common.HadoopUtil: Deleting output
14/04/28 21:24:00 INFO kmeans.Job: Preparing Input
14/04/28 21:24:00 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.226.131:8032
14/04/28 21:24:00 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/04/28 21:24:02 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 21:24:02 INFO mapreduce.JobSubmitter: number of splits:1
14/04/28 21:24:02 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
14/04/28 21:24:02 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
14/04/28 21:24:02 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
14/04/28 21:24:02 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
14/04/28 21:24:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1398689061205_0013
14/04/28 21:24:03 INFO impl.YarnClientImpl: Submitted application application_1398689061205_0013 to ResourceManager at master/192.168.226.131:8032
14/04/28 21:24:03 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1398689061205_0013/
14/04/28 21:24:03 INFO mapreduce.Job: Running job: job_1398689061205_0013
14/04/28 21:24:12 INFO mapreduce.Job: Job job_1398689061205_0013 running in uber mode : false
14/04/28 21:24:12 INFO mapreduce.Job: map 0% reduce 0%
14/04/28 21:24:20 INFO mapreduce.Job: map 100% reduce 0%
14/04/28 21:24:20 INFO mapreduce.Job: Job job_1398689061205_0013 completed successfully
14/04/28 21:24:21 INFO mapreduce.Job: Counters: 27
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=79216
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=288500
HDFS: Number of bytes written=335470
HDFS: Number of read operations=5
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=6130
Total time spent by all reduces in occupied slots (ms)=0
Map-Reduce Framework
Map input records=600
Map output records=600
Input split bytes=126
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=52
CPU time spent (ms)=980
Physical memory (bytes) snapshot=93544448
Virtual memory (bytes) snapshot=999833600
Total committed heap usage (bytes)=15663104
File Input Format Counters
Bytes Read=288374
File Output Format Counters
Bytes Written=335470
14/04/28 21:24:21 INFO kmeans.Job: Running random seed to get initial clusters
14/04/28 21:24:21 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/04/28 21:24:21 INFO compress.CodecPool: Got brand-new compressor [.deflate]
14/04/28 21:24:21 INFO kmeans.RandomSeedGenerator: Wrote 6 Klusters to output/random-seeds/part-randomSeed
14/04/28 21:24:21 INFO kmeans.Job: Running KMeans with k = 6
14/04/28 21:24:21 INFO kmeans.KMeansDriver: Input: output/data Clusters In: output/random-seeds/part-randomSeed Out: output
14/04/28 21:24:21 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10
14/04/28 21:24:21 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
14/04/28 21:24:22 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.226.131:8032
14/04/28 21:24:22 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/04/28 21:24:23 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 21:24:23 INFO mapreduce.JobSubmitter: number of splits:1
14/04/28 21:24:23 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
14/04/28 21:24:23 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
14/04/28 21:24:23 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
14/04/28 21:24:23 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
14/04/28 21:24:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1398689061205_0014
14/04/28 21:24:23 INFO impl.YarnClientImpl: Submitted application application_1398689061205_0014 to ResourceManager at master/192.168.226.131:8032
14/04/28 21:24:23 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1398689061205_0014/
14/04/28 21:24:23 INFO mapreduce.Job: Running job: job_1398689061205_0014
14/04/28 21:24:32 INFO mapreduce.Job: Job job_1398689061205_0014 running in uber mode : false
14/04/28 21:24:32 INFO mapreduce.Job: map 0% reduce 0%
14/04/28 21:24:41 INFO mapreduce.Job: map 100% reduce 0%
14/04/28 21:24:49 INFO mapreduce.Job: map 100% reduce 100%
14/04/28 21:24:49 INFO mapreduce.Job: Job job_1398689061205_0014 completed successfully
14/04/28 21:24:49 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=10650
FILE: Number of bytes written=181959
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=358669
HDFS: Number of bytes written=7581
HDFS: Number of read operations=37
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=6728
Total time spent by all reduces in occupied slots (ms)=6023
Map-Reduce Framework
Map input records=600
Map output records=6
Map output bytes=10620
Map output materialized bytes=10650
Input split bytes=119
Combine input records=0
Combine output records=0
Reduce input groups=6
Reduce shuffle bytes=10650
Reduce input records=6
Reduce output records=6
Spilled Records=12
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=153
CPU time spent (ms)=2130
Physical memory (bytes) snapshot=292732928
Virtual memory (bytes) snapshot=2001772544
Total committed heap usage (bytes)=136253440
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=335470
File Output Format Counters
Bytes Written=7581
14/04/28 21:24:50 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.226.131:8032
14/04/28 21:24:50 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
14/04/28 21:24:51 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 21:24:51 INFO mapreduce.JobSubmitter: number of splits:1
14/04/28 21:24:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1398689061205_0015
14/04/28 21:24:51 INFO impl.YarnClientImpl: Submitted application application_1398689061205_0015 to ResourceManager at master/192.168.226.131:8032
14/04/28 21:24:51 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1398689061205_0015/
14/04/28 21:24:51 INFO mapreduce.Job: Running job: job_1398689061205_0015

Middle portion of the output omitted...

1.0 : [distance=42.73030826772035]: 60 = [28.010, 27.961, 26.655, 29.143, 25.286, 34.115, 33.728, 29.298, 27.745, 33.405, 29.700, 30.363, 25.760, 35.000, 28.150, 30.786, 32.786, 26.296, 25.722, 25.761, 29.633, 43.771, 49.522, 46.798, 45.578, 41.024, 50.108, 46.032, 50.657, 48.302, 48.651, 50.404, 46.673, 41.523, 51.030, 42.828, 41.135, 51.210, 41.171, 43.582, 51.001, 51.808, 49.873, 43.735, 51.227, 51.355, 43.717, 45.098, 48.935, 44.966, 42.068, 41.637, 44.421, 43.066, 45.898, 48.004, 43.319, 42.409, 42.016, 46.216]
1.0 : [distance=35.77976239218027]: 60 = [35.036, 35.913, 30.958, 31.258, 24.339, 25.777, 29.094, 32.911, 32.630, 29.055, 31.944, 31.832, 33.308, 35.279, 34.865, 35.778, 27.651, 33.173, 27.835, 31.879, 32.646, 30.609, 33.358, 41.870, 45.066, 45.443, 47.100, 50.024, 47.008, 52.741, 45.736, 41.431, 48.348, 44.783, 48.456, 51.256, 41.180, 47.332, 50.033, 41.438, 42.207, 44.012, 52.550, 44.583, 47.623, 46.980, 47.415, 44.941, 42.728, 51.521, 44.785, 42.065, 41.098, 46.414, 43.967, 50.842, 49.872, 41.337, 48.979, 50.323]
1.0 : [distance=41.201065271437834]: 60 = [24.803, 32.798, 32.086, 32.551, 30.258, 27.461, 35.801, 31.986, 30.982, 32.182, 25.148, 29.000, 25.497, 28.623, 33.190, 32.359, 30.669, 35.264, 31.896, 25.874, 28.882, 32.737, 25.278, 33.452, 27.108, 50.295, 45.983, 42.380, 49.701, 47.081, 48.494, 47.235, 49.052, 46.604, 47.542, 50.125, 49.606, 49.441, 49.338, 50.691, 39.759, 50.248, 39.981, 42.056, 47.477, 45.393, 48.282, 46.496, 46.834, 44.100, 41.640, 50.254, 46.767, 50.376, 41.929, 48.467, 42.033, 49.194, 48.123, 44.148]
14/04/28 21:29:41 INFO clustering.ClusterDumper: Wrote 6 clusters
14/04/28 21:29:41 INFO driver.MahoutDriver: Program took 346240 ms (Minutes: 5.770666666666667)

Inspect the output

scott@master:/var/tmp$ hdfs dfs -ls output
Found 15 items
-rw-r--r-- 2 scott scott 194 2014-04-28 21:11 output/_policy
drwxr-xr-x - scott scott 0 2014-04-28 21:12 output/clusteredPoints
drwxr-xr-x - scott scott 0 2014-04-28 21:06 output/clusters-0
drwxr-xr-x - scott scott 0 2014-04-28 21:07 output/clusters-1
drwxr-xr-x - scott scott 0 2014-04-28 21:11 output/clusters-10-final
drwxr-xr-x - scott scott 0 2014-04-28 21:07 output/clusters-2
drwxr-xr-x - scott scott 0 2014-04-28 21:08 output/clusters-3
drwxr-xr-x - scott scott 0 2014-04-28 21:08 output/clusters-4
drwxr-xr-x - scott scott 0 2014-04-28 21:09 output/clusters-5
drwxr-xr-x - scott scott 0 2014-04-28 21:09 output/clusters-6
drwxr-xr-x - scott scott 0 2014-04-28 21:10 output/clusters-7
drwxr-xr-x - scott scott 0 2014-04-28 21:10 output/clusters-8
drwxr-xr-x - scott scott 0 2014-04-28 21:11 output/clusters-9
drwxr-xr-x - scott scott 0 2014-04-28 21:06 output/data
drwxr-xr-x - scott scott 0 2014-04-28 21:06 output/random-seeds
scott@master:/var/tmp$ hdfs dfs -ls -R output/data
-rw-r--r-- 2 scott scott 0 2014-04-28 21:06 output/data/_SUCCESS
-rw-r--r-- 2 scott scott 335470 2014-04-28 21:06 output/data/part-m-00000
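
The "1.0 : [distance=...]" lines in the console output above come from a cluster dump that the example Job runs at the end. To re-run it yourself, the clusterdump utility reads the final clusters and the clustered points and writes a report to a local file (flag names and the /var/tmp/clusters.txt path are assumptions here; verify with mahout clusterdump --help):

mahout clusterdump -i output/clusters-10-final -p output/clusteredPoints -o /var/tmp/clusters.txt
more /var/tmp/clusters.txt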

Dump the vectors with mahout

scott@master:/var/tmp$ mahout vectordump --input output/data/part-m-00000 | more
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/hadoop-2.2.0/bin/hadoop and HADOOP_CONF_DIR=/opt/hadoop-2.2.0/etc/hadoop
MAHOUT-JOB: /opt/mahout-0.9/mahout-examples-0.9-job.jar
14/04/28 21:22:34 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[output/data/part-m-00000], --startPhase=[0], --tempDir=[temp]}
14/04/28 21:22:36 INFO vectors.VectorDumper: Sort? false
{0:28.7812,31:26.6311,34:29.1495,4:28.9207,32:35.6541,5:33.7596,8:35.2479,6:25.3969,30:25.0293,24:33.0292,29:34.9424,17:26.5235,51:24.5556,36:26.1927,12:36.0253,23:29.
5054,58:25.4652,21:29.27,11:29.2171,10:32.8717,15:32.8717,7:27.7849,28:26.1203,46:28.0721,33:28.4353,55:34.9879,54:34.9318,25:25.04,3:31.2834,49:29.747,41:26.2353,1:34
.4632,26:28.9167,44:31.0558,37:33.3182,56:32.4721,42:28.9964,27:24.3437,50:31.4333,16:34.1173,40:35.5344,48:35.4973,39:27.0443,9:27.1159,52:33.7431,13:32.337,43:32.003
6,19:26.3693,59:25.8717,2:31.3381,20:25.7744,18:27.6623,22:30.7326,35:28.1584,57:33.3759,45:34.2553,38:30.9772,47:28.9402,14:34.5249,53:25.0466}
{0:24.8923,31:32.5981,34:26.9414,4:27.8789,32:28.3038,5:31.5926,8:27.9516,6:31.4861,30:34.0765,24:31.9874,29:25.0701,17:35.6273,51:31.0205,36:33.1089,12:27.4867,23:30.
4719,58:32.1005,21:24.1311,11:31.1887,10:27.5415,15:24.488,7:35.5469,28:33.6472,46:26.3458,33:26.1471,55:26.4244,54:33.6564,25:33.6615,3:32.8217,49:29.4047,41:26.5301,
1:25.741,26:25.5511,44:32.8357,37:24.1491,56:28.4661,42:24.8578,27:30.4686,50:32.5577,16:27.5918,40:35.9519,48:28.9861,39:25.7906,9:31.6595,52:26.6418,13:31.391,43:25.
9562,19:31.4167,59:26.691,2:27.5532,20:30.7447,18:35.4102,22:35.1422,35:31.5203,57:34.2484,45:28.5322,38:28.5157,47:30.6213,14:27.811,53:28.4331}
{0:31.3987,31:24.246,34:31.6114,4:27.8613,32:26.9631,5:28.5491,8:25.2239,6:24.9717,30:27.3086,24:24.3323,29:28.8778,17:32.5614,51:26.5966,36:27.4809,12:28.2572,23:32.3
851,58:29.5446,21:31.4781,11:27.2587,10:31.8387,15:35.0625,7:32.4358,28:31.5137,46:29.6082,33:25.2919,55:29.9897,54:25.5772,25:30.2001,3:24.2905,49:27.1717,41:31.0561,
1:30.6316,26:31.2452,44:31.4391,37:24.2075,56:31.351,42:26.3583,27:26.6814,50:33.6318,16:31.5717,40:32.6293,48:34.1444,39:35.1253,9:27.3068,52:25.5387,13:26.5819,43:28
.0861,19:34.1202,59:29.343,2:26.3983,20:26.9337,18:31.0308,22:35.0173,35:24.7131,57:33.9002,45:27.3057,38:26.8059,47:35.9725,14:24.0455,53:32.5434}
{0:25.774,31:28.3714,34:35.9346,4:27.97,32:32.3667,5:25.2702,8:31.4549,6:28.132,30:27.5587,24:29.2806,29:24.824,17:35.0966,51:28.7261,36:24.3749,12:29.9578,23:31.6264,
58:27.3659,21:25.0102,11:28.9916,10:28.9564,15:24.3037,7:29.4268,28:25.5265,46:35.769,33:26.9752,55:32.5492,54:34.6156,25:34.2021,3:25.6033,49:31.156,41:26.8908,1:30.5
262,26:26.5077,44:34.3336,37:27.6083,56:30.9827,42:31.3209,27:32.2279,50:34.6292,16:24.314,40:32.4185,48:34.2054,39:29.8557,9:27.32,52:28.2979,13:30.2773,43:29.3849,19
:32.0968,59:25.3069,2:35.4209,20:33.3303,18:25.3679,22:35.3155,35:35.1146,57:24.8938,45:24.7381,38:27.8433,47:31.8725,14:30.4447,53:31.5787}
{0:27.1798,31:33.4129,34:29.6526,4:24.6555,32:26.9245,5:28.9446,8:24.5596,6:35.798,30:33.1247,24:24.6081,29:28.0295,17:31.1274,51:27.9601,36:24.5119,12:35.4154,23:33.0
321,58:31.1057,21:31.6565,11:25.3216,10:27.9634,15:29.4686,7:34.9446,28:35.8773,46:29.1348,33:30.2123,55:29.9993,54:35.3375,25:33.2025,3:25.6264,49:34.9244,41:27.9072,
1:29.2498,26:27.4335,44:33.833,37:33.9931,56:34.2149,42:35.111,27:32.6355,50:27.7218,16:33.1739,40:31.2651,48:32.3223,39:33.204,9:34.2366,52:35.7198,13:34.862,43:35.07
57,19:26.5173,59:31.0179,2:33.6928,20:28.6486,18:31.3701,22:35.9497,35:30.8644,57:33.1276,45:25.9481,38:33.3094,47:24.2875,14:25.1472,53:27.576}
{0:25.5067,31:34.7268,34:35.9963,4:33.8,32:29.9207,5:27.6671,8:30.1171,6:30.6122,30:24.2361,24:35.7385,29:27.4559,17:33.901,51:35.8381,36:27.139,12:29.5582,23:27.7188,
58:33.1669,21:26.496,11:27.8514,10:30.1524,15:26.1001,7:25.6393,28:27.3321,46:35.6706,33:27.273,55:25.2093,54:27.341,25:32.8309,3:34.4812,49:31.1795,41:25.8897,1:29.79
29,26:30.1509,44:34.9652,37:26.4589,56:33.4669,42:31.3951,27:30.5593,50:26.9458,16:33.4677,40:27.9961,48:28.458,39:35.5002,9:26.5188,52:26.7134,13:32.3601,43:30.7583,1
9:34.8311,59:35.4907,2:28.0765,20:31.9815,18:29.2674,22:32.6645,35:32.3917,57:24.1094,45:28.0919,38:25.0466,47:33.4401,14:29.2064,53:25.1641}
{0:28.6989,31:35.4191,34:24.5244,4:31.4138,32:33.3472,5:28.4636,8:28.7669,6:35.9115,30:24.4462,24:29.5057,29:25.1053,17:24.553,51:24.0323,36:24.6889,12:29.1154,23:27.9
928,58:29.6669,21:25.2126,11:33.7291,10:34.8983,15:31.6103,7:32.9058,28:34.0438,46:24.1047,33:32.2356,55:25.893,54:25.4358,25:31.0723,3:34.6229,49:24.0499,41:35.7727,1
:29.2101,26:26.3605,44:32.7839,37:28.1962,56:35.6732,42:31.3444,27:27.7434,50:29.8274,16:33.3061,40:30.8005,48:28.8249,39:31.6316,9:24.2868,52:31.0756,13:26.2804,43:25
.5691,19:27.8378,59:26.4637,2:30.9291,20:25.3525,18:29.1587,22:26.9565,35:29.4635,57:25.1869,45:32.7707,38:34.2994,47:34.006,14:33.4559,53:34.3358}
{0:30.9493,31:32.4431,34:29.2942,4:30.6691,32:30.0745,5:35.2667,8:28.8917,6:35.895,30:34.5734,24:28.8616,29:29.4617,17:25.2452,51:31.1248,36:28.4819,12:31.7516,23:32.6
26,58:31.3025,21:33.7765,11:26.0572,10:28.9898,15:24.1612,7:25.9022,28:34.8895,46:24.0331,33:25.0495,55:36.0187,54:35.9365,25:27.6223,3:34.8829,49:30.5934,41:33.0784,1
:34.317,26:33.9381,44:24.4895,37:29.8917,56:26.3866,42:33.2286,27:33.9836,50:35.4341,16:26.6554,40:27.4442,48:34.8568,39:26.4574,9:32.2092,52:24.2424,13:32.294,43:27.5
837,19:31.391,59:34.523,2:35.5674,20:32.1604,18:30.5956,22:31.1336,35:28.2689,57:33.1842,45:26.2151,38:33.1162,47:26.4765,14:31.0631,53:29.7172}

Run the 20newsgroups example in local mode

http://qwone.com/~jason/20Newsgroups/ offers three packages for download; what we need here is 20news-bydate.tar.gz.

wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz

The following step is optional: the script below checks whether the file exists and, if it does not, downloads it automatically.

Create the /tmp/mahout-work-scott directory and put 20news-bydate.tar.gz into it

scott@master:/opt/mahout-0.9/examples/bin$ mkdir -p /tmp/mahout-work-scott
scott@master:/opt/mahout-0.9/examples/bin$ mv 20news-bydate.tar.gz /tmp/mahout-work-scott/
scott@master:/opt/mahout-0.9/examples/bin$ ls /tmp/mahout-work-scott/
20news-bydate.tar.gz

Run

export MAHOUT_LOCAL=true
scott@master:/opt/mahout-0.9/examples/bin$ ./classify-20newsgroups.sh
Please select a number to choose the corresponding task to run
1. cnaivebayes
2. naivebayes
3. sgd
4. clean -- cleans up the work area in /tmp/mahout-work-scott
Enter your choice : 1 ### choose 1 here
ok. You chose 1 and we'll use cnaivebayes
creating work directory at /tmp/mahout-work-scott
Downloading 20news-bydate
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 13.7M 100 13.7M 0 0 180k 0 0:01:18 0:01:18 --:--:-- 197k
Extracting...
+ echo 'Preparing 20newsgroups data'
Preparing 20newsgroups data
+ rm -rf /tmp/mahout-work-scott/20news-all
+ mkdir /tmp/mahout-work-scott/20news-all
+ cp -R /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/alt.atheism /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.graphics /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.os.ms-windows.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.sys.ibm.pc.hardware /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.sys.mac.hardware /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.windows.x /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/misc.forsale /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/rec.autos /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/rec.motorcycles /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/rec.sport.baseball /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/rec.sport.hockey /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/sci.crypt /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/sci.electronics /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/sci.med /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/sci.space /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/soc.religion.christian /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/talk.politics.guns /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/talk.politics.mideast /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/talk.politics.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/talk.religion.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/alt.atheism /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.graphics /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.os.ms-windows.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.sys.ibm.pc.hardware /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.sys.mac.hardware /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.windows.x /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/misc.forsale /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/rec.autos /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/rec.motorcycles /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/rec.sport.baseball /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/rec.sport.hockey /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/sci.crypt /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/sci.electronics /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/sci.med /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/sci.space /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/soc.religion.christian /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/talk.politics.guns /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/talk.politics.mideast /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/talk.politics.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/talk.religion.misc /tmp/mahout-work-scott/20news-all
+ '[' /opt/hadoop-2.2.0 '!=' '' ']'
+ '[' true == '' ']'
+ echo 'Creating sequence files from 20newsgroups data'
Creating sequence files from 20newsgroups data
+ ./bin/mahout seqdirectory -i /tmp/mahout-work-scott/20news-all -o /tmp/mahout-work-scott/20news-seq -ow
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.

MAHOUT_LOCAL is set, running locally
14/04/28 22:26:35 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/tmp/mahout-work-scott/20news-all], --keyPrefix=[], --method=[mapreduce], --output=[/tmp/mahout-work-scott/20news-seq], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
3:0+1171,/tmp/mahout-work-scott/20news-all/talk.politics.misc/176995:0+4341,/tmp/mahout-work-scott/20news-all/talk.politics.misc/176919:0+1525,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178661:0+1894,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178851:0+907,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178977:0+1824,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178741:0+1551,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178721:0+1306,/tmp/mahout-work-scott/20news-all/talk.politics.misc/176952:0+580,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178624:0+1027,/tmp/mahout-work-scott/20news-all/talk.politics.misc/176916:0+1396,/tmp/mahout-work-scott/20news-all/talk.politics.misc/177023:0+1079,/tmp/mahout-work-scott/20news-all/talk.politics.misc/176992:0+751,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178877:0+2188,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178947:0+965,/tmp/mahout-work-scott/20news-all/talk.politics.misc/179031:0+10384,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178296:0+828,/tmp/mahout-work-scott/20news-all/talk.politics.misc/177011:0+1624,/tmp/mahout-work-scott/20news-all/talk.politics.misc/179027:0+4002,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178966:0+2239,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178316:0+2794,/tmp/mahout-work-scott/20news-all/talk.politics.misc/176862:0+2317,/tmp/mahout-work-scott/20news-all/talk.politics.misc/179107:0+1775,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178740:0+1280,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178549:0+3629,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178505:0+1797,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178361:0+2936,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178587:0+3352,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178889:0+714,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178438:0+1373,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178353:0+1964,/tmp/mahout-work-scott/20news-all/talk.politics.misc/179099:0+1673,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178493:0+7984,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178308:0+5910,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178558:0+4965,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178717:0+1537,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178654:0+4126,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178894:0+9233,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178775:0+3393,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178347:0+2092,/tmp/mahout-work-scott/20news-all/talk.politics.misc/177000:0+888,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178855:0+2670,/tmp/mahout-work-scott/20news-all/talk.politics.misc/176890:0+4588,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178990:0+2247,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178869:0+2405,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178892:0+1838,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178697:0+2831,/tmp/mahout-work-scott/20news-all/talk.politics.misc/179016:0+519,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178456:0+1906,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178409:0+1404,/tmp/mahout-work-scott/20news-all/talk.politics.misc/178545:0+1207,/tmp/mahout-work-scott/20news-all/talk.politics.misc/176883:0+783
14/04/28 22:26:49 INFO compress.CodecPool: Got brand-new compressor
14/04/28 22:26:53 INFO mapred.LocalJobRunner:
14/04/28 22:26:54 INFO mapred.JobClient: map 38% reduce 0%
14/04/28 22:26:56 INFO mapred.LocalJobRunner:
14/04/28 22:26:57 INFO mapred.JobClient: map 84% reduce 0%
14/04/28 22:26:57 INFO mapred.Task: Task:attempt_local1376615723_0001_m_000000_0 is done. And is in the process of commiting
14/04/28 22:26:57 INFO mapred.LocalJobRunner:
14/04/28 22:26:57 INFO mapred.Task: Task attempt_local1376615723_0001_m_000000_0 is allowed to commit now
14/04/28 22:26:57 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1376615723_0001_m_000000_0' to /tmp/mahout-work-scott/20news-seq
14/04/28 22:26:57 INFO mapred.LocalJobRunner:
14/04/28 22:26:57 INFO mapred.Task: Task 'attempt_local1376615723_0001_m_000000_0' done.
14/04/28 22:26:57 INFO mapred.LocalJobRunner: Finishing task: attempt_local1376615723_0001_m_000000_0
14/04/28 22:26:57 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:26:58 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:26:58 INFO mapred.JobClient: Job complete: job_local1376615723_0001
14/04/28 22:26:58 INFO mapred.JobClient: Counters: 12
14/04/28 22:26:58 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:26:58 INFO mapred.JobClient: Bytes Written=19351795
14/04/28 22:26:58 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:26:58 INFO mapred.JobClient: Bytes Read=0
14/04/28 22:26:58 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:26:58 INFO mapred.JobClient: FILE_BYTES_READ=61510586
14/04/28 22:26:58 INFO mapred.JobClient: FILE_BYTES_WRITTEN=45251147
14/04/28 22:26:58 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:26:58 INFO mapred.JobClient: Map input records=18846
14/04/28 22:26:58 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:26:58 INFO mapred.JobClient: Spilled Records=0
14/04/28 22:26:58 INFO mapred.JobClient: Total committed heap usage (bytes)=45019136
14/04/28 22:26:58 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:26:58 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:26:58 INFO mapred.JobClient: SPLIT_RAW_BYTES=1465642
14/04/28 22:26:58 INFO mapred.JobClient: Map output records=18846
14/04/28 22:26:58 INFO driver.MahoutDriver: Program took 22742 ms (Minutes: 0.37903333333333333)
+ echo 'Converting sequence files to vectors'
Converting sequence files to vectors
+ ./bin/mahout seq2sparse -i /tmp/mahout-work-scott/20news-seq -o /tmp/mahout-work-scott/20news-vectors -lnorm -nv -wt tfidf
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
14/04/28 22:26:59 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
14/04/28 22:26:59 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
14/04/28 22:26:59 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
14/04/28 22:26:59 INFO vectorizer.SparseVectorsFromSequenceFiles: Tokenizing documents in /tmp/mahout-work-scott/20news-seq
14/04/28 22:27:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/28 22:27:00 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:00 INFO mapred.JobClient: Running job: job_local555795545_0001
14/04/28 22:27:00 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:00 INFO mapred.LocalJobRunner: Starting task: attempt_local555795545_0001_m_000000_0
14/04/28 22:27:00 INFO util.ProcessTree: setsid exited with exit code 0
14/04/28 22:27:00 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3098cc00
14/04/28 22:27:01 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-seq/part-m-00000:0+19201771
14/04/28 22:27:01 INFO compress.CodecPool: Got brand-new decompressor
14/04/28 22:27:01 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:05 INFO mapred.Task: Task:attempt_local555795545_0001_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:06 INFO mapred.LocalJobRunner:
14/04/28 22:27:06 INFO mapred.Task: Task attempt_local555795545_0001_m_000000_0 is allowed to commit now
14/04/28 22:27:06 INFO output.FileOutputCommitter: Saved output of task 'attempt_local555795545_0001_m_000000_0' to /tmp/mahout-work-scott/20news-vectors/tokenized-documents
14/04/28 22:27:06 INFO mapred.LocalJobRunner:
14/04/28 22:27:06 INFO mapred.Task: Task 'attempt_local555795545_0001_m_000000_0' done.
14/04/28 22:27:06 INFO mapred.LocalJobRunner: Finishing task: attempt_local555795545_0001_m_000000_0
14/04/28 22:27:06 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:06 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:27:06 INFO mapred.JobClient: Job complete: job_local555795545_0001
14/04/28 22:27:06 INFO mapred.JobClient: Counters: 12
14/04/28 22:27:06 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:06 INFO mapred.JobClient: Bytes Written=27717556
14/04/28 22:27:06 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:06 INFO mapred.JobClient: Bytes Read=19351795
14/04/28 22:27:06 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:06 INFO mapred.JobClient: FILE_BYTES_READ=20821532
14/04/28 22:27:06 INFO mapred.JobClient: FILE_BYTES_WRITTEN=29250590
14/04/28 22:27:06 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:06 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:06 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:06 INFO mapred.JobClient: Spilled Records=0
14/04/28 22:27:06 INFO mapred.JobClient: Total committed heap usage (bytes)=31653888
14/04/28 22:27:06 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:06 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:06 INFO mapred.JobClient: SPLIT_RAW_BYTES=116
14/04/28 22:27:06 INFO mapred.JobClient: Map output records=18846
14/04/28 22:27:06 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
14/04/28 22:27:06 INFO vectorizer.DictionaryVectorizer: Creating dictionary from /tmp/mahout-work-scott/20news-vectors/tokenized-documents and saving at /tmp/mahout-work-scott/20news-vectors/wordcount
14/04/28 22:27:06 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:07 INFO mapred.JobClient: Running job: job_local761839717_0002
14/04/28 22:27:07 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:07 INFO mapred.LocalJobRunner: Starting task: attempt_local761839717_0002_m_000000_0
14/04/28 22:27:07 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@611f3b6b
14/04/28 22:27:07 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-vectors/tokenized-documents/part-m-00000:0+27502680
14/04/28 22:27:07 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:27:07 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:27:07 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:27:08 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:09 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:09 INFO mapred.MapTask: bufstart = 0; bufend = 3901663; bufvoid = 99614720
14/04/28 22:27:09 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
14/04/28 22:27:10 INFO mapred.MapTask: Finished spill 0
14/04/28 22:27:10 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:10 INFO mapred.MapTask: bufstart = 3901663; bufend = 7767143; bufvoid = 99614720
14/04/28 22:27:10 INFO mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
14/04/28 22:27:11 INFO mapred.MapTask: Finished spill 1
14/04/28 22:27:11 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:11 INFO mapred.MapTask: bufstart = 7767143; bufend = 11664511; bufvoid = 99614720
14/04/28 22:27:11 INFO mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
14/04/28 22:27:11 INFO mapred.MapTask: Finished spill 2
14/04/28 22:27:12 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:12 INFO mapred.MapTask: bufstart = 11664511; bufend = 15439459; bufvoid = 99614720
14/04/28 22:27:12 INFO mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
14/04/28 22:27:12 INFO mapred.MapTask: Finished spill 3
14/04/28 22:27:12 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:12 INFO mapred.MapTask: bufstart = 15439459; bufend = 19316245; bufvoid = 99614720
14/04/28 22:27:12 INFO mapred.MapTask: kvstart = 65533; kvend = 327677; length = 327680
14/04/28 22:27:13 INFO mapred.LocalJobRunner:
14/04/28 22:27:13 INFO mapred.MapTask: Finished spill 4
14/04/28 22:27:13 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:13 INFO mapred.MapTask: bufstart = 19316245; bufend = 23269785; bufvoid = 99614720
14/04/28 22:27:13 INFO mapred.MapTask: kvstart = 327677; kvend = 262140; length = 327680
14/04/28 22:27:13 INFO mapred.MapTask: Finished spill 5
14/04/28 22:27:14 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:14 INFO mapred.MapTask: bufstart = 23269785; bufend = 27114686; bufvoid = 99614720
14/04/28 22:27:14 INFO mapred.MapTask: kvstart = 262140; kvend = 196603; length = 327680
14/04/28 22:27:14 INFO mapred.JobClient: map 51% reduce 0%
14/04/28 22:27:14 INFO mapred.MapTask: Finished spill 6
14/04/28 22:27:14 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:14 INFO mapred.MapTask: bufstart = 27114686; bufend = 30988974; bufvoid = 99614720
14/04/28 22:27:14 INFO mapred.MapTask: kvstart = 196603; kvend = 131066; length = 327680
14/04/28 22:27:15 INFO mapred.MapTask: Finished spill 7
14/04/28 22:27:15 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:15 INFO mapred.MapTask: bufstart = 30988974; bufend = 34910626; bufvoid = 99614720
14/04/28 22:27:15 INFO mapred.MapTask: kvstart = 131066; kvend = 65529; length = 327680
14/04/28 22:27:15 INFO mapred.MapTask: Finished spill 8
14/04/28 22:27:15 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:15 INFO mapred.MapTask: bufstart = 34910626; bufend = 38817315; bufvoid = 99614720
14/04/28 22:27:15 INFO mapred.MapTask: kvstart = 65529; kvend = 327673; length = 327680
14/04/28 22:27:15 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:27:16 INFO mapred.LocalJobRunner:
14/04/28 22:27:16 INFO mapred.MapTask: Finished spill 9
14/04/28 22:27:16 INFO mapred.MapTask: Finished spill 10
14/04/28 22:27:16 INFO mapred.Merger: Merging 11 sorted segments
14/04/28 22:27:16 INFO mapred.Merger: Merging 2 intermediate segments out of a total of 11
14/04/28 22:27:16 INFO mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 6794584 bytes
14/04/28 22:27:17 INFO mapred.Task: Task:attempt_local761839717_0002_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:17 INFO mapred.LocalJobRunner:
14/04/28 22:27:17 INFO mapred.Task: Task 'attempt_local761839717_0002_m_000000_0' done.
14/04/28 22:27:17 INFO mapred.LocalJobRunner: Finishing task: attempt_local761839717_0002_m_000000_0
14/04/28 22:27:17 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:17 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:27:17 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@65fd170b
14/04/28 22:27:17 INFO mapred.LocalJobRunner:
14/04/28 22:27:17 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:27:17 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3538080 bytes
14/04/28 22:27:17 INFO mapred.LocalJobRunner:
14/04/28 22:27:17 INFO mapred.Task: Task:attempt_local761839717_0002_r_000000_0 is done. And is in the process of commiting
14/04/28 22:27:17 INFO mapred.LocalJobRunner:
14/04/28 22:27:17 INFO mapred.Task: Task attempt_local761839717_0002_r_000000_0 is allowed to commit now
14/04/28 22:27:17 INFO output.FileOutputCommitter: Saved output of task 'attempt_local761839717_0002_r_000000_0' to /tmp/mahout-work-scott/20news-vectors/wordcount
14/04/28 22:27:17 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:27:17 INFO mapred.Task: Task 'attempt_local761839717_0002_r_000000_0' done.
14/04/28 22:27:18 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:27:18 INFO mapred.JobClient: Job complete: job_local761839717_0002
14/04/28 22:27:18 INFO mapred.JobClient: Counters: 20
14/04/28 22:27:18 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:18 INFO mapred.JobClient: Bytes Written=2333133
14/04/28 22:27:18 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:18 INFO mapred.JobClient: Bytes Read=27717556
14/04/28 22:27:18 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:18 INFO mapred.JobClient: FILE_BYTES_READ=119302696
14/04/28 22:27:18 INFO mapred.JobClient: FILE_BYTES_WRITTEN=86725091
14/04/28 22:27:18 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:18 INFO mapred.JobClient: Reduce input groups=192904
14/04/28 22:27:18 INFO mapred.JobClient: Map output materialized bytes=3538084
14/04/28 22:27:18 INFO mapred.JobClient: Combine output records=567522
14/04/28 22:27:18 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:18 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:27:18 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:18 INFO mapred.JobClient: Reduce output records=93563
14/04/28 22:27:18 INFO mapred.JobClient: Spilled Records=819322
14/04/28 22:27:18 INFO mapred.JobClient: Map output bytes=39462740
14/04/28 22:27:18 INFO mapred.JobClient: Total committed heap usage (bytes)=262676480
14/04/28 22:27:18 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:18 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:18 INFO mapred.JobClient: SPLIT_RAW_BYTES=140
14/04/28 22:27:18 INFO mapred.JobClient: Map output records=2664273
14/04/28 22:27:18 INFO mapred.JobClient: Combine input records=3038891
14/04/28 22:27:18 INFO mapred.JobClient: Reduce input records=192904
14/04/28 22:27:19 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:19 INFO filecache.TrackerDistributedCacheManager: Creating dictionary.file-0 in /tmp/hadoop-scott/mapred/local/archive/-780352276620675591_-511400580_683383803/file/tmp/mahout-work-scott/20news-vectors-work-460907113576986195 with rwxr-xr-x
14/04/28 22:27:19 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/20news-vectors/dictionary.file-0 as /tmp/hadoop-scott/mapred/local/archive/-780352276620675591_-511400580_683383803/file/tmp/mahout-work-scott/20news-vectors/dictionary.file-0
14/04/28 22:27:19 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/20news-vectors/dictionary.file-0 as /tmp/hadoop-scott/mapred/local/archive/-780352276620675591_-511400580_683383803/file/tmp/mahout-work-scott/20news-vectors/dictionary.file-0
14/04/28 22:27:19 INFO mapred.JobClient: Running job: job_local1997223068_0003
14/04/28 22:27:19 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:19 INFO mapred.LocalJobRunner: Starting task: attempt_local1997223068_0003_m_000000_0
14/04/28 22:27:19 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@413fd214
14/04/28 22:27:19 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-vectors/tokenized-documents/part-m-00000:0+27502680
14/04/28 22:27:19 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:27:19 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:27:19 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:27:20 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:21 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:27:22 INFO mapred.MapTask: Finished spill 0
14/04/28 22:27:22 INFO mapred.Task: Task:attempt_local1997223068_0003_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:22 INFO mapred.LocalJobRunner:
14/04/28 22:27:22 INFO mapred.Task: Task 'attempt_local1997223068_0003_m_000000_0' done.
14/04/28 22:27:22 INFO mapred.LocalJobRunner: Finishing task: attempt_local1997223068_0003_m_000000_0
14/04/28 22:27:22 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:22 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@327579f6
14/04/28 22:27:22 INFO mapred.LocalJobRunner:
14/04/28 22:27:22 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:27:22 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 27274287 bytes
14/04/28 22:27:22 INFO mapred.LocalJobRunner:
14/04/28 22:27:23 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:27:26 INFO mapred.Task: Task:attempt_local1997223068_0003_r_000000_0 is done. And is in the process of commiting
14/04/28 22:27:26 INFO mapred.LocalJobRunner:
14/04/28 22:27:26 INFO mapred.Task: Task attempt_local1997223068_0003_r_000000_0 is allowed to commit now
14/04/28 22:27:26 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1997223068_0003_r_000000_0' to /tmp/mahout-work-scott/20news-vectors/partial-vectors-0
14/04/28 22:27:26 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:27:26 INFO mapred.Task: Task 'attempt_local1997223068_0003_r_000000_0' done.
14/04/28 22:27:27 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:27:27 INFO mapred.JobClient: Job complete: job_local1997223068_0003
14/04/28 22:27:27 INFO mapred.JobClient: Counters: 20
14/04/28 22:27:27 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:27 INFO mapred.JobClient: Bytes Written=29543146
14/04/28 22:27:27 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:27 INFO mapred.JobClient: Bytes Read=27717556
14/04/28 22:27:27 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:27 INFO mapred.JobClient: FILE_BYTES_READ=219012655
14/04/28 22:27:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=184031012
14/04/28 22:27:27 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:27 INFO mapred.JobClient: Reduce input groups=18846
14/04/28 22:27:27 INFO mapred.JobClient: Map output materialized bytes=27274291
14/04/28 22:27:27 INFO mapred.JobClient: Combine output records=0
14/04/28 22:27:27 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:27 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:27:27 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:27 INFO mapred.JobClient: Reduce output records=18846
14/04/28 22:27:27 INFO mapred.JobClient: Spilled Records=37692
14/04/28 22:27:27 INFO mapred.JobClient: Map output bytes=27199343
14/04/28 22:27:27 INFO mapred.JobClient: Total committed heap usage (bytes)=352329728
14/04/28 22:27:27 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:27 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=140
14/04/28 22:27:27 INFO mapred.JobClient: Map output records=18846
14/04/28 22:27:27 INFO mapred.JobClient: Combine input records=0
14/04/28 22:27:27 INFO mapred.JobClient: Reduce input records=18846
14/04/28 22:27:27 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:27 INFO mapred.JobClient: Running job: job_local1365608165_0004
14/04/28 22:27:27 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:27 INFO mapred.LocalJobRunner: Starting task: attempt_local1365608165_0004_m_000000_0
14/04/28 22:27:27 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3a3a5631
14/04/28 22:27:27 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-vectors/partial-vectors-0/part-r-00000:0+29314118
14/04/28 22:27:27 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:27:27 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:27:27 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:27:28 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:28 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:27:28 INFO mapred.MapTask: Finished spill 0
14/04/28 22:27:28 INFO mapred.Task: Task:attempt_local1365608165_0004_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:28 INFO mapred.LocalJobRunner:
14/04/28 22:27:28 INFO mapred.Task: Task 'attempt_local1365608165_0004_m_000000_0' done.
14/04/28 22:27:28 INFO mapred.LocalJobRunner: Finishing task: attempt_local1365608165_0004_m_000000_0
14/04/28 22:27:28 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:28 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@40eb0858
14/04/28 22:27:28 INFO mapred.LocalJobRunner:
14/04/28 22:27:28 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:27:28 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 29059394 bytes
14/04/28 22:27:28 INFO mapred.LocalJobRunner:
14/04/28 22:27:29 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:27:30 INFO mapred.Task: Task:attempt_local1365608165_0004_r_000000_0 is done. And is in the process of commiting
14/04/28 22:27:30 INFO mapred.LocalJobRunner:
14/04/28 22:27:30 INFO mapred.Task: Task attempt_local1365608165_0004_r_000000_0 is allowed to commit now
14/04/28 22:27:30 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1365608165_0004_r_000000_0' to /tmp/mahout-work-scott/20news-vectors/tf-vectors-toprune
14/04/28 22:27:30 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:27:30 INFO mapred.Task: Task 'attempt_local1365608165_0004_r_000000_0' done.
14/04/28 22:27:31 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:27:31 INFO mapred.JobClient: Job complete: job_local1365608165_0004
14/04/28 22:27:31 INFO mapred.JobClient: Counters: 20
14/04/28 22:27:31 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:31 INFO mapred.JobClient: Bytes Written=29543146
14/04/28 22:27:31 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:31 INFO mapred.JobClient: Bytes Read=29543146
14/04/28 22:27:31 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:31 INFO mapred.JobClient: FILE_BYTES_READ=339324382
14/04/28 22:27:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=304304560
14/04/28 22:27:31 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:31 INFO mapred.JobClient: Reduce input groups=18846
14/04/28 22:27:31 INFO mapred.JobClient: Map output materialized bytes=29059398
14/04/28 22:27:31 INFO mapred.JobClient: Combine output records=0
14/04/28 22:27:31 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:31 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:27:31 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:31 INFO mapred.JobClient: Reduce output records=18846
14/04/28 22:27:31 INFO mapred.JobClient: Spilled Records=37692
14/04/28 22:27:31 INFO mapred.JobClient: Map output bytes=28984080
14/04/28 22:27:31 INFO mapred.JobClient: Total committed heap usage (bytes)=323362816
14/04/28 22:27:31 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:31 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:31 INFO mapred.JobClient: SPLIT_RAW_BYTES=138
14/04/28 22:27:31 INFO mapred.JobClient: Map output records=18846
14/04/28 22:27:31 INFO mapred.JobClient: Combine input records=0
14/04/28 22:27:31 INFO mapred.JobClient: Reduce input records=18846
14/04/28 22:27:31 INFO common.HadoopUtil: Deleting /tmp/mahout-work-scott/20news-vectors/partial-vectors-0
14/04/28 22:27:31 INFO vectorizer.SparseVectorsFromSequenceFiles: Calculating IDF
14/04/28 22:27:31 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:31 INFO mapred.JobClient: Running job: job_local5281002_0005
14/04/28 22:27:31 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:31 INFO mapred.LocalJobRunner: Starting task: attempt_local5281002_0005_m_000000_0
14/04/28 22:27:31 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@320ff7d3
14/04/28 22:27:31 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-vectors/tf-vectors-toprune/part-r-00000:0+29314118
14/04/28 22:27:31 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:27:31 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:27:31 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:27:31 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:31 INFO mapred.MapTask: bufstart = 0; bufend = 3145728; bufvoid = 99614720
14/04/28 22:27:31 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length = 327680
14/04/28 22:27:32 INFO mapred.MapTask: Finished spill 0
14/04/28 22:27:32 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:32 INFO mapred.MapTask: bufstart = 3145728; bufend = 6291444; bufvoid = 99614720
14/04/28 22:27:32 INFO mapred.MapTask: kvstart = 262144; kvend = 196607; length = 327680
14/04/28 22:27:32 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:32 INFO mapred.MapTask: Finished spill 1
14/04/28 22:27:32 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:32 INFO mapred.MapTask: bufstart = 6291444; bufend = 9437160; bufvoid = 99614720
14/04/28 22:27:32 INFO mapred.MapTask: kvstart = 196607; kvend = 131070; length = 327680
14/04/28 22:27:32 INFO mapred.MapTask: Finished spill 2
14/04/28 22:27:32 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:32 INFO mapred.MapTask: bufstart = 9437160; bufend = 12582876; bufvoid = 99614720
14/04/28 22:27:32 INFO mapred.MapTask: kvstart = 131070; kvend = 65533; length = 327680
14/04/28 22:27:33 INFO mapred.MapTask: Finished spill 3
14/04/28 22:27:33 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:33 INFO mapred.MapTask: bufstart = 12582876; bufend = 15728604; bufvoid = 99614720
14/04/28 22:27:33 INFO mapred.MapTask: kvstart = 65533; kvend = 327677; length = 327680
14/04/28 22:27:33 INFO mapred.MapTask: Finished spill 4
14/04/28 22:27:33 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:33 INFO mapred.MapTask: bufstart = 15728604; bufend = 18874320; bufvoid = 99614720
14/04/28 22:27:33 INFO mapred.MapTask: kvstart = 327677; kvend = 262140; length = 327680
14/04/28 22:27:34 INFO mapred.MapTask: Finished spill 5
14/04/28 22:27:34 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:34 INFO mapred.MapTask: bufstart = 18874320; bufend = 22020036; bufvoid = 99614720
14/04/28 22:27:34 INFO mapred.MapTask: kvstart = 262140; kvend = 196603; length = 327680
14/04/28 22:27:34 INFO mapred.MapTask: Finished spill 6
14/04/28 22:27:34 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:34 INFO mapred.MapTask: bufstart = 22020036; bufend = 25165752; bufvoid = 99614720
14/04/28 22:27:34 INFO mapred.MapTask: kvstart = 196603; kvend = 131066; length = 327680
14/04/28 22:27:34 INFO mapred.MapTask: Finished spill 7
14/04/28 22:27:34 INFO mapred.MapTask: Spilling map output: record full = true
14/04/28 22:27:34 INFO mapred.MapTask: bufstart = 25165752; bufend = 28311468; bufvoid = 99614720
14/04/28 22:27:34 INFO mapred.MapTask: kvstart = 131066; kvend = 65529; length = 327680
14/04/28 22:27:35 INFO mapred.MapTask: Finished spill 8
14/04/28 22:27:35 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:27:35 INFO mapred.MapTask: Finished spill 9
14/04/28 22:27:35 INFO mapred.Merger: Merging 10 sorted segments
14/04/28 22:27:35 INFO mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 3570888 bytes
14/04/28 22:27:35 INFO mapred.Task: Task:attempt_local5281002_0005_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:35 INFO mapred.LocalJobRunner:
14/04/28 22:27:35 INFO mapred.Task: Task 'attempt_local5281002_0005_m_000000_0' done.
14/04/28 22:27:35 INFO mapred.LocalJobRunner: Finishing task: attempt_local5281002_0005_m_000000_0
14/04/28 22:27:35 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:35 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@223f066a
14/04/28 22:27:35 INFO mapred.LocalJobRunner:
14/04/28 22:27:35 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:27:35 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1309898 bytes
14/04/28 22:27:35 INFO mapred.LocalJobRunner:
14/04/28 22:27:35 INFO mapred.Task: Task:attempt_local5281002_0005_r_000000_0 is done. And is in the process of commiting
14/04/28 22:27:35 INFO mapred.LocalJobRunner:
14/04/28 22:27:35 INFO mapred.Task: Task attempt_local5281002_0005_r_000000_0 is allowed to commit now
14/04/28 22:27:35 INFO output.FileOutputCommitter: Saved output of task 'attempt_local5281002_0005_r_000000_0' to /tmp/mahout-work-scott/20news-vectors/df-count
14/04/28 22:27:35 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:27:35 INFO mapred.Task: Task 'attempt_local5281002_0005_r_000000_0' done.
14/04/28 22:27:36 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:27:36 INFO mapred.JobClient: Job complete: job_local5281002_0005
14/04/28 22:27:36 INFO mapred.JobClient: Counters: 20
14/04/28 22:27:36 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:36 INFO mapred.JobClient: Bytes Written=1904849
14/04/28 22:27:36 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:36 INFO mapred.JobClient: Bytes Read=29543146
14/04/28 22:27:36 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:36 INFO mapred.JobClient: FILE_BYTES_READ=438861350
14/04/28 22:27:36 INFO mapred.JobClient: FILE_BYTES_WRITTEN=348581673
14/04/28 22:27:36 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:36 INFO mapred.JobClient: Reduce input groups=93564
14/04/28 22:27:36 INFO mapred.JobClient: Map output materialized bytes=1309902
14/04/28 22:27:36 INFO mapred.JobClient: Combine output records=348626
14/04/28 22:27:36 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:36 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:27:36 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:36 INFO mapred.JobClient: Reduce output records=93564
14/04/28 22:27:36 INFO mapred.JobClient: Spilled Records=442190
14/04/28 22:27:36 INFO mapred.JobClient: Map output bytes=31005336
14/04/28 22:27:36 INFO mapred.JobClient: Total committed heap usage (bytes)=418766848
14/04/28 22:27:36 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:36 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:36 INFO mapred.JobClient: SPLIT_RAW_BYTES=139
14/04/28 22:27:36 INFO mapred.JobClient: Map output records=2583778
14/04/28 22:27:36 INFO mapred.JobClient: Combine input records=2838840
14/04/28 22:27:36 INFO mapred.JobClient: Reduce input records=93564
14/04/28 22:27:36 INFO vectorizer.SparseVectorsFromSequenceFiles: Pruning
14/04/28 22:27:36 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:36 INFO filecache.TrackerDistributedCacheManager: Creating frequency.file-0 in /tmp/hadoop-scott/mapred/local/archive/-530728777293952226_-346693812_683401803/file/tmp/mahout-work-scott/20news-vectors-work-8956470388650991617 with rwxr-xr-x
14/04/28 22:27:36 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/20news-vectors/frequency.file-0 as /tmp/hadoop-scott/mapred/local/archive/-530728777293952226_-346693812_683401803/file/tmp/mahout-work-scott/20news-vectors/frequency.file-0
14/04/28 22:27:36 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/20news-vectors/frequency.file-0 as /tmp/hadoop-scott/mapred/local/archive/-530728777293952226_-346693812_683401803/file/tmp/mahout-work-scott/20news-vectors/frequency.file-0
14/04/28 22:27:36 INFO mapred.JobClient: Running job: job_local948597543_0006
14/04/28 22:27:36 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:36 INFO mapred.LocalJobRunner: Starting task: attempt_local948597543_0006_m_000000_0
14/04/28 22:27:36 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2a839589
14/04/28 22:27:36 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-vectors/tf-vectors-toprune/part-r-00000:0+29314118
14/04/28 22:27:36 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:27:36 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:27:36 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:27:37 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:27:37 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:37 INFO compress.CodecPool: Got brand-new compressor
14/04/28 22:27:39 INFO mapred.MapTask: Finished spill 0
14/04/28 22:27:39 INFO mapred.Task: Task:attempt_local948597543_0006_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:39 INFO mapred.LocalJobRunner:
14/04/28 22:27:39 INFO mapred.Task: Task 'attempt_local948597543_0006_m_000000_0' done.
14/04/28 22:27:39 INFO mapred.LocalJobRunner: Finishing task: attempt_local948597543_0006_m_000000_0
14/04/28 22:27:39 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:39 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@40c5752e
14/04/28 22:27:39 INFO mapred.LocalJobRunner:
14/04/28 22:27:39 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:27:39 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7741581 bytes
14/04/28 22:27:39 INFO mapred.LocalJobRunner:
14/04/28 22:27:39 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:27:40 INFO mapred.Task: Task:attempt_local948597543_0006_r_000000_0 is done. And is in the process of commiting
14/04/28 22:27:40 INFO mapred.LocalJobRunner:
14/04/28 22:27:40 INFO mapred.Task: Task attempt_local948597543_0006_r_000000_0 is allowed to commit now
14/04/28 22:27:40 INFO output.FileOutputCommitter: Saved output of task 'attempt_local948597543_0006_r_000000_0' to /tmp/mahout-work-scott/20news-vectors/tf-vectors-partial/partial-0
14/04/28 22:27:40 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:27:40 INFO mapred.Task: Task 'attempt_local948597543_0006_r_000000_0' done.
14/04/28 22:27:40 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:27:40 INFO mapred.JobClient: Job complete: job_local948597543_0006
14/04/28 22:27:40 INFO mapred.JobClient: Counters: 20
14/04/28 22:27:40 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:40 INFO mapred.JobClient: Bytes Written=28913427
14/04/28 22:27:40 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:40 INFO mapred.JobClient: Bytes Read=29543146
14/04/28 22:27:40 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:40 INFO mapred.JobClient: FILE_BYTES_READ=519462834
14/04/28 22:27:40 INFO mapred.JobClient: FILE_BYTES_WRITTEN=405572535
14/04/28 22:27:40 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:40 INFO mapred.JobClient: Reduce input groups=18846
14/04/28 22:27:40 INFO mapred.JobClient: Map output materialized bytes=7741585
14/04/28 22:27:40 INFO mapred.JobClient: Combine output records=0
14/04/28 22:27:40 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:40 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:27:40 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:40 INFO mapred.JobClient: Reduce output records=18846
14/04/28 22:27:40 INFO mapred.JobClient: Spilled Records=37692
14/04/28 22:27:40 INFO mapred.JobClient: Map output bytes=28984080
14/04/28 22:27:40 INFO mapred.JobClient: Total committed heap usage (bytes)=262684672
14/04/28 22:27:40 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:40 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:40 INFO mapred.JobClient: SPLIT_RAW_BYTES=139
14/04/28 22:27:40 INFO mapred.JobClient: Map output records=18846
14/04/28 22:27:40 INFO mapred.JobClient: Combine input records=0
14/04/28 22:27:40 INFO mapred.JobClient: Reduce input records=18846
14/04/28 22:27:40 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:41 INFO mapred.JobClient: Running job: job_local2145934728_0007
14/04/28 22:27:41 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:41 INFO mapred.LocalJobRunner: Starting task: attempt_local2145934728_0007_m_000000_0
14/04/28 22:27:41 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@380bf630
14/04/28 22:27:41 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-vectors/tf-vectors-partial/partial-0/part-r-00000:0+28689283
14/04/28 22:27:41 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:27:41 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:27:41 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:27:41 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:27:42 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:42 INFO mapred.MapTask: Finished spill 0
14/04/28 22:27:42 INFO mapred.Task: Task:attempt_local2145934728_0007_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:42 INFO mapred.LocalJobRunner:
14/04/28 22:27:42 INFO mapred.Task: Task 'attempt_local2145934728_0007_m_000000_0' done.
14/04/28 22:27:42 INFO mapred.LocalJobRunner: Finishing task: attempt_local2145934728_0007_m_000000_0
14/04/28 22:27:42 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:42 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@72ebd7a9
14/04/28 22:27:42 INFO mapred.LocalJobRunner:
14/04/28 22:27:42 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:27:42 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 28437746 bytes
14/04/28 22:27:42 INFO mapred.LocalJobRunner:
14/04/28 22:27:43 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:27:44 INFO mapred.Task: Task:attempt_local2145934728_0007_r_000000_0 is done. And is in the process of commiting
14/04/28 22:27:44 INFO mapred.LocalJobRunner:
14/04/28 22:27:44 INFO mapred.Task: Task attempt_local2145934728_0007_r_000000_0 is allowed to commit now
14/04/28 22:27:44 INFO output.FileOutputCommitter: Saved output of task 'attempt_local2145934728_0007_r_000000_0' to /tmp/mahout-work-scott/20news-vectors/tf-vectors
14/04/28 22:27:44 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:27:44 INFO mapred.Task: Task 'attempt_local2145934728_0007_r_000000_0' done.
14/04/28 22:27:45 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:27:45 INFO mapred.JobClient: Job complete: job_local2145934728_0007
14/04/28 22:27:45 INFO mapred.JobClient: Counters: 20
14/04/28 22:27:45 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:45 INFO mapred.JobClient: Bytes Written=28913427
14/04/28 22:27:45 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:45 INFO mapred.JobClient: Bytes Read=28913427
14/04/28 22:27:45 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:45 INFO mapred.JobClient: FILE_BYTES_READ=618313392
14/04/28 22:27:45 INFO mapred.JobClient: FILE_BYTES_WRITTEN=523342243
14/04/28 22:27:45 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:45 INFO mapred.JobClient: Reduce input groups=18846
14/04/28 22:27:45 INFO mapred.JobClient: Map output materialized bytes=28437750
14/04/28 22:27:45 INFO mapred.JobClient: Combine output records=0
14/04/28 22:27:45 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:45 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:27:45 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:45 INFO mapred.JobClient: Reduce output records=18846
14/04/28 22:27:45 INFO mapred.JobClient: Spilled Records=37692
14/04/28 22:27:45 INFO mapred.JobClient: Map output bytes=28362505
14/04/28 22:27:45 INFO mapred.JobClient: Total committed heap usage (bytes)=262684672
14/04/28 22:27:45 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:45 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:45 INFO mapred.JobClient: SPLIT_RAW_BYTES=149
14/04/28 22:27:45 INFO mapred.JobClient: Map output records=18846
14/04/28 22:27:45 INFO mapred.JobClient: Combine input records=0
14/04/28 22:27:45 INFO mapred.JobClient: Reduce input records=18846
14/04/28 22:27:45 INFO common.HadoopUtil: Deleting /tmp/mahout-work-scott/20news-vectors/tf-vectors-partial
14/04/28 22:27:45 INFO common.HadoopUtil: Deleting /tmp/mahout-work-scott/20news-vectors/tf-vectors-toprune
14/04/28 22:27:45 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:45 INFO filecache.TrackerDistributedCacheManager: Creating frequency.file-0 in /tmp/hadoop-scott/mapred/local/archive/6460561726167307151_-346693812_683401803/file/tmp/mahout-work-scott/20news-vectors-work--4586898281651577675 with rwxr-xr-x
14/04/28 22:27:45 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/20news-vectors/frequency.file-0 as /tmp/hadoop-scott/mapred/local/archive/6460561726167307151_-346693812_683401803/file/tmp/mahout-work-scott/20news-vectors/frequency.file-0
14/04/28 22:27:45 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/20news-vectors/frequency.file-0 as /tmp/hadoop-scott/mapred/local/archive/6460561726167307151_-346693812_683401803/file/tmp/mahout-work-scott/20news-vectors/frequency.file-0
14/04/28 22:27:45 INFO mapred.JobClient: Running job: job_local1419874061_0008
14/04/28 22:27:45 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:45 INFO mapred.LocalJobRunner: Starting task: attempt_local1419874061_0008_m_000000_0
14/04/28 22:27:45 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@76a8d33
14/04/28 22:27:45 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-vectors/tf-vectors/part-r-00000:0+28689283
14/04/28 22:27:45 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:27:45 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:27:45 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:27:46 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:27:46 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:46 INFO mapred.MapTask: Finished spill 0
14/04/28 22:27:46 INFO mapred.Task: Task:attempt_local1419874061_0008_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:46 INFO mapred.LocalJobRunner:
14/04/28 22:27:46 INFO mapred.Task: Task 'attempt_local1419874061_0008_m_000000_0' done.
14/04/28 22:27:46 INFO mapred.LocalJobRunner: Finishing task: attempt_local1419874061_0008_m_000000_0
14/04/28 22:27:46 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:46 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@c1599fc
14/04/28 22:27:46 INFO mapred.LocalJobRunner:
14/04/28 22:27:46 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:27:46 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 28437746 bytes
14/04/28 22:27:46 INFO mapred.LocalJobRunner:
14/04/28 22:27:47 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:27:47 INFO mapred.Task: Task:attempt_local1419874061_0008_r_000000_0 is done. And is in the process of commiting
14/04/28 22:27:47 INFO mapred.LocalJobRunner:
14/04/28 22:27:47 INFO mapred.Task: Task attempt_local1419874061_0008_r_000000_0 is allowed to commit now
14/04/28 22:27:47 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1419874061_0008_r_000000_0' to /tmp/mahout-work-scott/20news-vectors/partial-vectors-0
14/04/28 22:27:47 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:27:47 INFO mapred.Task: Task 'attempt_local1419874061_0008_r_000000_0' done.
14/04/28 22:27:48 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:27:48 INFO mapred.JobClient: Job complete: job_local1419874061_0008
14/04/28 22:27:48 INFO mapred.JobClient: Counters: 20
14/04/28 22:27:48 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:48 INFO mapred.JobClient: Bytes Written=28913427
14/04/28 22:27:48 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:48 INFO mapred.JobClient: Bytes Read=28913427
14/04/28 22:27:48 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:48 INFO mapred.JobClient: FILE_BYTES_READ=741669737
14/04/28 22:27:48 INFO mapred.JobClient: FILE_BYTES_WRITTEN=644925879
14/04/28 22:27:48 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:48 INFO mapred.JobClient: Reduce input groups=18846
14/04/28 22:27:48 INFO mapred.JobClient: Map output materialized bytes=28437750
14/04/28 22:27:48 INFO mapred.JobClient: Combine output records=0
14/04/28 22:27:48 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:48 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:27:48 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:48 INFO mapred.JobClient: Reduce output records=18846
14/04/28 22:27:48 INFO mapred.JobClient: Spilled Records=37692
14/04/28 22:27:48 INFO mapred.JobClient: Map output bytes=28362505
14/04/28 22:27:48 INFO mapred.JobClient: Total committed heap usage (bytes)=262684672
14/04/28 22:27:48 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:48 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:48 INFO mapred.JobClient: SPLIT_RAW_BYTES=131
14/04/28 22:27:48 INFO mapred.JobClient: Map output records=18846
14/04/28 22:27:48 INFO mapred.JobClient: Combine input records=0
14/04/28 22:27:48 INFO mapred.JobClient: Reduce input records=18846
14/04/28 22:27:48 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:27:48 INFO mapred.JobClient: Running job: job_local1419986101_0009
14/04/28 22:27:48 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:27:48 INFO mapred.LocalJobRunner: Starting task: attempt_local1419986101_0009_m_000000_0
14/04/28 22:27:48 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@de8e8dd
14/04/28 22:27:48 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-vectors/partial-vectors-0/part-r-00000:0+28689283
14/04/28 22:27:48 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:27:48 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:27:48 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:27:49 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:27:49 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:27:49 INFO mapred.MapTask: Finished spill 0
14/04/28 22:27:49 INFO mapred.Task: Task:attempt_local1419986101_0009_m_000000_0 is done. And is in the process of commiting
14/04/28 22:27:49 INFO mapred.LocalJobRunner:
14/04/28 22:27:49 INFO mapred.Task: Task 'attempt_local1419986101_0009_m_000000_0' done.
14/04/28 22:27:49 INFO mapred.LocalJobRunner: Finishing task: attempt_local1419986101_0009_m_000000_0
14/04/28 22:27:49 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:27:49 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d26c673
14/04/28 22:27:49 INFO mapred.LocalJobRunner:
14/04/28 22:27:49 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:27:49 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 28437746 bytes
14/04/28 22:27:49 INFO mapred.LocalJobRunner:
14/04/28 22:27:50 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:27:50 INFO mapred.Task: Task:attempt_local1419986101_0009_r_000000_0 is done. And is in the process of commiting
14/04/28 22:27:50 INFO mapred.LocalJobRunner:
14/04/28 22:27:50 INFO mapred.Task: Task attempt_local1419986101_0009_r_000000_0 is allowed to commit now
14/04/28 22:27:50 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1419986101_0009_r_000000_0' to /tmp/mahout-work-scott/20news-vectors/tfidf-vectors
14/04/28 22:27:50 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:27:50 INFO mapred.Task: Task 'attempt_local1419986101_0009_r_000000_0' done.
14/04/28 22:27:51 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:27:51 INFO mapred.JobClient: Job complete: job_local1419986101_0009
14/04/28 22:27:51 INFO mapred.JobClient: Counters: 20
14/04/28 22:27:51 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:27:51 INFO mapred.JobClient: Bytes Written=28913427
14/04/28 22:27:51 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:27:51 INFO mapred.JobClient: Bytes Read=28913427
14/04/28 22:27:51 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:27:51 INFO mapred.JobClient: FILE_BYTES_READ=861216438
14/04/28 22:27:51 INFO mapred.JobClient: FILE_BYTES_WRITTEN=762696669
14/04/28 22:27:51 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:27:51 INFO mapred.JobClient: Reduce input groups=18846
14/04/28 22:27:51 INFO mapred.JobClient: Map output materialized bytes=28437750
14/04/28 22:27:51 INFO mapred.JobClient: Combine output records=0
14/04/28 22:27:51 INFO mapred.JobClient: Map input records=18846
14/04/28 22:27:51 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:27:51 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:27:51 INFO mapred.JobClient: Reduce output records=18846
14/04/28 22:27:51 INFO mapred.JobClient: Spilled Records=37692
14/04/28 22:27:51 INFO mapred.JobClient: Map output bytes=28362505
14/04/28 22:27:51 INFO mapred.JobClient: Total committed heap usage (bytes)=262684672
14/04/28 22:27:51 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:27:51 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:27:51 INFO mapred.JobClient: SPLIT_RAW_BYTES=138
14/04/28 22:27:51 INFO mapred.JobClient: Map output records=18846
14/04/28 22:27:51 INFO mapred.JobClient: Combine input records=0
14/04/28 22:27:51 INFO mapred.JobClient: Reduce input records=18846
14/04/28 22:27:51 INFO common.HadoopUtil: Deleting /tmp/mahout-work-scott/20news-vectors/partial-vectors-0
14/04/28 22:27:51 INFO driver.MahoutDriver: Program took 52187 ms (Minutes: 0.8697833333333334)
+ echo 'Creating training and holdout set with a random 80-20 split of the generated vector dataset'
Creating training and holdout set with a random 80-20 split of the generated vector dataset
+ ./bin/mahout split -i /tmp/mahout-work-scott/20news-vectors/tfidf-vectors --trainingOutput /tmp/mahout-work-scott/20news-train-vectors --testOutput /tmp/mahout-work-scott/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.

MAHOUT_LOCAL is set, running locally
14/04/28 22:27:52 WARN driver.MahoutDriver: No split.props found on classpath, will use command-line arguments only
14/04/28 22:27:52 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-scott/20news-vectors/tfidf-vectors], --method=[sequential], --overwrite=null, --randomSelectionPct=[40], --sequenceFiles=null, --startPhase=[0], --tempDir=[temp], --testOutput=[/tmp/mahout-work-scott/20news-test-vectors], --trainingOutput=[/tmp/mahout-work-scott/20news-train-vectors]}
14/04/28 22:27:55 INFO utils.SplitInput: part-r-00000 has 162419 lines
14/04/28 22:27:55 INFO utils.SplitInput: part-r-00000 test split size is 64968 based on random selection percentage 40
14/04/28 22:27:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/28 22:27:55 INFO compress.CodecPool: Got brand-new compressor
14/04/28 22:27:55 INFO compress.CodecPool: Got brand-new compressor
14/04/28 22:28:01 INFO utils.SplitInput: file: part-r-00000, input: 162419 train: 11218, test: 7628 starting at 0
14/04/28 22:28:01 INFO driver.MahoutDriver: Program took 8560 ms (Minutes: 0.14266666666666666)
+ echo 'Training Naive Bayes model'
Training Naive Bayes model
+ ./bin/mahout trainnb -i /tmp/mahout-work-scott/20news-train-vectors -el -o /tmp/mahout-work-scott/model -li /tmp/mahout-work-scott/labelindex -ow -c
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
14/04/28 22:28:01 WARN driver.MahoutDriver: No trainnb.props found on classpath, will use command-line arguments only
14/04/28 22:28:02 INFO common.AbstractJob: Command line arguments: {--alphaI=[1.0], --endPhase=[2147483647], --extractLabels=null, --input=[/tmp/mahout-work-scott/20news-train-vectors], --labelIndex=[/tmp/mahout-work-scott/labelindex], --output=[/tmp/mahout-work-scott/model], --overwrite=null, --startPhase=[0], --tempDir=[temp], --trainComplementary=null}
14/04/28 22:28:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/28 22:28:02 INFO compress.CodecPool: Got brand-new decompressor
14/04/28 22:28:06 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:28:06 INFO filecache.TrackerDistributedCacheManager: Creating labelindex in /tmp/hadoop-scott/mapred/local/archive/-4321922932105260197_436852958_683431803/file/tmp/mahout-work-scott-work--4826842676929308456 with rwxr-xr-x
14/04/28 22:28:06 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/labelindex as /tmp/hadoop-scott/mapred/local/archive/-4321922932105260197_436852958_683431803/file/tmp/mahout-work-scott/labelindex
14/04/28 22:28:06 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/labelindex as /tmp/hadoop-scott/mapred/local/archive/-4321922932105260197_436852958_683431803/file/tmp/mahout-work-scott/labelindex
14/04/28 22:28:06 INFO mapred.JobClient: Running job: job_local809281093_0001
14/04/28 22:28:07 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:28:07 INFO mapred.LocalJobRunner: Starting task: attempt_local809281093_0001_m_000000_0
14/04/28 22:28:07 INFO util.ProcessTree: setsid exited with exit code 0
14/04/28 22:28:07 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@44d0d6fd
14/04/28 22:28:07 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-train-vectors/part-r-00000:0+12644280
14/04/28 22:28:07 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:28:07 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:28:07 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:28:07 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:28:10 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:28:10 INFO compress.CodecPool: Got brand-new compressor
14/04/28 22:28:11 INFO mapred.MapTask: Finished spill 0
14/04/28 22:28:11 INFO mapred.Task: Task:attempt_local809281093_0001_m_000000_0 is done. And is in the process of commiting
14/04/28 22:28:11 INFO mapred.LocalJobRunner:
14/04/28 22:28:11 INFO mapred.Task: Task 'attempt_local809281093_0001_m_000000_0' done.
14/04/28 22:28:11 INFO mapred.LocalJobRunner: Finishing task: attempt_local809281093_0001_m_000000_0
14/04/28 22:28:11 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:28:11 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@541569f1
14/04/28 22:28:11 INFO mapred.LocalJobRunner:
14/04/28 22:28:11 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:28:11 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1539370 bytes
14/04/28 22:28:11 INFO mapred.LocalJobRunner:
14/04/28 22:28:12 INFO mapred.Task: Task:attempt_local809281093_0001_r_000000_0 is done. And is in the process of commiting
14/04/28 22:28:12 INFO mapred.LocalJobRunner:
14/04/28 22:28:12 INFO mapred.Task: Task attempt_local809281093_0001_r_000000_0 is allowed to commit now
14/04/28 22:28:12 INFO output.FileOutputCommitter: Saved output of task 'attempt_local809281093_0001_r_000000_0' to temp/summedObservations
14/04/28 22:28:12 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:28:12 INFO mapred.Task: Task 'attempt_local809281093_0001_r_000000_0' done.
14/04/28 22:28:12 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:28:12 INFO mapred.JobClient: Job complete: job_local809281093_0001
14/04/28 22:28:12 INFO mapred.JobClient: Counters: 20
14/04/28 22:28:12 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:28:12 INFO mapred.JobClient: Bytes Written=2811291
14/04/28 22:28:12 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:28:12 INFO mapred.JobClient: Bytes Read=12743072
14/04/28 22:28:12 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:28:12 INFO mapred.JobClient: FILE_BYTES_READ=55453850
14/04/28 22:28:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=8963149
14/04/28 22:28:12 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:28:12 INFO mapred.JobClient: Reduce input groups=20
14/04/28 22:28:12 INFO mapred.JobClient: Map output materialized bytes=1539374
14/04/28 22:28:12 INFO mapred.JobClient: Combine output records=20
14/04/28 22:28:12 INFO mapred.JobClient: Map input records=11218
14/04/28 22:28:12 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:28:12 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:28:12 INFO mapred.JobClient: Reduce output records=20
14/04/28 22:28:12 INFO mapred.JobClient: Spilled Records=40
14/04/28 22:28:12 INFO mapred.JobClient: Map output bytes=16584989
14/04/28 22:28:12 INFO mapred.JobClient: Total committed heap usage (bytes)=262676480
14/04/28 22:28:12 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:28:12 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:28:12 INFO mapred.JobClient: SPLIT_RAW_BYTES=126
14/04/28 22:28:12 INFO mapred.JobClient: Map output records=11218
14/04/28 22:28:12 INFO mapred.JobClient: Combine input records=11218
14/04/28 22:28:12 INFO mapred.JobClient: Reduce input records=20
14/04/28 22:28:12 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:28:12 INFO filecache.TrackerDistributedCacheManager: Creating labelindex in /tmp/hadoop-scott/mapred/local/archive/-1966525460004222018_436852958_683431803/file/tmp/mahout-work-scott-work-9068526167211295814 with rwxr-xr-x
14/04/28 22:28:12 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/labelindex as /tmp/hadoop-scott/mapred/local/archive/-1966525460004222018_436852958_683431803/file/tmp/mahout-work-scott/labelindex
14/04/28 22:28:12 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/labelindex as /tmp/hadoop-scott/mapred/local/archive/-1966525460004222018_436852958_683431803/file/tmp/mahout-work-scott/labelindex
14/04/28 22:28:12 INFO mapred.JobClient: Running job: job_local404673375_0002
14/04/28 22:28:12 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:28:12 INFO mapred.LocalJobRunner: Starting task: attempt_local404673375_0002_m_000000_0
14/04/28 22:28:12 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@14a8ca58
14/04/28 22:28:13 INFO mapred.MapTask: Processing split: file:/opt/mahout-0.9/temp/summedObservations/part-r-00000:0+2789487
14/04/28 22:28:13 INFO mapred.MapTask: io.sort.mb = 100
14/04/28 22:28:13 INFO mapred.MapTask: data buffer = 79691776/99614720
14/04/28 22:28:13 INFO mapred.MapTask: record buffer = 262144/327680
14/04/28 22:28:13 INFO mapred.MapTask: Starting flush of map output
14/04/28 22:28:13 INFO mapred.MapTask: Finished spill 0
14/04/28 22:28:13 INFO mapred.Task: Task:attempt_local404673375_0002_m_000000_0 is done. And is in the process of commiting
14/04/28 22:28:13 INFO mapred.LocalJobRunner:
14/04/28 22:28:13 INFO mapred.Task: Task 'attempt_local404673375_0002_m_000000_0' done.
14/04/28 22:28:13 INFO mapred.LocalJobRunner: Finishing task: attempt_local404673375_0002_m_000000_0
14/04/28 22:28:13 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:28:13 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@51a3879b
14/04/28 22:28:13 INFO mapred.LocalJobRunner:
14/04/28 22:28:13 INFO mapred.Merger: Merging 1 sorted segments
14/04/28 22:28:13 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 454990 bytes
14/04/28 22:28:13 INFO mapred.LocalJobRunner:
14/04/28 22:28:13 INFO mapred.Task: Task:attempt_local404673375_0002_r_000000_0 is done. And is in the process of commiting
14/04/28 22:28:13 INFO mapred.LocalJobRunner:
14/04/28 22:28:13 INFO mapred.Task: Task attempt_local404673375_0002_r_000000_0 is allowed to commit now
14/04/28 22:28:13 INFO output.FileOutputCommitter: Saved output of task 'attempt_local404673375_0002_r_000000_0' to temp/weights
14/04/28 22:28:13 INFO mapred.LocalJobRunner: reduce > reduce
14/04/28 22:28:13 INFO mapred.Task: Task 'attempt_local404673375_0002_r_000000_0' done.
14/04/28 22:28:13 INFO mapred.JobClient: map 100% reduce 100%
14/04/28 22:28:13 INFO mapred.JobClient: Job complete: job_local404673375_0002
14/04/28 22:28:13 INFO mapred.JobClient: Counters: 20
14/04/28 22:28:13 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:28:13 INFO mapred.JobClient: Bytes Written=929301
14/04/28 22:28:13 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:28:13 INFO mapred.JobClient: Bytes Read=2811291
14/04/28 22:28:13 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:28:13 INFO mapred.JobClient: FILE_BYTES_READ=66011632
14/04/28 22:28:13 INFO mapred.JobClient: FILE_BYTES_WRITTEN=16685863
14/04/28 22:28:13 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:28:13 INFO mapred.JobClient: Reduce input groups=2
14/04/28 22:28:13 INFO mapred.JobClient: Map output materialized bytes=454994
14/04/28 22:28:13 INFO mapred.JobClient: Combine output records=2
14/04/28 22:28:13 INFO mapred.JobClient: Map input records=20
14/04/28 22:28:14 INFO mapred.JobClient: Reduce shuffle bytes=0
14/04/28 22:28:14 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:28:14 INFO mapred.JobClient: Reduce output records=2
14/04/28 22:28:14 INFO mapred.JobClient: Spilled Records=4
14/04/28 22:28:14 INFO mapred.JobClient: Map output bytes=921963
14/04/28 22:28:14 INFO mapred.JobClient: Total committed heap usage (bytes)=352329728
14/04/28 22:28:14 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:28:14 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:28:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=122
14/04/28 22:28:14 INFO mapred.JobClient: Map output records=2
14/04/28 22:28:14 INFO mapred.JobClient: Combine input records=2
14/04/28 22:28:14 INFO mapred.JobClient: Reduce input records=2
14/04/28 22:28:14 INFO driver.MahoutDriver: Program took 12490 ms (Minutes: 0.20816666666666667)
+ echo 'Self testing on training set'
Self testing on training set
+ ./bin/mahout testnb -i /tmp/mahout-work-scott/20news-train-vectors -m /tmp/mahout-work-scott/model -l /tmp/mahout-work-scott/labelindex -ow -o /tmp/mahout-work-scott/20news-testing -c
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.

MAHOUT_LOCAL is set, running locally
14/04/28 22:28:15 WARN driver.MahoutDriver: No testnb.props found on classpath, will use command-line arguments only
14/04/28 22:28:15 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-scott/20news-train-vectors], --labelIndex=[/tmp/mahout-work-scott/labelindex], --model=[/tmp/mahout-work-scott/model], --output=[/tmp/mahout-work-scott/20news-testing], --overwrite=null, --startPhase=[0], --tempDir=[temp], --testComplementary=null}
14/04/28 22:28:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/28 22:28:16 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:28:16 INFO filecache.TrackerDistributedCacheManager: Creating model in /tmp/hadoop-scott/mapred/local/archive/-814385045777060146_-465374039_683439803/file/tmp/mahout-work-scott-work-2724059346679748531 with rwxr-xr-x
14/04/28 22:28:16 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/model as /tmp/hadoop-scott/mapred/local/archive/-814385045777060146_-465374039_683439803/file/tmp/mahout-work-scott/model
14/04/28 22:28:16 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/model as /tmp/hadoop-scott/mapred/local/archive/-814385045777060146_-465374039_683439803/file/tmp/mahout-work-scott/model
14/04/28 22:28:16 INFO mapred.JobClient: Running job: job_local1537490313_0001
14/04/28 22:28:16 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:28:16 INFO mapred.LocalJobRunner: Starting task: attempt_local1537490313_0001_m_000000_0
14/04/28 22:28:17 INFO util.ProcessTree: setsid exited with exit code 0
14/04/28 22:28:17 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d702d0
14/04/28 22:28:17 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-train-vectors/part-r-00000:0+12644280
14/04/28 22:28:17 INFO compress.CodecPool: Got brand-new decompressor
14/04/28 22:28:17 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:28:23 INFO mapred.LocalJobRunner:
14/04/28 22:28:23 INFO mapred.JobClient: map 49% reduce 0%
14/04/28 22:28:26 INFO mapred.LocalJobRunner:
14/04/28 22:28:26 INFO mapred.JobClient: map 83% reduce 0%
14/04/28 22:28:28 INFO mapred.Task: Task:attempt_local1537490313_0001_m_000000_0 is done. And is in the process of commiting
14/04/28 22:28:28 INFO mapred.LocalJobRunner:
14/04/28 22:28:28 INFO mapred.Task: Task attempt_local1537490313_0001_m_000000_0 is allowed to commit now
14/04/28 22:28:28 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1537490313_0001_m_000000_0' to /tmp/mahout-work-scott/20news-testing
14/04/28 22:28:28 INFO mapred.LocalJobRunner:
14/04/28 22:28:28 INFO mapred.Task: Task 'attempt_local1537490313_0001_m_000000_0' done.
14/04/28 22:28:28 INFO mapred.LocalJobRunner: Finishing task: attempt_local1537490313_0001_m_000000_0
14/04/28 22:28:28 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:28:28 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:28:28 INFO mapred.JobClient: Job complete: job_local1537490313_0001
14/04/28 22:28:28 INFO mapred.JobClient: Counters: 12
14/04/28 22:28:28 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:28:28 INFO mapred.JobClient: Bytes Written=2129857
14/04/28 22:28:28 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:28:28 INFO mapred.JobClient: Bytes Read=12743072
14/04/28 22:28:28 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:28:28 INFO mapred.JobClient: FILE_BYTES_READ=21692592
14/04/28 22:28:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=7404146
14/04/28 22:28:28 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:28:28 INFO mapred.JobClient: Map input records=11218
14/04/28 22:28:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:28:28 INFO mapred.JobClient: Spilled Records=0
14/04/28 22:28:28 INFO mapred.JobClient: Total committed heap usage (bytes)=31653888
14/04/28 22:28:28 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:28:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:28:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=126
14/04/28 22:28:28 INFO mapred.JobClient: Map output records=11218
14/04/28 22:28:29 INFO test.TestNaiveBayesDriver: Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 11087 98.8322%
Incorrectly Classified Instances : 131 1.1678%
Total Classified Instances : 11218

=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
477 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 | 479 a = alt.atheism
0 544 3 1 1 2 0 0 0 0 0 2 0 0 1 0 0 0 0 0 | 554 b = comp.graphics
0 2 577 7 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 | 589 c = comp.os.ms-windows.misc
1 0 1 607 1 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 | 613 d = comp.sys.ibm.pc.hardware
0 1 2 1 549 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 | 556 e = comp.sys.mac.hardware
0 2 3 0 0 609 0 0 0 0 1 0 0 0 0 0 0 0 0 0 | 615 f = comp.windows.x
0 1 0 1 4 0 546 3 0 1 1 0 4 1 0 0 0 0 0 0 | 562 g = misc.forsale
0 0 0 0 0 0 0 573 4 0 0 0 1 0 0 0 0 0 0 1 | 579 h = rec.autos
0 0 0 0 0 0 1 0 598 0 0 0 0 0 0 0 0 0 0 0 | 599 i = rec.motorcycles
0 0 0 0 0 0 0 0 0 603 3 0 0 1 1 0 0 0 0 0 | 608 j = rec.sport.baseball
0 0 0 0 0 0 0 0 0 1 609 0 0 0 0 0 0 0 0 0 | 610 k = rec.sport.hockey
0 0 0 0 0 0 0 0 0 0 0 587 0 0 0 0 0 1 0 0 | 588 l = sci.crypt
0 0 0 4 0 0 1 0 1 0 0 0 576 0 2 0 0 0 0 0 | 584 m = sci.electronics
0 0 0 0 0 0 0 0 0 0 0 0 1 610 0 0 0 0 0 0 | 611 n = sci.med
0 0 0 0 0 0 0 0 0 0 0 0 0 1 585 0 0 0 0 0 | 586 o = sci.space
0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 591 0 0 0 0 | 593 p = soc.religion.christian
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 526 0 0 0 | 528 q = talk.politics.mideast
0 0 1 0 0 0 0 0 0 0 0 3 0 0 1 0 0 531 0 0 | 536 r = talk.politics.guns
17 0 0 0 0 0 0 0 0 0 0 0 0 1 2 6 2 1 334 1 | 364 s = talk.religion.misc
0 0 1 0 0 0 0 0 0 0 1 2 0 0 0 0 0 5 0 455 | 464 t = talk.politics.misc

=======================================================
Statistics
-------------------------------------------------------
Kappa 0.9797
Accuracy 98.8322%
Reliability 93.9898%
Reliability (standard deviation) 0.2161

14/04/28 22:28:29 INFO driver.MahoutDriver: Program took 14215 ms (Minutes: 0.23691666666666666)
+ echo 'Testing on holdout set'
Testing on holdout set
+ ./bin/mahout testnb -i /tmp/mahout-work-scott/20news-test-vectors -m /tmp/mahout-work-scott/model -l /tmp/mahout-work-scott/labelindex -ow -o /tmp/mahout-work-scott/20news-testing -c
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
14/04/28 22:28:30 WARN driver.MahoutDriver: No testnb.props found on classpath, will use command-line arguments only
14/04/28 22:28:30 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[/tmp/mahout-work-scott/20news-test-vectors], --labelIndex=[/tmp/mahout-work-scott/labelindex], --model=[/tmp/mahout-work-scott/model], --output=[/tmp/mahout-work-scott/20news-testing], --overwrite=null, --startPhase=[0], --tempDir=[temp], --testComplementary=null}
14/04/28 22:28:31 INFO common.HadoopUtil: Deleting /tmp/mahout-work-scott/20news-testing
14/04/28 22:28:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/04/28 22:28:31 INFO input.FileInputFormat: Total input paths to process : 1
14/04/28 22:28:31 INFO filecache.TrackerDistributedCacheManager: Creating model in /tmp/hadoop-scott/mapred/local/archive/2181784352082857606_-465374039_683439803/file/tmp/mahout-work-scott-work-2099002446635525681 with rwxr-xr-x
14/04/28 22:28:31 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/model as /tmp/hadoop-scott/mapred/local/archive/2181784352082857606_-465374039_683439803/file/tmp/mahout-work-scott/model
14/04/28 22:28:31 INFO filecache.TrackerDistributedCacheManager: Cached /tmp/mahout-work-scott/model as /tmp/hadoop-scott/mapred/local/archive/2181784352082857606_-465374039_683439803/file/tmp/mahout-work-scott/model
14/04/28 22:28:31 INFO mapred.JobClient: Running job: job_local1178427149_0001
14/04/28 22:28:31 INFO mapred.LocalJobRunner: Waiting for map tasks
14/04/28 22:28:31 INFO mapred.LocalJobRunner: Starting task: attempt_local1178427149_0001_m_000000_0
14/04/28 22:28:32 INFO util.ProcessTree: setsid exited with exit code 0
14/04/28 22:28:32 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d702d0
14/04/28 22:28:32 INFO mapred.MapTask: Processing split: file:/tmp/mahout-work-scott/20news-test-vectors/part-r-00000:0+8709952
14/04/28 22:28:32 INFO compress.CodecPool: Got brand-new decompressor
14/04/28 22:28:32 INFO mapred.JobClient: map 0% reduce 0%
14/04/28 22:28:38 INFO mapred.LocalJobRunner:
14/04/28 22:28:38 INFO mapred.JobClient: map 70% reduce 0%
14/04/28 22:28:39 INFO mapred.Task: Task:attempt_local1178427149_0001_m_000000_0 is done. And is in the process of commiting
14/04/28 22:28:39 INFO mapred.LocalJobRunner:
14/04/28 22:28:39 INFO mapred.Task: Task attempt_local1178427149_0001_m_000000_0 is allowed to commit now
14/04/28 22:28:39 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1178427149_0001_m_000000_0' to /tmp/mahout-work-scott/20news-testing
14/04/28 22:28:39 INFO mapred.LocalJobRunner:
14/04/28 22:28:39 INFO mapred.Task: Task 'attempt_local1178427149_0001_m_000000_0' done.
14/04/28 22:28:39 INFO mapred.LocalJobRunner: Finishing task: attempt_local1178427149_0001_m_000000_0
14/04/28 22:28:39 INFO mapred.LocalJobRunner: Map task executor complete.
14/04/28 22:28:39 INFO mapred.JobClient: map 100% reduce 0%
14/04/28 22:28:39 INFO mapred.JobClient: Job complete: job_local1178427149_0001
14/04/28 22:28:39 INFO mapred.JobClient: Counters: 12
14/04/28 22:28:39 INFO mapred.JobClient: File Output Format Counters
14/04/28 22:28:39 INFO mapred.JobClient: Bytes Written=1448327
14/04/28 22:28:39 INFO mapred.JobClient: File Input Format Counters
14/04/28 22:28:39 INFO mapred.JobClient: Bytes Read=8778008
14/04/28 22:28:39 INFO mapred.JobClient: FileSystemCounters
14/04/28 22:28:39 INFO mapred.JobClient: FILE_BYTES_READ=17727527
14/04/28 22:28:39 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6722613
14/04/28 22:28:39 INFO mapred.JobClient: Map-Reduce Framework
14/04/28 22:28:39 INFO mapred.JobClient: Map input records=7628
14/04/28 22:28:39 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
14/04/28 22:28:39 INFO mapred.JobClient: Spilled Records=0
14/04/28 22:28:39 INFO mapred.JobClient: Total committed heap usage (bytes)=31653888
14/04/28 22:28:39 INFO mapred.JobClient: CPU time spent (ms)=0
14/04/28 22:28:39 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
14/04/28 22:28:39 INFO mapred.JobClient: SPLIT_RAW_BYTES=125
14/04/28 22:28:39 INFO mapred.JobClient: Map output records=7628
14/04/28 22:28:40 INFO test.TestNaiveBayesDriver: Complementary Results:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 6815 89.3419%
Incorrectly Classified Instances : 813 10.6581%
Total Classified Instances : 7628

=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
291 0 0 0 0 0 0 0 0 1 0 0 1 1 4 6 1 2 12 1 | 320 a = alt.atheism
2 335 9 7 5 13 7 1 2 3 1 5 8 6 4 2 2 1 4 2 | 419 b = comp.graphics
0 12 296 26 7 14 10 0 0 4 5 3 5 8 3 0 1 2 0 0 | 396 c = comp.os.ms-windows.misc
0 11 10 275 15 9 15 3 1 1 2 1 10 4 3 1 2 4 1 1 | 369 d = comp.sys.ibm.pc.hardware
1 2 4 9 354 4 3 9 2 0 1 5 4 0 2 1 2 1 1 2 | 407 e = comp.sys.mac.hardware
2 6 3 4 1 341 1 1 2 0 0 6 0 0 5 0 1 0 0 0 | 373 f = comp.windows.x
0 5 5 28 4 6 292 18 9 2 9 2 13 10 4 2 1 1 0 2 | 413 g = misc.forsale
0 1 1 3 0 1 2 383 3 1 1 1 3 3 2 1 2 0 0 3 | 411 h = rec.autos
0 0 1 0 0 2 0 1 386 1 1 0 1 2 1 0 0 1 0 0 | 397 i = rec.motorcycles
0 0 0 0 0 0 2 0 1 370 9 0 1 1 0 2 0 0 0 0 | 386 j = rec.sport.baseball
0 0 0 0 0 0 0 0 0 3 386 0 0 0 0 0 0 0 0 0 | 389 k = rec.sport.hockey
0 0 1 0 0 1 0 0 0 1 0 394 2 0 2 1 0 0 0 1 | 403 l = sci.crypt
0 6 1 10 7 6 6 3 3 3 2 3 339 5 4 0 0 2 0 0 | 400 m = sci.electronics
0 0 2 2 0 2 0 2 0 1 2 0 2 360 5 0 1 0 0 0 | 379 n = sci.med
2 1 1 0 0 2 1 3 0 0 0 0 2 3 383 0 0 2 0 1 | 401 o = sci.space
4 0 0 1 0 0 0 0 0 0 0 0 1 2 0 394 2 0 0 0 | 404 p = soc.religion.christian
0 0 0 1 0 0 0 0 0 2 0 0 0 0 1 1 407 0 0 0 | 412 q = talk.politics.mideast
0 0 0 1 0 2 0 0 0 1 0 3 0 2 2 0 3 358 0 2 | 374 r = talk.politics.guns
25 1 0 0 0 0 0 0 0 0 1 1 0 2 2 25 0 9 189 9 | 264 s = talk.religion.misc
0 0 0 0 0 0 0 0 0 2 1 0 0 3 1 3 2 15 2 282 | 311 t = talk.politics.misc

=======================================================
Statistics
-------------------------------------------------------
Kappa 0.8607
Accuracy 89.3419%
Reliability 84.8609%
Reliability (standard deviation) 0.216

14/04/28 22:28:40 INFO driver.MahoutDriver: Program took 10135 ms (Minutes: 0.16891666666666666)

Running the 20newsgroups example in hadoop mode

Do not set the MAHOUT_LOCAL environment variable, and copy the contents of the /tmp/mahout-work-scott/20news-all directory to HDFS. To avoid editing the WORK_DIR variable in classify-20newsgroups.sh, the same path used for the local-mode data is reused here as the HDFS path; if your layout differs, adjust WORK_DIR accordingly.

scott@master:/tmp/mahout-work-scott/20news-all$ hdfs dfs -mkdir -p /tmp/mahout-work-scott
scott@master:/tmp/mahout-work-scott/20news-all$ hdfs dfs -put /tmp/mahout-work-scott/20news-all /tmp/mahout-work-scott/20news-all
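If MAHOUT_LOCAL was exported for the earlier local-mode run, remember to unset it in the current shell before launching the hadoop-mode run, and it is worth spot-checking that the upload reached HDFS. A minimal sketch (assuming MAHOUT_LOCAL was set in this shell and the default WORK_DIR of /tmp/mahout-work-scott):

# switch back to hadoop mode if MAHOUT_LOCAL was exported for the local run
unset MAHOUT_LOCAL
# spot-check that the 20 newsgroup directories are now on HDFS
hdfs dfs -ls /tmp/mahout-work-scott/20news-all | head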

Strictly speaking, the steps above are not required: the classify-20newsgroups.sh script creates the directories and copies the files to HDFS by itself.

Quite a lot of data directories have to be copied, so be patient; then run classify-20newsgroups.sh.

The output is:

scott@master:/opt/mahout-0.9/examples/bin$ ./classify-20newsgroups.sh 
Please select a number to choose the corresponding task to run
1. cnaivebayes
2. naivebayes
3. sgd
4. clean -- cleans up the work area in /tmp/mahout-work-scott
Enter your choice : 1 ## choose 1 here
ok. You chose 1 and we'll use cnaivebayes
creating work directory at /tmp/mahout-work-scott
+ echo 'Preparing 20newsgroups data'
Preparing 20newsgroups data
+ rm -rf /tmp/mahout-work-scott/20news-all
+ mkdir /tmp/mahout-work-scott/20news-all
+ cp -R /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/alt.atheism /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.graphics /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.os.ms-windows.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.sys.ibm.pc.hardware /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.sys.mac.hardware /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/comp.windows.x /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/misc.forsale /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/rec.autos /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/rec.motorcycles /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/rec.sport.baseball /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/rec.sport.hockey /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/sci.crypt /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/sci.electronics /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/sci.med /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/sci.space /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/soc.religion.christian /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/talk.politics.guns /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/talk.politics.mideast /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/talk.politics.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-test/talk.religion.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/alt.atheism /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.graphics /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.os.ms-windows.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.sys.ibm.pc.hardware /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.sys.mac.hardware /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/comp.windows.x /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/misc.forsale /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/rec.autos /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/rec.motorcycles /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/rec.sport.baseball /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/rec.sport.hockey /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/sci.crypt /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/sci.electronics /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/sci.med /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/sci.space /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/soc.religion.christian /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/talk.politics.guns /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/talk.politics.mideast /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/talk.politics.misc /tmp/mahout-work-scott/20news-bydate/20news-bydate-train/talk.religion.misc /tmp/mahout-work-scott/20news-all
+ '[' /opt/hadoop-2.2.0 '!=' '' ']'
+ '[' '' == '' ']'
+ echo 'Copying 20newsgroups data to HDFS'
Copying 20newsgroups data to HDFS
+ set +e
+ /opt/hadoop-2.2.0/bin/hadoop dfs -rmr /tmp/mahout-work-scott/20news-all
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

rmr: DEPRECATED: Please use 'rm -r' instead.
14/04/28 22:58:58 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /tmp/mahout-work-scott/20news-all
+ set -e
+ /opt/hadoop-2.2.0/bin/hadoop dfs -put /tmp/mahout-work-scott/20news-all /tmp/mahout-work-scott/20news-all
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

(intermediate output omitted ...)

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
at org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:103)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.main(TrainNaiveBayesJob.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

The run fails at the end with Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected (a familiar exception caused by a Hadoop version mismatch: JobContext is a class in Hadoop 1.x but became an interface in Hadoop 2.x). This shows that the mahout-0.9 binary distribution does not support hadoop-2.2.0; indeed, the /opt/mahout-0.9/lib/hadoop directory contains hadoop-core-1.2.1.jar. The only option left is to build Mahout ourselves.
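A quick way to confirm which Hadoop the binary release was compiled against is to list the jars it bundles (the jar shown in the comment is the one reported above; this assumes MAHOUT_HOME=/opt/mahout-0.9 as configured earlier):

ls $MAHOUT_HOME/lib/hadoop/
# hadoop-core-1.2.1.jar -> compiled against the Hadoop 1.x (old mapred) API, hence the JobContext error on 2.2.0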

In a follow-up post I will build mahout-0.9 from source with hadoop-2.2.0 support. Stay tuned!

References

http://samchu.logdown.com/posts/192574-mahout-09-installation-verification-records
http://f.dataguru.cn/thread-190361-1-7.html
http://blog.csdn.net/fansy1990/article/details/23261633