directoryentry(mahout面试题)

之前看了Mahout官方示例 20news 的调用实现；于是想根据示例的流程实现其他例子。网上看到了一个关于天气适不适合打羽毛球的例子。

训练数据：

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Weak Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

检测数据：

sunny，hot，high，weak

结果：

Yes=》 0.007039

No=》 0.027418

于是使用Java代码调用Mahout的工具类实现分类。

基本思想：

1. 构造分类数据。

2. 使用Mahout工具类进行训练，得到训练模型。

3。将要检测数据转换成vector数据。

4. 分类器对vector数据进行分类。

接下来贴下我的代码实现=》

1. 构造分类数据：

在hdfs主要创建一个文件夹路径 /zhoujainfeng/playtennis/input 并将分类文件夹 no 和 yes 的数据传到hdfs上面。

数据文件格式，如D1文件内容： Sunny Hot High Weak

2. 使用Mahout工具类进行训练，得到训练模型。

3。将要检测数据转换成vector数据。

4. 分类器对vector数据进行分类。

这三步，代码我就一次全贴出来；主要是两个类 PlayTennis1 和 BayesCheckData = =》

package myTesting.bayes;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.util.ToolRunner;

import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;

import org.apache.mahout.text.SequenceFilesFromDirectory;

import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class PlayTennis1 {

private static final String WORK_DIR = "hdfs://192.168.9.72:9000/zhoujianfeng/playtennis";

* 测试代码

public static void main(String[] args) {

//将训练数据转换成 vector数据

makeTrainVector();

//产生训练模型

makeModel(false);

//测试检测数据

BayesCheckData.printResult();

}

public static void makeCheckVector(){

//将测试数据转换成序列化文件

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"testinput";

String output = WORK_DIR+Path.SEPARATOR+"tennis-test-seq";

Path in = new Path(input);

Path out = new Path(output);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

//boolean参数是，是否递归删除的意思

fs.delete(out, true);

}

SequenceFilesFromDirectory sffd = new SequenceFilesFromDirectory();

String[] params = new String[]{"-i",input,"-o",output,"-ow"};

ToolRunner.run(sffd, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("文件序列化失败！");

System.exit(1);

}

//将序列化文件转换成向量文件

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"tennis-test-seq";

String output = WORK_DIR+Path.SEPARATOR+"tennis-test-vectors";

Path in = new Path(input);

Path out = new Path(output);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

//boolean参数是，是否递归删除的意思

fs.delete(out, true);

}

SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();

String[] params = new String[]{"-i",input,"-o",output,"-lnorm","-nv","-wt","tfidf"};

ToolRunner.run(svfsf, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("序列化文件转换成向量失败！");

System.out.println(2);

}

public static void makeTrainVector(){

//将测试数据转换成序列化文件

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"input";

String output = WORK_DIR+Path.SEPARATOR+"tennis-seq";

Path in = new Path(input);

Path out = new Path(output);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

//boolean参数是，是否递归删除的意思

fs.delete(out, true);

}

SequenceFilesFromDirectory sffd = new SequenceFilesFromDirectory();

String[] params = new String[]{"-i",input,"-o",output,"-ow"};

ToolRunner.run(sffd, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("文件序列化失败！");

System.exit(1);

}

//将序列化文件转换成向量文件

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"tennis-seq";

String output = WORK_DIR+Path.SEPARATOR+"tennis-vectors";

Path in = new Path(input);

Path out = new Path(output);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

//boolean参数是，是否递归删除的意思

fs.delete(out, true);

}

SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();

String[] params = new String[]{"-i",input,"-o",output,"-lnorm","-nv","-wt","tfidf"};

ToolRunner.run(svfsf, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("序列化文件转换成向量失败！");

System.out.println(2);

}

public static void makeModel(boolean completelyNB){

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"tennis-vectors"+Path.SEPARATOR+"tfidf-vectors";

String model = WORK_DIR+Path.SEPARATOR+"model";

String labelindex = WORK_DIR+Path.SEPARATOR+"labelindex";

Path in = new Path(input);

Path out = new Path(model);

Path label = new Path(labelindex);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

//boolean参数是，是否递归删除的意思

fs.delete(out, true);

}

if(fs.exists(label)){

//boolean参数是，是否递归删除的意思

fs.delete(label, true);

}

TrainNaiveBayesJob tnbj = new TrainNaiveBayesJob();

String[] params =null;

if(completelyNB){

params = new String[]{"-i",input,"-el","-o",model,"-li",labelindex,"-ow","-c"};

}else{

params = new String[]{"-i",input,"-el","-o",model,"-li",labelindex,"-ow"};

}

ToolRunner.run(tnbj, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("生成训练模型失败！");

System.exit(3);

}

package myTesting.bayes;

import java.io.IOException;

import java.util.HashMap;

import java.util.Map;

import org.apache.commons.lang.StringUtils;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.fs.PathFilter;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.mahout.classifier.naivebayes.BayesUtils;

import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;

import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;

import org.apache.mahout.common.Pair;

import org.apache.mahout.common.iterator.sequencefile.PathType;

import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;

import org.apache.mahout.math.RandomAccessSparseVector;

import org.apache.mahout.math.Vector;

import org.apache.mahout.math.Vector.Element;

import org.apache.mahout.vectorizer.TFIDF;

import com.google.common.collect.ConcurrentHashMultiset;

import com.google.common.collect.Multiset;

public class BayesCheckData {

private static StandardNaiveBayesClassifier classifier;

private static Map<String, Integer> dictionary;

private static Map<Integer, Long> documentFrequency;

private static Map<Integer, String> labelIndex;

public void init(Configuration conf){

try {

String modelPath = "/zhoujianfeng/playtennis/model";

String dictionaryPath = "/zhoujianfeng/playtennis/tennis-vectors/dictionary.file-0";

String documentFrequencyPath = "/zhoujianfeng/playtennis/tennis-vectors/df-count";

String labelIndexPath = "/zhoujianfeng/playtennis/labelindex";

dictionary = readDictionnary(conf, new Path(dictionaryPath));

documentFrequency = readDocumentFrequency(conf, new Path(documentFrequencyPath));

labelIndex = BayesUtils.readLabelIndex(conf, new Path(labelIndexPath));

NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), conf);

classifier = new StandardNaiveBayesClassifier(model);

} catch (IOException e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("检测数据构造成vectors初始化时报错。。。。");

System.exit(4);

}

/**

* 加载字典文件，Key: TermValue； Value：TermID

* @param conf

* @param dictionnaryDir

* @return

private static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryDir) {

Map<String, Integer> dictionnary = new HashMap<String, Integer>();

PathFilter filter = new PathFilter() {

@Override

public boolean accept(Path path) {

String name = path.getName();

return name.startsWith("dictionary.file");

}

};

for (Pair<Text, IntWritable> pair : new SequenceFileDirIterable<Text, IntWritable>(dictionnaryDir, PathType.LIST, filter, conf)) {

dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());

}

return dictionnary;

}

/**

* 加载df-count目录下TermDoc频率文件，Key: TermID； Value：DocFreq

* @param conf

* @param dictionnaryDir

* @return

private static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyDir) {

Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();

PathFilter filter = new PathFilter() {

@Override

public boolean accept(Path path) {

return path.getName().startsWith("part-r");

}

};

for (Pair<IntWritable, LongWritable> pair : new SequenceFileDirIterable<IntWritable, LongWritable>(documentFrequencyDir, PathType.LIST, filter, conf)) {

documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());

}

return documentFrequency;

}

public static String getCheckResult(){

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String classify = "NaN";

BayesCheckData cdv = new BayesCheckData();

cdv.init(conf);

System.out.println("init done...............");

Vector vector = new RandomAccessSparseVector(10000);

TFIDF tfidf = new TFIDF();

//sunny，hot，high，weak

Multiset<String> words = ConcurrentHashMultiset.create();

words.add("sunny",1);

words.add("hot",1);

words.add("high",1);

words.add("weak",1);

int documentCount = documentFrequency.get(-1).intValue(); // key=-1时表示总文档数

for (Multiset.Entry<String> entry : words.entrySet()) {

String word = entry.getElement();

int count = entry.getCount();

Integer wordId = dictionary.get(word); // 需要从dictionary.file-0文件（tf-vector）下得到wordID，

if (StringUtils.isEmpty(wordId.toString())){

continue;

}

if (documentFrequency.get(wordId) == null){

continue;

}

Long freq = documentFrequency.get(wordId);

double tfIdfValue = tfidf.calculate(count, freq.intValue(), 1, documentCount);

vector.setQuick(wordId, tfIdfValue);

}

// 利用贝叶斯算法开始分类,并提取得分最好的分类label

Vector resultVector = classifier.classifyFull(vector);

double bestScore = -Double.MAX_VALUE;

int bestCategoryId = -1;

for(Element element: resultVector.all()) {

int categoryId = element.index();

double score = element.get();

System.out.println("categoryId:"+categoryId+" score:"+score);

if (score > bestScore) {

bestScore = score;

bestCategoryId = categoryId;

}

classify = labelIndex.get(bestCategoryId)+"(categoryId="+bestCategoryId+")";

return classify;

}

public static void printResult(){

System.out.println("检测所属类别是："+getCheckResult());

}

释义：年相对价位指标;年相对价位;偏航角;离子

When the entry is not present, temporary files are created in the project directory where the. rpy file is stored.

当条目没有显现时，临时文件会在.rpy文件存储时在项目目录中得到创建。

关于all的短语有：

1、in all

总共 ; 总计 ; 共计 ; 总之

双语例句：

You can use the logo in all your promotional material.

你们可以将这个标志用在所有的宣传材料中。

2、select all

全选 ; 选择所有 ; 全部选择 ; 选择全部

双语例句：

Select All users and All logged in users.

选择所有用户和所有登录用户。

3、After all

毕竟 ; 究竟 ; 虽然这样 ; 终于

双语例句：

We still can't quite believe he's here with us after all this time

我们还是不太敢相信，过了这么久，他居然还和我们一起呆在这里。

4、all thumbs

笨手笨脚的 ; 一窍不通的 ; 满手都是大拇指 ; 手太笨的

双语例句：

I'm all thumbs this morning. I forgot my keys in mybedroom.

今天早上我笨手笨脚的。我把钥匙忘在我卧室了。

5、all round

周围 ; 处处 ; 全能的 ; 各方面

双语例句：

All round us was desert

我们周围全是沙漠。

6、Export All

导出所有 ; 导出全部 ; 全部导出 ; 出口所有

双语例句：

Export all directory entries in LDIF add entry format.

导出所有ldif添加条目格式的目录条目。

7、All times

古今中外 ; 总是 ; 一直 ; 随时

双语例句：

At all times you can change between contour and pointmode.

在任何时候都可以更改的轮廓和点模式。

8、Search All

全智能搜索 ; 全项搜索 ; 全项搜寻 ; 全项搜刮

双语例句：

All index data services can search all index files.

所有索引数据服务都可以搜索所有索引文件。

9、all together

一起 ; 总共 ; 同时 ; 总计

双语例句：

We have invited fifty people altogether.

我们共邀请了五十人

首先简述自己的系统配置：win8+ ubuntu14.04

linuxQQ 有各种版本，这里介绍两种：linuxQQ 和 wineQQ

1 ------linuxqq是QQ简化版，功能很少，界面很差，但是安装简单

下载地址：http://im.qq.com/qq/linux/ 可以选择对听版本的系统以及QQ 。这里建议下载tar.gz的版本，然后解压，执行./QQ就搞定了，很简单吧。

下载后运行命令： tar xzvf ************.tar.gz ////**号代表你下载的文件名称

然后进入对应的的解压好的文件里面执行命令： ./qq 就可以登陆QQ了，ok！！！

2------wineQQ基本上是和windows下QQ 是同步的。功能相当齐全。当然我是建议你安装这个类型的。

下载地址： http://www.longene.org/download/WineQQ2013-20131120-Longene.deb 这只是我编辑此文章的时候最新版本，如果你自己想获取最新版本，可以在此可以下载到最新版本的wineqq版本：下载地址为： http://www.longene.org/download/ 在列表里面寻找最新版本的吧！！！这种哦功能类型名字的-WineQQ2013SP6-20140102-Longene.deb 下载到自己知道的目录下

下载后在下载目录里面运行命令：sudo dpkg -i WineQQ2013SP6-20140102-Longene.deb

64位系统还需要运行以下命令：sudo apt-get install ia32-libs

3------运行

一般默认安装在/opt/longene/qq2014文件夹里面。进入这个目录，将QQ2013图表拖放到桌面或者侧边栏，就可以随心所欲的使用wineqq 了，ok！！！

如果你找不到安装的位置，可以搜索一下‘qq2012’，寻找目录，找到快捷方式。ok！！！

4------卸载

运行以下命令：

dpkg -r qq-for-wine

5 -----卸载wineqq

sudo dpkg --purge wine-qq2014-longeneteam

(正在读取数据库 ... 系统当前共安装有 198829 个文件和目录。)正在卸载 wine-qq2014-longeneteam ...* Backup QQ users directory to your HOME.......* Backup done!* ldconfig....* Remove the entry from system menu......* Remove desktop icon......正在清除 wine-qq2012-longeneteam 的配置文件 ...ls: 无法访问/opt/longene: 没有那个文件或目录* Removing /opt/longene...* Done.正在处理用于 bamfdaemon 的触发器...Rebuilding /usr/share/applications/bamf.index...正在处理用于 desktop-file-utils 的触发器...正在处理用于 gnome-menus 的触发器...

申明：盗墓 ....

directoryentry(mahout面试题)

相关阅读

发表评论取消回复

还没有评论，来说两句吧...