[Repost] Four Ways to Influence How Lucene Scores Documents

2012-11-14

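Set the Document Boost and Field Boost at index time; they are stored in the (.nrm) file.

If some documents or fields should matter more than others, so that a document or field containing the query term scores higher, you can set a boost on the document and on the field at index time.
These values are written into the index at index time and stored in the normalization factor (.nrm) file. Once set, they cannot be changed unless the document is deleted.
If they are not set, Document Boost and Field Boost both default to 1.
Document Boost and Field Boost are set as follows: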

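import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Lucene 2.4 to 3.x API (Document.setBoost was removed in 4.0).
Document doc = new Document();
Field f = new Field("contents", "hello world", Field.Store.NO, Field.Index.ANALYZED);
f.setBoost(100f);   // field-level boost
doc.add(f);
doc.setBoost(100f); // document-level boost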

How do the two boosts affect Lucene's document scoring?
Let us first look at Lucene's scoring formula:

score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Document Boost and Field Boost affect norm(t,d), which is defined as:

norm(t,d) = doc.getBoost() · lengthNorm(field) · ∏_{field f in d named as t} f.getBoost()

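It has three factors:

Document boost: the larger this value, the more important the document.
Field boost: the larger this value, the more important the field.
lengthNorm(field) = (1.0 / Math.sqrt(numTerms)): the more terms a field contains (that is, the longer the document), the smaller this value; the shorter the document, the larger this value.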
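The three factors above can be worked through numerically. The following is a minimal sketch, not Lucene code: the class NormSketch and its methods are hypothetical names, and it ignores the lossy single-byte encoding Lucene applies when it stores norms in the index.

```java
public class NormSketch {
    // lengthNorm(field) = 1 / sqrt(numTerms), as given in the text.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // norm(t,d) = docBoost * lengthNorm * product of the boosts of
    // every field instance in d that is named like the term's field.
    static double norm(double docBoost, int numTerms, double... fieldBoosts) {
        double product = 1.0;
        for (double b : fieldBoosts) {
            product *= b;
        }
        return docBoost * lengthNorm(numTerms) * product;
    }

    public static void main(String[] args) {
        // Default boosts, field with 4 terms: 1 * (1/2) * 1
        System.out.println(norm(1.0, 4, 1.0));     // prints 0.5
        // Both boosts set to 100, same field: 100 * (1/2) * 100
        System.out.println(norm(100.0, 4, 100.0)); // prints 5000.0
    }
}
```

With the same field length, setting both boosts to 100 multiplies the norm by 10,000, which illustrates why the two boosts compound rather than add.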

[Repost] Mathematical Derivation of the Lucene Scoring Formula

2012-11-14

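Before analyzing Lucene's search process, it is worth devoting a separate chapter to deriving the Lucene score formula and explaining what each part means, because a key step of the search process is computing the score piece by piece.

The Lucene scoring formula is quite complex, as follows:

Before the derivation, let us go through the meaning of each part: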

The Four Binding Styles in SOAP

2012-09-18

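Because SOAP's standardization process was fairly short and was driven by de facto standards, WSDL 1.1 ended up with four binding styles. The WSDL generated by each style differs in subtle ways. Understanding those differences helps a great deal when generating or calling a Web service; otherwise you may find that a WSDL someone else generated cannot be invoked dynamically, with no idea why.

The four styles are: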

RPC/Encoded
RPC/Literal
Document/Encoded
Document/Literal(Wrapper)

These four styles are really the combinations of two attributes, Style and Use.

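RPC Style: the SOAP body contains an XML node that represents the Web service call. The node is named after the method being invoked, and its children are the method's parameters, in order.
Document Style: the SOAP body contains the message directly, with no SOAP formatting rules imposed on it. The server's application layer is responsible for mapping the XML document to in-memory objects (parameters, method calls, and so on).
Encoded Use: the XML message uses type attributes to reference abstract data types, and is serialized and deserialized with Section 5 encoding (the encoding defined in Section 5 of the SOAP specification).
Literal Use: the XML message uses type attributes or element declarations to reference a concrete Schema definition; in other words, in-memory objects are serialized to XML according to a concrete Schema.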
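To make the RPC style and the two Use values more concrete, here is a rough sketch of what an RPC-style body might look like on the wire. Everything in it is an assumption for illustration: the class SoapBodySketch, the operation name add, and the parameters a and b are hypothetical, namespaces are omitted, and a real toolkit would generate these messages from the WSDL rather than by string concatenation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SoapBodySketch {
    enum Use { ENCODED, LITERAL }

    // RPC style: one wrapper element named after the operation,
    // with one child element per parameter, in order.
    static String rpcBody(String operation, Map<String, Integer> params, Use use) {
        StringBuilder sb = new StringBuilder("<soap:Body><" + operation + ">");
        for (Map.Entry<String, Integer> p : params.entrySet()) {
            // Encoded use typically carries type information on each value;
            // literal use relies on the Schema referenced by the WSDL instead.
            String typeAttr = (use == Use.ENCODED) ? " xsi:type=\"xsd:int\"" : "";
            sb.append("<").append(p.getKey()).append(typeAttr).append(">")
              .append(p.getValue())
              .append("</").append(p.getKey()).append(">");
        }
        return sb.append("</").append(operation).append("></soap:Body>").toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> params = new LinkedHashMap<>();
        params.put("a", 1);
        params.put("b", 2);
        System.out.println(rpcBody("add", params, Use.LITERAL));
        System.out.println(rpcBody("add", params, Use.ENCODED));
    }
}
```

In a document-style message the wrapper named after the operation would not be required; the body would instead hold whatever element the Schema defines for that message.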

[Repost] A successful Git branching model

2012-08-18

In this post I present the development model that I’ve introduced for all of my projects (both at work and private) about a year ago, and which has turned out to be very successful. I’ve been meaning to write about it for a while now, but I’ve never really found the time to do so thoroughly, until now. I won’t talk about any of the projects’ details, merely about the branching strategy and release management.

It focuses around Git as the tool for the versioning of all of our source code.

Why git?

For a thorough discussion on the pros and cons of Git compared to centralized source code control systems, see the web. There are plenty of flame wars going on there. As a developer, I prefer Git above all other tools around today. Git really changed the way developers think of merging and branching. From the classic CVS/Subversion world I came from, merging/branching has always been considered a bit scary (“beware of merge conflicts, they bite you!”) and something you only do every once in a while.
