2010-07-31

Term handling in Xapian indexing

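Xapian is a full-text search engine written in C++ and licensed under the GPL; it plays a role similar to Lucene in the Java world. Its official website is http://xapian.org/, and the project uses trac for project management. The Xapian documentation covers more detail. Xapian's indexer does not support Chinese word segmentation, neither single-character nor n-gram segmentation; Chinese text is processed the same way as English. So let's look at how Xapian indexes an English document.

First, meet an iterator, Utf8Iterator (unicode.h), which plays a very important role during indexing. It has three private members: p points to the current element of the character array, end points one past the last element of the array, and seqlen holds the number of bytes occupied by the current character: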

const unsigned char *p;
const unsigned char *end;
mutable unsigned seqlen;

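Utf8Iterator also has four very important overloaded operators, which are called many times during indexing. The * operator returns the Unicode code point of the current character, ++ advances the pointer to the next character, and == and != test whether two iterators are equal: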

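unsigned Utf8Iterator::operator*() const {
    // An exhausted iterator has p == NULL and yields unsigned(-1).
    if (p == NULL) return unsigned(-1);
    // seqlen is computed lazily and cached (it is declared mutable).
    if (seqlen == 0) calculate_sequence_length();
    unsigned char ch = *p;
    // One byte: plain ASCII.
    if (seqlen == 1) return ch;
    // Two bytes: 5 payload bits from the lead byte, 6 from the next.
    if (seqlen == 2) return ((ch & 0x1f) << 6) | (p[1] & 0x3f);
    // Three bytes: 4 + 6 + 6 payload bits.
    if (seqlen == 3)
        return ((ch & 0x0f) << 12) | ((p[1] & 0x3f) << 6) | (p[2] & 0x3f);
    // Four bytes: 3 + 6 + 6 + 6 payload bits.
    return ((ch & 0x07) << 18) | ((p[1] & 0x3f) << 12) |
           ((p[2] & 0x3f) << 6) | (p[3] & 0x3f);
}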
Utf8Iterator & operator++() {
    if (seqlen == 0) calculate_sequence_length();
    p += seqlen;
    if (p == end) p = NULL;
    seqlen = 0;
    return *this;
}
bool operator==(const Utf8Iterator &other) const { return p == other.p; }
bool operator!=(const Utf8Iterator &other) const { return p != other.p; }
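
To make the bit manipulation in operator* easier to follow, here is a minimal standalone sketch that decodes a whole UTF-8 string into code points with the same masks and shifts. It is not Xapian code: the names sequence_length and decode_utf8 are made up for this illustration, and it assumes well-formed input (Xapian's own calculate_sequence_length() is not shown in this post and may treat invalid bytes differently).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: byte length of the UTF-8 sequence starting with lead
// byte ch. Assumes well-formed input; any unexpected byte counts as length 1.
static unsigned sequence_length(unsigned char ch) {
    if (ch < 0x80) return 1;            // 0xxxxxxx
    if ((ch & 0xe0) == 0xc0) return 2;  // 110xxxxx
    if ((ch & 0xf0) == 0xe0) return 3;  // 1110xxxx
    if ((ch & 0xf8) == 0xf0) return 4;  // 11110xxx
    return 1;
}

// Hypothetical helper: decode a UTF-8 string into Unicode code points,
// using the same masks and shifts as Utf8Iterator::operator*.
std::vector<unsigned> decode_utf8(const std::string &text) {
    std::vector<unsigned> out;
    const unsigned char *p =
        reinterpret_cast<const unsigned char *>(text.data());
    const unsigned char *end = p + text.size();
    while (p < end) {
        unsigned seqlen = sequence_length(*p);
        // Stop on a truncated trailing sequence instead of reading past end.
        if (static_cast<std::size_t>(end - p) < seqlen) break;
        unsigned ch = *p;
        unsigned cp;
        if (seqlen == 1) {
            cp = ch;
        } else if (seqlen == 2) {
            cp = ((ch & 0x1f) << 6) | (p[1] & 0x3f);
        } else if (seqlen == 3) {
            cp = ((ch & 0x0f) << 12) | ((p[1] & 0x3f) << 6) | (p[2] & 0x3f);
        } else {
            cp = ((ch & 0x07) << 18) | ((p[1] & 0x3f) << 12) |
                 ((p[2] & 0x3f) << 6) | (p[3] & 0x3f);
        }
        out.push_back(cp);
        p += seqlen;  // same step as operator++
    }
    return out;
}
```

For example, the three bytes E4 B8 AD (the character 中) decode to U+4E2D: 0x04 shifted left by 12, 0x38 shifted left by 6, and 0x2D, OR-ed together.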

Xapian ultimately processes a string or character array by calling the index_text() method in termgenerator_internal.cc:

index_text(Utf8Iterator(text), weight, prefix);

Xapian's handling of a string or character array during indexing breaks down into four main stages:

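First, it checks whether itor (a Utf8Iterator) is empty, that is, exhausted, and then whether *itor is a word character. If *itor is a word character, the loop exits and processing continues below: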
```
while (true) {
   if (itor == Utf8Iterator()) return;
   ch = check_wordchar(*itor);
   if (ch) break;
   ++itor;
}
```
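Next, a loop appends characters one at a time to build the term: characters are read and appended until the next one is not a word character. For example, while extracting the term "apple", term takes the values "a", "ap", "app", "appl", "apple" in turn.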
```
do {
    Unicode::append_utf8(term, ch);
    prevch = ch;
    if (++itor == Utf8Iterator()) goto endofterm;
    ch = check_wordchar(*itor);
} while (ch);
```
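
The first two stages can be tried out in isolation with a small sketch like the one below. It is a simplification, not Xapian's implementation: is_wordchar is a hypothetical stand-in for check_wordchar() that works on bytes, treats ASCII letters and digits as word characters, and (as an assumption made here) treats every non-ASCII byte as a word character so that multi-byte text stays in one piece. The name split_terms is also made up for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for check_wordchar(): returns ch when it counts as
// a word character and 0 otherwise. ASCII letters/digits and all non-ASCII
// bytes count as word characters in this simplified version.
static unsigned char is_wordchar(unsigned char ch) {
    if (ch >= 0x80 || std::isalnum(ch)) return ch;
    return 0;
}

// Split text into terms using the two loops described above.
std::vector<std::string> split_terms(const std::string &text) {
    std::vector<std::string> terms;
    const std::size_t n = text.size();
    std::size_t i = 0;
    while (true) {
        // Stage 1: skip ahead to the first word character, or stop when
        // the input is exhausted.
        while (i < n && !is_wordchar(static_cast<unsigned char>(text[i]))) ++i;
        if (i == n) return terms;
        // Stage 2: append characters until the next one is not a word
        // character (or the input ends).
        std::string term;
        while (i < n && is_wordchar(static_cast<unsigned char>(text[i]))) {
            term += text[i];
            ++i;
        }
        terms.push_back(term);
    }
}
```

Calling split_terms("red apple, green-pear!") yields the four terms red, apple, green and pear; the comma, hyphen, spaces and exclamation mark only act as separators.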

Then the term is indexed:

if (with_positions) {
    doc.add_posting(prefix + term, ++termpos, weight);
} else {
    doc.add_term(prefix + term, weight);
}

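Finally, the term is stemmed using a stemmer written in the Snowball language. Roughly speaking, this links footballs to football at index time, so that a query for football also returns documents containing footballs.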
```
string stem("Z");
stem += prefix;
stem += stemmer(term);
doc.add_term(stem, weight);
```
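
To see which terms end up in a document, the sketch below replays the last two stages against a toy document. Everything in it is a stand-in invented for illustration: ToyDocument is not Xapian::Document, toy_stem is not a Snowball stemmer (it only strips one trailing "s"), and index_term is a hypothetical wrapper. Only the "Z" + prefix + stem construction is taken from the snippet above.

```cpp
#include <map>
#include <string>

// Toy stand-in for a Snowball stemmer (hypothetical): strips one trailing
// "s". A real stemmer applies far more rules.
static std::string toy_stem(const std::string &term) {
    if (term.size() > 1 && term.back() == 's')
        return term.substr(0, term.size() - 1);
    return term;
}

// Toy stand-in for a document: maps each term to its accumulated weight.
struct ToyDocument {
    std::map<std::string, unsigned> terms;
    void add_term(const std::string &t, unsigned weight) { terms[t] += weight; }
};

// Index one term: first the term itself, then its stem marked with "Z".
void index_term(ToyDocument &doc, const std::string &prefix,
                const std::string &term, unsigned weight) {
    doc.add_term(prefix + term, weight);
    std::string stem("Z");
    stem += prefix;
    stem += toy_stem(term);
    doc.add_term(stem, weight);
}
```

Indexing footballs and football with an empty prefix and weight 1 leaves three entries in the toy document: footballs, football, and Zfootball with weight 2. The shared Zfootball entry is what lets a stemmed query for football match both word forms.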

In short, Xapian tokenizes by splitting the string wherever the next character is not a word character (a space, a punctuation mark, and so on).
