Zend_Search and UTF-8 support

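Today I saw some good news: Lucene in Zend Framework finally supports UTF-8. I had wanted to use it for on-site search before, but it could not build UTF-8 indexes. I have not really tried it yet, so I do not know how well it works. If you need Lucene, you can download the latest version from SVN. The original email follows: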

Hi all,

Zend_Search_Lucene has some improvements in encoding management and
UTF-8 support.

1. Encoding of indexed documents.

It's no longer necessary to transform text into 'ascii//translit' before
indexing.

Field creation methods now have an optional $encoding parameter (excluding
the ::Binary() method).
It may differ for different documents as well as for different fields
within one document:
$doc = new Zend_Search_Lucene_Document();
// The encoding arguments below are example values.
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents, 'utf-8'));

If encoding parameter is omitted, then current locale is used at
processing time.
Current locale may also contain character set info:
setlocale(LC_ALL, 'de_DE.iso-8859-1');
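
To make the "character set info" part concrete, here is a small sketch of how the charset can be read off a locale name. locale_charset() is a hypothetical helper written for this illustration; it is not part of Zend_Search_Lucene and is not necessarily how the library determines the encoding.

```php
<?php
// Hypothetical helper: extract the character set part of a POSIX-style
// locale name of the form language[_territory][.codeset][@modifier].
function locale_charset($locale)
{
    if (preg_match('/\.([^@]+)/', $locale, $m)) {
        return $m[1];
    }
    return null; // names such as "C" or "de_DE" carry no charset
}

echo locale_charset('de_DE.iso-8859-1'), "\n"; // iso-8859-1

// setlocale() returns false when the requested locale is not installed,
// so the result is worth checking before relying on it.
$active = setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
var_dump($active);
```

If setlocale() returns false here, the locale is not installed and the charset stays whatever it was before, so passing $encoding explicitly is probably the safer choice.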

Fields are always stored and returned from index in UTF-8 encoding.
Conversion to UTF-8 proceeds automatically.
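
As an illustration of what such a conversion amounts to, the sketch below converts ISO-8859-1 text to UTF-8 with PHP's iconv(). to_utf8() is a hypothetical helper and the iconv extension is assumed to be available; the library does this step for you, so this only shows the effect.

```php
<?php
// Hypothetical helper (not part of Zend_Search_Lucene): convert text from a
// known source encoding to UTF-8.
function to_utf8($text, $encoding)
{
    $converted = iconv($encoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new InvalidArgumentException("Cannot convert from $encoding");
    }
    return $converted;
}

// "\xFC" is the single ISO-8859-1 byte for u-umlaut.
$title = to_utf8("M\xFCller", 'ISO-8859-1');
echo bin2hex($title), "\n"; // 4dc3bc6c6c6572 (u-umlaut became the two bytes c3 bc)
```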

Text analyzers may also convert text to some other encodings. Actually,
default analyzer converts text to 'ASCII//TRANSLIT' encoding.
Such translation may be affected by current locale.
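
The locale dependence can be observed with iconv() directly. This is a sketch only: to_ascii_translit() is a hypothetical helper, and the exact output differs between iconv implementations and installed locales, so no expected output is shown.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to plain ASCII with iconv.
// The result depends on the iconv implementation and the LC_CTYPE locale;
// it is false if the conversion fails outright.
function to_ascii_translit($utf8)
{
    // @ suppresses the notice iconv raises for characters it cannot map.
    return @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
}

// Has no effect (returns false) if this locale is not installed.
setlocale(LC_CTYPE, 'de_DE.UTF-8');

var_dump(to_ascii_translit("M\u{00FC}ller"));
```

Running it under different locales is a quick way to see whether the default analyzer would index the same tokens on your development and production machines.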

2. Search results encoding.

Search results (stored fields) are always returned in UTF-8.

3. Query string encoding.

Zend_Search_Lucene_Search_QueryParser::parse() method now has an optional
encoding parameter. It's used to specify the query string encoding:
// 'iso-8859-5' is an example value.
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');
If encoding is omitted, then current locale is used.

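It's also possible to specify the default query string encoding with
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding() method:
// 'utf-8' is an example value.
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);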

Zend_Search_Lucene_Search_QueryParser::getDefaultEncoding() returns
current default query string encoding (empty string means "current locale").

4. Limited functionality UTF-8 analyzer.

A new text analyzer with UTF-8 support is provided with Zend_Search_Lucene.

It can be turned on with:
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

This analyzer tokenizes data for indexing in UTF-8 mode and has no
problems with umlauts, Cyrillic, Arabic, Chinese and any other
characters that can be represented in UTF-8.

It has two limitations:
– utf-8 analyzer treats all non-ascii characters as letters (it's not
always true).
– it's case-sensitive.

Because of these limitations it's not set as default.
But it may be helpful for someone.

Case insensitivity may be emulated with strtolower() conversion:
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

$doc = new Zend_Search_Lucene_Document();

// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title', strtolower($title)));

// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));

The same conversion has to be performed with the query string:
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

$hits = $index->find(strtolower($query));
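
One caveat when the text is already UTF-8: strtolower() works on single bytes (and from PHP 8.2 on it only changes ASCII letters), so it will not lower-case multi-byte characters such as "Ü". A possible alternative, sketched here under the assumption that the mbstring extension is installed, is to apply mb_strtolower() identically at index time and at query time. normalize_case() is a hypothetical helper, not part of Zend_Search_Lucene.

```php
<?php
// Hypothetical helper: lower-case UTF-8 text the same way at index time
// and at query time. Requires the mbstring extension.
function normalize_case($text)
{
    return mb_strtolower($text, 'UTF-8');
}

$title = "\u{00DC}BER Zend_Search"; // upper-case U-umlaut followed by "BER ..."
$query = "\u{00FC}ber";             // the same word in lower case

$indexed = normalize_case($title);  // value to pass to Field::UnStored()
$needle  = normalize_case($query);  // value to pass to $index->find()

var_dump(strpos($indexed, $needle) === 0); // bool(true)
```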

These features are also documented now.

Take SVN version 🙂

