Zend_Search and UTF-8 support

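Good news today: Lucene in Zend Framework finally supports UTF-8. I had wanted to use it for on-site search before, but it could not build UTF-8 indexes. I haven't really used the new version yet, so I can't say how well it works. If you need Lucene, you can download the latest version from SVN. The original email follows: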

Hi all,

Zend_Search_Lucene has some improvements in encoding management and
UTF-8 support.

1. Encoding of indexed documents.

It's no longer necessary to transform text into 'ascii//translit' before
indexing.

Field creation methods have optional $encoding parameter now (excluding
Zend_Search_Lucene_Field::Binary() method).
It may differ for different documents as well as for different fields
within one document:
————
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title,
'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents,
'utf-8'));
————

If encoding parameter is omitted, then current locale is used at
processing time.
Current locale may also contain character set info:
————
setlocale(LC_ALL, 'de_DE.iso-8859-1');
————

Fields are always stored and returned from index in UTF-8 encoding.
Conversion to UTF-8 proceeds automatically.
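
To make the automatic conversion concrete, here is a minimal sketch that uses PHP's own iconv() rather than Zend_Search_Lucene. The variable names and the sample string are made up for illustration, and this only approximates what such a conversion amounts to. It is not the library's code.

```php
<?php
// Hypothetical sample: "Café" as ISO-8859-1 bytes (é is the single byte 0xE9).
$latin1Title = "Caf\xE9";

// Convert to UTF-8, the encoding in which the email says fields are stored.
$utf8Title = iconv('ISO-8859-1', 'UTF-8', $latin1Title);

// The same character takes two bytes (0xC3 0xA9) once it is UTF-8.
echo bin2hex($latin1Title), "\n"; // 436166e9
echo bin2hex($utf8Title), "\n";   // 436166c3a9
```

This is why the source encoding has to be known per field: the byte 0xE9 means "é" only if it is read as ISO-8859-1.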

Text analyzers may also convert text to some other encodings. Actually,
default analyzer converts text to 'ASCII//TRANSLIT' encoding.
Such translation may be affected by current locale.
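
The locale dependence can be tried out directly with iconv(). The sketch below is an assumption about how to observe the effect, not the analyzer's implementation; the helper name toAsciiTranslit() is hypothetical, and the exact output depends on the iconv build and on which locales are installed.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to plain ASCII with iconv().
// Returns null if iconv() cannot convert the input at all.
function toAsciiTranslit(string $utf8Text): ?string
{
    // @ hides the notice iconv() raises on failure; the result is checked instead.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $utf8Text);
    return $ascii === false ? null : $ascii;
}

// Try a German locale first and fall back to "C" if it is not installed.
setlocale(LC_CTYPE, 'de_DE.UTF-8', 'C');

// "Über" in UTF-8. Depending on platform and locale, the non-ASCII letter may
// become a close ASCII equivalent or a placeholder such as "?".
var_dump(toAsciiTranslit("\xC3\x9Cber"));
```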

2. Search results encoding.

Search results (stored fields) are always returned in UTF-8.

3. Query string encoding.

Zend_Search_Lucene_Search_QueryParser::parse() method has optional
encoding parameter now. It's used to specify query string encoding:
————
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr,
'iso-8859-5');
————
If encoding is omitted, then current locale is used.

It's also possible to specify default query string encoding with
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding() method:
————
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-5');
…
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
————

Zend_Search_Lucene_Search_QueryParser::getDefaultEncoding() returns
current default query string encoding (empty string means "current locale").
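
Why the parser needs to know the query string encoding can be shown with plain iconv(), without the query parser itself. In this sketch the sample bytes and variable names are hypothetical; it only illustrates that the same bytes decode to different text under different assumed encodings.

```php
<?php
// Hypothetical raw query: the Russian word "поиск" ("search") as ISO-8859-5 bytes.
$rawQuery = "\xDF\xDE\xD8\xE1\xDA";

// Decoded with the right source encoding, the bytes become Cyrillic UTF-8 text.
$asCyrillic = iconv('ISO-8859-5', 'UTF-8', $rawQuery);

// Decoded as ISO-8859-1, the very same bytes become unrelated Latin letters.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $rawQuery);

echo $asCyrillic, "\n"; // поиск
echo $asLatin1, "\n";   // ßÞØáÚ
```

A query decoded with the wrong encoding would simply never match terms that were indexed from correctly decoded text.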

4. Limited functionality UTF-8 analyzer.

New text analyzer with UTF-8 support is provided with Zend_Search_Lucene
now.

It can be turned on with:
————
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
————

This analyzer tokenizes data for indexing in UTF-8 mode and has no
problems with umlauts, Cyrillic, Arabic, Chinese and any other
characters, which may be represented in UTF-8.

Limitations:
– utf-8 analyzer treats all non-ascii characters as letters (it's not
always true).
– it's case-sensitive.

Because of these limitations it's not set as default.
But it may be helpful for someone.
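
Both limitations can be reproduced with a few lines of plain PHP. The tokenizer below is a hypothetical stand-in written for this illustration, not the code of the Utf8 analyzer; it assumes that "treats all non-ascii characters as letters" means every byte at or above 0x80 counts as part of a word.

```php
<?php
// Hypothetical tokenizer: a token is a run of ASCII letters and/or bytes >= 0x80.
// The pattern runs in byte mode (no /u flag), so every byte of a multi-byte
// UTF-8 character counts as a "letter".
function tokenizeNaive(string $utf8Text): array
{
    preg_match_all('/[A-Za-z\x80-\xFF]+/', $utf8Text, $matches);
    return $matches[0];
}

// "Preis: 5€ für Äpfel", written with byte escapes so the file encoding does not matter.
$tokens = tokenizeNaive("Preis: 5\xE2\x82\xAC f\xC3\xBCr \xC3\x84pfel");
print_r($tokens); // Preis, €, für, Äpfel: the euro sign ends up as a "word"

// No case folding happens, so "Äpfel" and "äpfel" are different terms.
var_dump(in_array("\xC3\xA4pfel", $tokens, true)); // bool(false)
```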

Case insensitivity may be emulated with strtolower() conversion:
————
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

$doc = new Zend_Search_Lucene_Document();

$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
strtolower($contents)));

// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
strtolower($title)));

// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
————

The same conversion has to be performed with query string:
————
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');

Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());

$hits = $index->find(strtolower($query));
————
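
One caveat about the strtolower() emulation: strtolower() works on single bytes, and in PHP 8.2 and later it only folds ASCII letters, so non-ASCII characters in UTF-8 text are left untouched. If the text is UTF-8 and the mbstring extension is available (both are assumptions here), mb_strtolower() is a possible substitute. The helper name foldCaseUtf8() is hypothetical.

```php
<?php
// Hypothetical helper: lowercase UTF-8 text before indexing or searching.
function foldCaseUtf8(string $utf8Text): string
{
    return mb_strtolower($utf8Text, 'UTF-8');
}

$title = "\xC3\x84PFEL UND BIRNEN"; // "ÄPFEL UND BIRNEN" in UTF-8

echo foldCaseUtf8($title), "\n"; // äpfel und birnen

// Apply the same folding to the query so that both sides match.
var_dump(foldCaseUtf8("\xC3\x84pfel") === foldCaseUtf8("\xC3\xA4PFEL")); // bool(true)
```

As with strtolower(), the folded value would go into the indexed field and the query, while the original title is kept in a separate stored field for display.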

These features are also documented now.

Take SVN version 🙂
