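Good news today: the Lucene component in Zend Framework finally supports UTF-8. I had wanted to use it for site search before, but it could not build UTF-8 indexes. I haven't really tried the new version yet, so I can't say how well it works. If you need Lucene, you can download the latest version from SVN. The original email follows: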
1. Encoding of indexed documents.
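Field creation methods now take an optional encoding parameter (except the Zend_Search_Lucene_Field::Binary() method).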
It may differ for different documents as well as for different fields
within one document:
————
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title,
'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents,
'utf-8'));
————
If the encoding parameter is omitted, the current locale is used at
processing time.
Current locale may also contain character set info:
————
setlocale(LC_ALL, 'de_DE.iso-8859-1');
————
Fields are always stored and returned from index in UTF-8 encoding.
Conversion to UTF-8 proceeds automatically.
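As a rough illustration of what such a conversion involves, here is a minimal sketch using PHP's iconv(). The helper name toUtf8() is hypothetical and is not part of Zend_Search_Lucene; it only assumes that the conversion behaves like a plain iconv() call from the field's declared encoding to UTF-8.

```php
<?php
// Hypothetical helper for illustration; not part of Zend_Search_Lucene.
function toUtf8(string $value, string $encoding): string
{
    // Nothing to do if the value is already declared as UTF-8.
    if (strcasecmp($encoding, 'UTF-8') === 0) {
        return $value;
    }

    $converted = iconv($encoding, 'UTF-8', $value);
    if ($converted === false) {
        throw new InvalidArgumentException("Cannot convert from $encoding to UTF-8");
    }

    return $converted;
}

$latin1Title = "M\xFCller";                 // "Müller" encoded as ISO-8859-1
$utf8Title   = toUtf8($latin1Title, 'ISO-8859-1');

echo bin2hex($utf8Title), "\n";             // prints 4dc3bc6c6c6572
```

The single byte 0xFC becomes the two-byte sequence 0xC3 0xBC, which is why the declared encoding has to match the actual bytes of the field value.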
Text analyzers may also convert text to other encodings. In fact, the
default analyzer converts text to 'ASCII//TRANSLIT'.
This transliteration may be affected by the current locale.
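To see the locale dependence for yourself, the following sketch runs the same kind of transliteration directly through iconv(). The helper toAsciiTranslit() is a hypothetical name, and the 'de_DE.UTF-8' locale is only an example that may not be installed on your system. No expected output is shown on purpose, because the result differs between iconv implementations and locales.

```php
<?php
// Hypothetical helper for illustration; not part of Zend_Search_Lucene.
function toAsciiTranslit(string $utf8Text): ?string
{
    // //TRANSLIT asks iconv to approximate characters that ASCII lacks.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8Text);

    return $ascii === false ? null : $ascii;
}

// setlocale() returns false if the requested locale is not installed.
$locale = setlocale(LC_CTYPE, 'de_DE.UTF-8');
echo 'LC_CTYPE: ', ($locale === false ? 'unchanged' : $locale), "\n";

// The exact result depends on the iconv implementation and the locale.
var_dump(toAsciiTranslit("Gr\u{00FC}\u{00DF}e")); // input is "Grüße"
```

Running it once with and once without the setlocale() call is a quick way to check how your own server transliterates umlauts before you build an index.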
2. Search results encoding.
Search results (stored fields) are always returned in UTF-8.
3. Query string encoding.
Zend_Search_Lucene_Search_QueryParser::parse() method has optional
encoding parameter now. It's used to specify query string encoding:
————
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr,
'iso-8859-5');
————
If encoding is omitted, then current locale is used.
It's also possible to specify default query string encoding with
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding() method:
————
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-5');
…
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
————
Zend_Search_Lucene_Search_QueryParser::getDefaultEncoding() returns
current default query string encoding (empty string means "current locale").
4. Limited functionality UTF-8 analyzer.
New text analyzer with UTF-8 support is provided with Zend_Search_Lucene
now.
It can be turned on with:
————
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
————
This analyzer tokenizes data for indexing in UTF-8 mode and has no
problems with umlauts, Cyrillic, Arabic, Chinese and any other
characters, which may be represented in UTF-8.
Limitations:
– The UTF-8 analyzer treats all non-ASCII characters as letters (which is
not always correct).
– It is case-sensitive.
Because of these limitations it's not set as default.
But it may be helpful for someone.
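The two limitations can be illustrated with a small stand-alone tokenizer. This is only an approximation written for this post, assuming that "letters" means ASCII letters plus every byte of a non-ASCII UTF-8 sequence; it is not the analyzer's actual implementation, and tokenizeLikeUtf8Analyzer() is a hypothetical name.

```php
<?php
// Hypothetical approximation for illustration; not the real analyzer code.
function tokenizeLikeUtf8Analyzer(string $utf8Text): array
{
    // Byte-mode regex: ASCII letters plus any byte >= 0x80 count as "letters".
    preg_match_all('/[A-Za-z\x80-\xFF]+/', $utf8Text, $matches);

    return $matches[0];
}

// Cyrillic words are split correctly on ASCII punctuation and spaces.
print_r(tokenizeLikeUtf8Analyzer('Привет, мир'));

// Non-ASCII punctuation is treated as letters and stays glued to the word.
print_r(tokenizeLikeUtf8Analyzer('«quoted» text'));

// No case folding happens, so "Word" and "word" are different tokens.
print_r(tokenizeLikeUtf8Analyzer('Word word'));
```

The second call returns «quoted» as one token, which shows why a search for quoted would miss that document under this kind of tokenization.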
Case insensitivity may be emulated with strtolower() conversion:
————
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
strtolower($contents)));
// Title field for search through (indexed, unstored)
$doc->addField(Zend_Search_Lucene_Field::UnStored('title',
strtolower($title)));
// Title field for retrieving (unindexed, stored)
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
————
The same conversion has to be performed with query string:
————
setlocale(LC_CTYPE, 'de_DE.iso-8859-1');
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
$hits = $index->find(strtolower($query));
————
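One caveat about strtolower(): it works byte by byte and, depending on the PHP version, either follows the current locale or handles ASCII only (the latter since PHP 8.2). If your text is already UTF-8, mb_strtolower() from the mbstring extension may be a safer way to do the same folding. This is a substitution of my own, not something the email recommends, and foldCaseUtf8() is a hypothetical helper name.

```php
<?php
// Hypothetical helper for illustration; requires the mbstring extension.
function foldCaseUtf8(string $utf8Text): string
{
    // Lowercases multibyte characters, not only ASCII letters.
    return mb_strtolower($utf8Text, 'UTF-8');
}

// This file must be saved as UTF-8 for the literals below to be valid.
$title = 'ÄRGER im BÜRO';
$query = 'Büro';

// Apply the same folding at indexing time and at query time.
$indexedTitle = foldCaseUtf8($title);
$foldedQuery  = foldCaseUtf8($query);

echo $indexedTitle, "\n";   // prints "ärger im büro"
var_dump(strpos($indexedTitle, $foldedQuery) !== false); // bool(true)
```

The folded strings would then be passed to the field constructors and to $index->find() in place of the strtolower() results.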
These features are also documented now.
Take the SVN version 🙂