Hermanland

In search of UTF-8 encoded text

Text search is a useful function in many applications. English words are delimited by a character such as a space, but Asian languages such as Chinese are written without word delimiters, which poses a challenge for a search engine when it builds its index. To the engine, a sentence like 我能吞下玻璃而不伤身体 has no word boundaries at all. Yet we may search for 玻璃, or even for a fragment such as 吞下玻, and the system should still find the text. One solution is to integrate a dictionary into the search engine. Another is to implement an analysis mechanism that splits the text into tokens.

The Jakarta Lucene search engine is popular in Java applications. Some packages use org.apache.lucene.analysis.SimpleAnalyzer by default, which works well only for space-delimited text such as English: it splits text at non-letter characters, so an unbroken run of Chinese characters ends up as a single token. We need to change to org.apache.lucene.analysis.standard.StandardAnalyzer, which indexes each Chinese character as its own token, so that searching UTF-8 encoded text also works for Chinese and other Asian languages.
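To illustrate why the choice of analyzer matters, here is a minimal sketch in plain Java. It does not use the Lucene API; the class AnalyzerSketch and its methods isCjk, tokenize and matches are hypothetical names made up for this example, and treating only Han, Hiragana and Katakana characters as "CJK" is a simplifying assumption. It only imitates the two strategies: grouping letters into runs, versus emitting each CJK character as a token of its own.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only: these helpers imitate two tokenizing
// strategies. They are not Lucene classes and do not call the Lucene API.
class AnalyzerSketch {

    // Assumption: treat Han, Hiragana and Katakana characters as "CJK".
    static boolean isCjk(int codePoint) {
        Character.UnicodeScript script = Character.UnicodeScript.of(codePoint);
        return script == Character.UnicodeScript.HAN
                || script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA;
    }

    // Splits text into lowercased tokens. Letters are grouped into runs; when
    // splitCjk is true, each CJK character becomes a token of its own.
    static List<String> tokenize(String text, boolean splitCjk) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean single = splitCjk && isCjk(cp);
            if (Character.isLetter(cp) && !single) {
                run.appendCodePoint(Character.toLowerCase(cp));
                continue;
            }
            // A non-letter or a stand-alone CJK character ends the current run.
            if (run.length() > 0) {
                tokens.add(run.toString());
                run.setLength(0);
            }
            if (single) {
                tokens.add(new String(Character.toChars(cp)));
            }
        }
        if (run.length() > 0) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    // A query matches when its tokens appear consecutively in the document.
    static boolean matches(String document, String query, boolean splitCjk) {
        List<String> doc = tokenize(document, splitCjk);
        List<String> q = tokenize(query, splitCjk);
        return !q.isEmpty() && Collections.indexOfSubList(doc, q) >= 0;
    }

    public static void main(String[] args) {
        String sentence = "我能吞下玻璃而不伤身体";
        System.out.println(tokenize(sentence, false));
        System.out.println(tokenize(sentence, true));
        System.out.println(matches(sentence, "玻璃", false));
        System.out.println(matches(sentence, "吞下玻", true));
    }
}
```

With splitCjk set to false, the whole sentence becomes one token, so a query for 玻璃 finds nothing. With splitCjk set to true, the sentence becomes eleven single-character tokens, and both 玻璃 and the fragment 吞下玻 match as consecutive tokens. A real analyzer does more than this (stop words, position handling and so on), so treat the sketch as a picture of the idea only.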

Can you see the text below?

Arabic : أنا قادر على أكل الزجاج و هذا لا يؤلمني.
Chinese (Simplified): 我能吞下玻璃而不伤身体。
Chinese (Traditional): 我能吞下玻璃而不傷身體。
Deutsch / German: Ich kann Glas essen, ohne mir weh zu tun.
English: I can eat glass and it doesn’t hurt me.
French: Je peux manger du verre, ça ne me fait pas de mal.
Greek: Μπορώ να φάω σπασμένα γυαλιά χωρίς να πάθω τίποτα.
Japanese: 私はガラスを食べられます。それは私を傷つけません。


The samples are taken from the fantastic UTF-8 Sampler website. Thank you.

Written by herman

February 23rd, 2008 at 12:09 am

Posted in Computing
