Tuesday, January 31, 2012

Kakasi-java: born again

The Japanese language uses several writing systems, including kanji and the two kana syllabaries, hiragana and katakana. In software, we sometimes need to convert a text from one system to another, and that is difficult.

Kakasi and MeCab are open source libraries that can be used to convert kanji to hiragana or katakana. For instance, they can transform "国際財務報告基準" to "こくさいざいむほうこくきじゅん" or even to "kokusaizaimuhoukokukijun". In other words, they transform logograms (symbols with multiple possible readings) into syllables.
That is very tricky, because for instance "経緯" can be transformed to "keii", but also to "ikisatsu" depending on the context or speaker. Kakasi sometimes gets it wrong, but usually it is not that bad. MeCab is actually better at that.
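
To make the idea concrete, here is a minimal sketch of dictionary-based conversion using a greedy longest match. It is a toy for illustration only, not the actual code of Kakasi or MeCab: the class name ToyKanjiReader and its five-entry dictionary are made up for this example. Because each word has a single hard-coded reading, "経緯" always comes out as "けいい" and never as "いきさつ", which is exactly the kind of mistake a context-free converter makes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy converter, NOT the Kakasi or MeCab implementation.
class ToyKanjiReader {
    // Made-up mini-dictionary: kanji word -> one hiragana reading.
    private static final Map<String, String> DICT = new HashMap<>();
    private static int maxWordLength = 0;

    static {
        add("国際", "こくさい");
        add("財務", "ざいむ");
        add("報告", "ほうこく");
        add("基準", "きじゅん");
        add("経緯", "けいい"); // "いきさつ" is also valid; a toy cannot choose by context
    }

    private static void add(String word, String reading) {
        DICT.put(word, reading);
        maxWordLength = Math.max(maxWordLength, word.length());
    }

    // Greedy longest match: at each position, try the longest dictionary
    // word first; characters with no entry are copied through unchanged.
    static String toHiragana(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            boolean matched = false;
            int longest = Math.min(maxWordLength, text.length() - i);
            for (int len = longest; len >= 1; len--) {
                String reading = DICT.get(text.substring(i, i + len));
                if (reading != null) {
                    out.append(reading);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHiragana("国際財務報告基準"));
        System.out.println(toHiragana("経緯"));
    }
}
```

A real converter needs a much larger dictionary and some way to pick between readings, which is where the tools differ in quality.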

Yesterday I decided to add a "furigana" feature to my Android flashcards app. Furigana are small kana printed next to kanji to show their reading. They help people read difficult kanji and are used a lot in mass media: books, newspapers, signs, advertisements.
Kakasi and MeCab can both do this conversion, but their internal algorithms are very different, leading to different speed/quality/simplicity characteristics. Before jumping to MeCab, I decided to also give Kakasi a try.

Unfortunately, Kakasi is written in C, and thus not easy to run on Android. Porting from C to Java would be possible, but before doing it I had to make sure nobody had ported it already. After multiple searches, I finally found a tar file on the blog of Kenichi Maehashi, who wrote "現在どこからも入手できないようです" ("It seems it currently cannot be obtained from anywhere"). In other words: Kakasi-java could no longer be found on the Internet, so he uploaded the 0.4 version he miraculously found in his backups.

To make improvements and fixes possible, I took the source, compiled and tested it, wrote a little README file and created a project for it on GitHub. Code contributions are welcome :-)

The ideal solution would be a Java port of MeCab, but that does not seem to exist. MeCab has a Java binding, but it is not 100% Java: it requires JNI calls, which is not a great idea for Android.
Nicolas Raoul

2016 update: I just created Jakaroma. Its kanji transliteration is much more accurate, so please use it instead of Kakasi-java. It is also open source.

6 comments:

  1. I have a similar project to implement kakasi in Python.
    https://github.com/miurahr/pykakasi

    Kakasi is very old software, and I am surprised to see it reborn here. I'm glad to see someone interested in kakasi.

    The pykakasi project stopped in a partially complete state, because it is a side project of unihandecode, a universal transliteration library.

    If you or any reader is interested in improving it, you are welcome.

    Hiroshi

  2. Hello Miura-san,

    Very nice!
    I am curious, why did you choose to port Kakasi, not MeCab?
    MeCab seems to be more accurate.
    Maybe MeCab did not exist at that time?

    Keep up the great work!
    Nicolas

  3. This comment has been removed by the author.

  4. How do I use kakasi on Android properly?
    It always says "KanjiConverter does not support character set: "
    Please give me a brief tutorial on using it in Android Studio. I would be very grateful.

  5. @dev: You should use Jakaroma instead, it is much more accurate :-) https://github.com/nicolas-raoul/jakaroma

    Replies
    1. Works like a charm :D
      Thank you very much Nicolas, you have done great work.
