Korean

A library for Korean morphology

Introduction

Sometimes you need to localize a project for Korean, but common internationalization solutions such as Gettext do not work well with non-Indo-European languages. They tend to produce awkward Korean sentences because Korean morphology differs from that of Indo-European languages in many ways.

korean is a Python module that provides Korean morphological functions for producing natural Korean sentences.

Allomorphic particle

In English, “be” has allomorphs, so an English localization system should be able to select the correct form, such as “is”, “am”, or “are”. Fortunately, Gettext offers ngettext to build natural plural expressions. Without it, you would see an awkward sentence like this:

>>> print _('Here is(are) %d apple(s).') % 1
Here is(are) 1 apple(s).
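For comparison, plural selection with ngettext looks roughly like the following. This is a minimal sketch using Python's standard gettext module with NullTranslations, which has no catalog and simply returns the singular or plural msgid. The apples() helper is hypothetical, and a real project would load a compiled catalog instead.

```python
import gettext

# NullTranslations has no catalog: ngettext() returns the singular msgid
# when n == 1 and the plural msgid otherwise.
translations = gettext.NullTranslations()


def apples(n):
    """Hypothetical helper that picks the message form for n apples."""
    template = translations.ngettext('Here is %d apple.',
                                     'Here are %d apples.', n)
    return template % n


print(apples(1))
print(apples(3))
```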

Some Korean particles (postpositions) also have allomorphs, but they need a different selection rule: the choice depends on the preceding phoneme. Common internationalization solutions don’t support this. Of course, korean does:

>>> from korean import Noun, NumberWord, Loanword
>>> fmt = u'{subj:은} {obj:을} 먹었다.'
>>> fmt2 = u'{subj:은} 레벨 {level:이} 되었다.'
>>> print fmt.format(subj=Noun(u'나'), obj=Noun(u'밥'))
나는 밥을 먹었다.
>>> print fmt.format(subj=Noun(u'학생'), obj=Noun(u'돈까스'))
학생은 돈까스를 먹었다.
>>> print fmt2.format(subj=Noun(u'용사'), level=NumberWord(4))
용사는 레벨 4가 되었다.
>>> print fmt2.format(subj=Noun(u'마왕'), level=NumberWord(98))
마왕은 레벨 98이 되었다.
>>> print fmt2.format(subj=Loanword(u'Leonardo da Vinci', 'ita'),
...                   level=NumberWord(67))
Leonardo da Vinci는 레벨 67이 되었다.
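The selection rule behind these examples can be approximated without the library. The following is a minimal sketch, not korean's actual implementation: SketchNoun, PAIRS, and has_final are hypothetical names, and only precomposed Hangul syllables are handled (no numbers or loanwords). It relies on the Unicode layout of Hangul syllables, where each syllable has one of 28 final-consonant slots and slot 0 means "no final".

```python
# Hypothetical table: particle -> (form after a consonant, form after a vowel).
PAIRS = {
    '은': ('은', '는'), '는': ('은', '는'),
    '을': ('을', '를'), '를': ('을', '를'),
    '이': ('이', '가'), '가': ('이', '가'),
}


def has_final(char):
    """True if a precomposed Hangul syllable ends with a final consonant."""
    offset = ord(char) - 0xAC00           # distance from "가"
    if not 0 <= offset <= 0xD7A3 - 0xAC00:
        raise ValueError('not a Hangul syllable: %r' % char)
    return offset % 28 != 0               # 28 final slots, 0 means none


class SketchNoun(str):
    """A str whose format spec names the particle to attach."""

    def __format__(self, spec):
        if spec not in PAIRS:
            return str.__format__(self, spec)
        after_consonant, after_vowel = PAIRS[spec]
        particle = after_consonant if has_final(self[-1]) else after_vowel
        return str(self) + particle


fmt = '{subj:은} {obj:을} 먹었다.'
print(fmt.format(subj=SketchNoun('나'), obj=SketchNoun('밥')))
print(fmt.format(subj=SketchNoun('학생'), obj=SketchNoun('돈까스')))
```

Putting the choice in __format__ keeps the message template free of conditionals, which is why the same template string works for every subject and object.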

Working with Gettext

It also works with Gettext. Just use the korean.l10n.patch_gettext() function:

msgid ""
msgstr ""
"Locale: ko_KR\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: 8bit\n"

msgid "I like a {0}."
msgstr "나는 {0:을} 좋아합니다."

msgid "banana"
msgstr "바나나"

msgid "game"
msgstr "게임"

>>> from babel.support import Translations
>>> import korean
>>> translations = Translations.load('i18n', 'ko_KR')
>>> korean.l10n.patch_gettext(translations)
>>> _ = translations.ugettext
>>> _(u'I like a {0}.').format(_(u'banana'))
나는 바나나를 좋아합니다.
>>> _(u'I like a {0}.').format(_(u'game'))
나는 게임을 좋아합니다.

Proofreading legacy text

If your text has already been written with naive particles such as “을(를)”, use the korean.l10n.proofread() function to get the correct particles:

>>> import korean
>>> korean.l10n.proofread(u'용사은(는) 검을(를) 획득했다.')
용사는 검을 획득했다.
>>> korean.l10n.proofread(u'집(으)로 가자.')
집으로 가자.

API

korean.morphology

copyright: (c) 2012-2013 by Heungsub Lee
license: BSD, see LICENSE for more details.

class korean.morphology.Morpheme(*forms)

This class represents a morpheme (형태소) or an allomorph (이형태). It can have one or more forms. The first form is the basic allomorph (기본형).

Parameters: forms – the forms of the allomorph. The first form is the basic allomorph.

basic()

The basic form of the allomorph.

classmethod get(key)

Returns a pre-defined morpheme object by the given key.

read()

Every morpheme class should implement this method. It should render the morpheme as valid Korean text in Hangul.

classmethod register(key, obj)

Registers a pre-defined morpheme object to the given key.

class korean.morphology.Particle(after_vowel, after_consonant=None, after_rieul=None)

A particle (조사) is a postposition in Korean. Some particles have different allomorphs, such as 을/를 and 이/가. The form is chosen by the phoneme that ends the preceding syllable: a vowel, a consonant, or Rieul (ㄹ).
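The three-way choice can be sketched as a plain function. This is an illustration under stated assumptions rather than the library's code: select_form and final_index are hypothetical names, the parameter names simply mirror the constructor signature above, and the fallback defaults (consonant form falls back to the vowel form, Rieul form falls back to the consonant form) are a guess. Only precomposed Hangul syllables are handled.

```python
RIEUL_FINAL = 8   # index of ㄹ among the 28 final-consonant slots


def final_index(char):
    """Index of the final consonant of a precomposed syllable (0 = none)."""
    offset = ord(char) - ord('가')
    if not 0 <= offset < 19 * 21 * 28:
        raise ValueError('not a Hangul syllable: %r' % char)
    return offset % 28


def select_form(preceding, after_vowel, after_consonant=None,
                after_rieul=None):
    """Picks the particle form that fits the end of the preceding word."""
    if after_consonant is None:
        after_consonant = after_vowel       # assumed fallback
    if after_rieul is None:
        after_rieul = after_consonant       # assumed fallback
    final = final_index(preceding[-1])
    if final == 0:
        return after_vowel
    if final == RIEUL_FINAL:
        return after_rieul
    return after_consonant


print('집' + select_form('집', '로', '으로', '로'))
print('서울' + select_form('서울', '로', '으로', '로'))
print('학교' + select_form('학교', '로', '으로', '로'))
print('검' + select_form('검', '를', '을'))
```

The separate Rieul case matters for particles like (으)로, which take the short form 로 after both vowels and ㄹ.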

class korean.morphology.Substantive(*forms)

A class for a Korean substantive, called “체언” in Korean.

class korean.morphology.Noun(*forms)

A class for a Korean noun, called “명사” in Korean.

read()

Reads a noun as Korean. The result will be Hangul.

>>> Noun('레벨42').read()
'레벨사십이'

class korean.morphology.NumberWord(number)

A class for a Korean number word, called “수사” in Korean.

read()

Reads the number as Korean.

>>> NumberWord(1234567890).read()
'십이억삼천사백오십육만칠천팔백구십'
>>> NumberWord(0).read()
'영'

classmethod read_phases(number)

Reads the number as Korean but separates the result at each power of 10,000.

>>> NumberWord.read_phases(1234567890)
('십이억', '삼천사백오십육만', '칠천팔백구십')
>>> NumberWord.read_phases(0)
('영',)
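Grouping by powers of 10,000 can be sketched in a few lines. This is a standalone illustration, not the library's implementation: read_group and this read_phases are hypothetical, it covers non-negative integers below 10^20 only, and conventions such as whether 10000 reads as “만” or “일만” are not settled here (this sketch produces “일만”).

```python
DIGITS = '영일이삼사오육칠팔구'
SMALL_UNITS = ['', '십', '백', '천']        # 10^0 .. 10^3 inside a group
BIG_UNITS = ['', '만', '억', '조', '경']    # 10^0, 10^4, 10^8, 10^12, 10^16


def read_group(group):
    """Reads 1..9999; a leading 1 before 십/백/천 is left unspoken."""
    parts = []
    for power in (3, 2, 1, 0):
        digit = group // 10 ** power % 10
        if digit == 0:
            continue
        if digit == 1 and power > 0:
            parts.append(SMALL_UNITS[power])
        else:
            parts.append(DIGITS[digit] + SMALL_UNITS[power])
    return ''.join(parts)


def read_phases(number):
    """Returns one Hangul string per non-zero group of four digits."""
    if number < 0 or number >= 10 ** 20:
        raise ValueError('out of range for this sketch')
    if number == 0:
        return (DIGITS[0],)
    phases = []
    index = 0
    while number > 0:
        number, group = divmod(number, 10000)
        if group:
            phases.append(read_group(group) + BIG_UNITS[index])
        index += 1
    return tuple(reversed(phases))


print(read_phases(1234567890))
print(''.join(read_phases(1234567890)))
```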

class korean.morphology.Loanword(word, code=None, iso639=None, lang=None)

A class for a loanword, called “외래어” in Korean. It depends on Hangulize, which automatically transcribes a non-Korean word into Hangul.

New in version 0.1.4.

read()

Transcribes into Hangul using Hangulize.

>>> Loanword('Guido van Rossum', 'nld').read()
'히도 판로쉼'
>>> Loanword('საქართველო', 'kat').read()
'사카르트벨로'
>>> Loanword('Leonardo da Vinci', 'ita').read()
'레오나르도 다 빈치'

korean.l10n

Helpers for localization to Korean.

copyright: (c) 2012-2013 by Heungsub Lee
license: BSD, see LICENSE for more details.

class korean.l10n.Proofreading(token_types)

A function-like class. Its __call__() replaces naive particles with the correct ones. First, it finds naive particles such as “을(를)” or “(으)로”. Then it checks the character preceding each particle and replaces the naive form with the correct particle.

Parameters: token_types – the types to recognize as tokens.

parse(text)

Tokenizes the given text into unicode strings and Particle objects.

Parameters: text – the string that has been written with naive particles.

korean.l10n.proofread = <korean.l10n.Proofreading object>

Default Proofreading object. It tokenizes unicode and korean.Particle. Use it like a function.
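The find-then-replace idea can be approximated with a regular expression. This is only a sketch under assumptions: NAIVE, PATTERN, and proofread_sketch are hypothetical, the table lists just four naive spellings, and the real Proofreading class tokenizes the text rather than using a single regex.

```python
import re

# Hypothetical table: naive spelling -> (after vowel, after consonant, after ㄹ)
NAIVE = {
    '은(는)': ('는', '은', '은'),
    '을(를)': ('를', '을', '을'),
    '이(가)': ('가', '이', '이'),
    '(으)로': ('로', '으로', '로'),
}

# One Hangul syllable followed by one of the naive spellings.
PATTERN = re.compile('([가-힣])(' + '|'.join(map(re.escape, NAIVE)) + ')')


def _replace(match):
    prev, naive = match.groups()
    after_vowel, after_consonant, after_rieul = NAIVE[naive]
    final = (ord(prev) - ord('가')) % 28    # 0 = no final, 8 = ㄹ
    if final == 0:
        particle = after_vowel
    elif final == 8:
        particle = after_rieul
    else:
        particle = after_consonant
    return prev + particle


def proofread_sketch(text):
    """Replaces each naive particle according to the preceding syllable."""
    return PATTERN.sub(_replace, text)


print(proofread_sketch('용사은(는) 검을(를) 획득했다.'))
print(proofread_sketch('집(으)로 가자.'))
```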

class korean.l10n.Template

The Template object extends unicode and overrides the format() method. It can apply a particle format spec without explicit Noun or NumberWord arguments.

Basically this example:

>>> import korean
>>> korean.l10n.Template('{0:을} 좋아합니다.').format('향수')
'향수를 좋아합니다.'

Is equivalent to the following:

>>> import korean
>>> '{0:을} 좋아합니다.'.format(korean.Noun('향수'))
'향수를 좋아합니다.'
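The wrapping behaviour can be imitated with a str subclass. The sketch below is an assumption about the general approach, not the library's code: TemplateSketch and _Josa are hypothetical names, only three particles are listed, and the last character of each argument is assumed to be a precomposed Hangul syllable.

```python
class _Josa(str):
    """Hypothetical stand-in for Noun: the format spec names a particle."""

    # particle -> (form after a vowel, form after a consonant)
    FORMS = {'을': ('를', '을'), '은': ('는', '은'), '이': ('가', '이')}

    def __format__(self, spec):
        if spec not in self.FORMS:
            return str.__format__(self, spec)
        after_vowel, after_consonant = self.FORMS[spec]
        has_final = (ord(self[-1]) - ord('가')) % 28 != 0
        return str(self) + (after_consonant if has_final else after_vowel)


class TemplateSketch(str):
    """A str whose format() wraps plain str arguments before formatting."""

    def format(self, *args, **kwargs):
        def wrap(value):
            # Only plain strings are wrapped; other objects pass through.
            return _Josa(value) if type(value) is str else value

        args = [wrap(arg) for arg in args]
        kwargs = {key: wrap(value) for key, value in kwargs.items()}
        return str.format(self, *args, **kwargs)


print(TemplateSketch('{0:을} 좋아합니다.').format('향수'))
print(TemplateSketch('{name:이} 나타났다.').format(name='존'))
```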

korean.ext.gettext

Gettext is an internationalization and localization system commonly used for writing multilingual programs on Unix-like operating systems. This module contains utilities to integrate korean with the Gettext system. It also works well with Babel.

copyright: (c) 2012-2013 by Heungsub Lee
license: BSD, see LICENSE for more details.

korean.ext.gettext.patch_gettext(translations)

Patches a Gettext translations object so that each result is wrapped in korean.l10n.Template. The result can then work with a particle format spec.

For example, here’s a Gettext catalog for ko_KR:

msgid "{0} appears."
msgstr "{0:이} 나타났다."

msgid "John"
msgstr "존"

msgid "Christina"
msgstr "크리스티나"

You can use a particle format spec in Gettext messages after the translations object is patched:

>>> from babel.support import Translations
>>> from korean.ext.gettext import patch_gettext
>>> translations = Translations.load('i18n', 'ko_KR')
>>> translations = patch_gettext(translations)
>>> _ = translations.ugettext
>>> _('{0} appears.').format(_('John'))
'존이 나타났다.'
>>> _('{0} appears.').format(_('Christina'))
'크리스티나가 나타났다.'

Parameters: translations – the Gettext translations object to patch, which should refer to the catalog for ko_KR.
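The patching technique itself is small. The following sketch shows one way such a patch could work, using only the standard gettext module. It is not the library's implementation: patch_sketch and Wrapped are hypothetical, Wrapped merely stands in for korean.l10n.Template, and the Python 3 gettext() method is used instead of ugettext().

```python
import gettext


class Wrapped(str):
    """Hypothetical marker type standing in for korean.l10n.Template."""


def patch_sketch(translations, wrapper=Wrapped):
    """Rebinds gettext() on this one object so every result is wrapped."""
    original = translations.gettext

    def patched(message):
        return wrapper(original(message))

    translations.gettext = patched
    return translations


# NullTranslations has no catalog, so gettext() echoes the msgid.
translations = patch_sketch(gettext.NullTranslations())
result = translations.gettext('{0} appears.')
print(type(result).__name__, result)
```

Because the method is rebound on the instance, other translations objects in the same process are left untouched.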

korean.ext.jinja2

Jinja2 is one of the most widely used template engines for Python. This module contains Jinja2 extensions that make korean easy to use.

New in version 0.1.5.

Changed in version 0.1.6: Moved from korean.l10n.jinja2ext to korean.ext.jinja2.

copyright: (c) 2012-2013 by Heungsub Lee
license: BSD, see LICENSE for more details.

class korean.ext.jinja2.ProofreadingExtension(environment)

A Jinja2 extension which registers the proofread filter and the proofread block:

<h1>ProofreadingExtension Usage</h1>

<h2>Single filter</h2>
{{ (name ~ '은(는) ' ~ obj ~ '을(를) 획득했다.')|proofread }}

<h2>Filter chaining</h2>
{{ '%s은(는) %s을(를) 획득했다.'|format(name, obj)|proofread }}

<h2><code>proofread</code> block</h2>
{% proofread %}
  {{ name }}은(는) {{ obj }}을(를) 획득했다.
{% endproofread %}

<h2>Conditional <code>proofread</code> block</h2>
{% proofread locale.startswith('ko') %}
  {{ name }}은(는) {{ obj }}을(를) 획득했다.
{% endproofread %}

The import name is korean.ext.jinja2.proofread. Add it to your Jinja2 environment with the following code:

from jinja2 import Environment
jinja_env = Environment(extensions=['korean.ext.jinja2.proofread'])

New in version 0.1.5.

Changed in version 0.1.6: Added enabled argument to {% proofread %}.

korean.ext.jinja2.proofread

alias of ProofreadingExtension

korean.ext.django.templatetags.korean

A module containing a Django template tag and filter for korean.

New in version 0.1.7.

copyright: (c) 2012-2013 by Heungsub Lee, Hyunwoo Park
license: BSD, see LICENSE for more details.

korean.ext.django.templatetags.korean.do_proofread(parser, token)

A Django tag for proofread:

<h1>proofread tag Usage</h1>

{% load korean %}
{% proofread %}
  {{ name }}은(는) {{ obj }}을(를) 획득했다.
{% endproofread %}

korean.ext.django.templatetags.korean.proofread(*args, **kwargs)

A Django filter for proofread:

<h1>proofread filter Usage</h1>

{% load korean %}
{{ '용사은(는) 검을(를) 획득했다.'|proofread }}

korean.hangul

Utilities for processing strings written in Hangul. All of the code here is based on hangul.py, written by Hye-Shik Chang in 2003.

copyright: (c) 2012-2013 by Heungsub Lee and 2003 by Hye-Shik Chang
license: BSD, see LICENSE for more details.

korean.hangul.char_offset(char)

Returns Hangul character offset from “가”.

korean.hangul.is_hangul(char)

Checks if the given character is written in Hangul.

korean.hangul.is_vowel(char)

Checks if the given character is a vowel of Hangul.

korean.hangul.is_consonant(char)

Checks if the given character is a consonant of Hangul.

korean.hangul.is_initial(char)

Checks if the given character is an initial consonant of Hangul.

korean.hangul.is_final(char)

Checks if the given character is a final consonant of Hangul. The final consonants include joined multiple consonants (consonant clusters) and the empty character.

korean.hangul.get_initial(char)

Returns an initial consonant from the given character.

korean.hangul.get_vowel(char)

Returns a vowel from the given character.

korean.hangul.get_final(char)

Returns a final consonant from the given character.

korean.hangul.split_char(char)

Splits the given character into a tuple where the first item is the initial consonant, the second is the vowel, and the third is the final consonant.

korean.hangul.join_char(splitted)

Joins a tuple in the form (initial, vowel, final) to a Hangul character.
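The arithmetic behind these helpers follows from the Unicode layout of precomposed Hangul syllables: 19 initials × 21 vowels × 28 finals, starting at “가”. The functions below are a standalone sketch that reuses the documented names for readability but is not the library's code. Returning compatibility jamo, and an empty string for a missing final, are assumptions.

```python
INITIALS = 'ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ'                      # 19 initials
VOWELS = 'ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ'                    # 21 vowels
FINALS = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')  # 28 slots


def char_offset(char):
    """Offset of a precomposed Hangul syllable from "가"."""
    offset = ord(char) - ord('가')
    if not 0 <= offset < 19 * 21 * 28:
        raise ValueError('not a Hangul syllable: %r' % char)
    return offset


def split_char(char):
    """Splits a syllable into (initial, vowel, final)."""
    initial_vowel, final = divmod(char_offset(char), 28)
    initial, vowel = divmod(initial_vowel, 21)
    return INITIALS[initial], VOWELS[vowel], FINALS[final]


def join_char(splitted):
    """Joins (initial, vowel, final) back into one syllable."""
    initial, vowel, final = splitted
    offset = ((INITIALS.index(initial) * 21 + VOWELS.index(vowel)) * 28
              + FINALS.index(final))
    return chr(ord('가') + offset)


print(split_char('한'))
print(join_char(('ㄱ', 'ㅡ', 'ㄹ')))
```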

Installation

Install from PyPI with the easy_install or pip command:

$ easy_install korean
$ pip install korean

or check out the development version:

$ git clone git://github.com/sublee/korean.git

Changelog

Version 0.1.7

Version 0.1.6

  • Moves korean.l10n.jinja2ext to korean.ext.jinja2.
  • Renames {% autoproofread %} to {% proofread %}.
  • Moves korean.l10n.patch_gettext() to korean.ext.gettext.patch_gettext().
  • Adds a condition argument to conditionally enable the autoproofread Jinja2 block.
  • Fixes PEP8 errors except E301.

Version 0.1.5

Released on Jan 30th 2013.

  • Supports Python 3.
  • Adds korean.l10n.jinja2ext.ProofreadingExtension for Jinja2 template engine.

Version 0.1.4

Released on Aug 26th 2012.

Adds korean.morphology.Loanword.

Version 0.1.3

Released on Aug 15th 2012.

korean.l10n.Proofreading supports a wider variety of naive particle forms.

Version 0.1.2

Released on Aug 15th 2012.

Fixes an error on korean.l10n.Proofreading.

Version 0.1.1

Released on Aug 15th 2012.

Stops supporting Python 2.5.

Version 0.1

First public preview release.

Licensing and Author

This project is licensed under BSD, so feel free to use and modify it as long as you respect the license. See LICENSE for the details.

I’m Heungsub Lee. Any questions or patches are welcome.
