On Hangul Supremacy & Exclusivity – An Information Theory Comparison of Hangul and Hanja

[Image: road signs]

There’s a good reason why these road signs are not written out in letters.

This is one post in a series on Hangul Supremacy and Hangul Exclusivity. Hangul Supremacy (–優秀主義, 한글우수주의) is the widespread belief that Hangul is superior, especially in opposition to Chinese characters. Hangul Exclusivity (–專用, 한글전용) is closely related and refers to writing exclusively in Hangul. The purpose of these posts is to introduce Anglophone readers to the Korean debate over Hangul and Hanja.

An Information Theory Comparison of Hangul and Hanja

“A picture is worth a thousand words.” This is a well-known English proverb. Few stop to ask why it is true, because it seems obvious: describing a picture in words would indeed take a great many words. Why is that so? The answer can be found in information theory, a field of study that has made modern digital communications possible and plays an important role in several other fields, including linguistics.

A Layman’s Short Introduction to Information Theory

Information theory involves the quantification of “information.” Information should be distinguished from semantic meaning. Messages and the symbols in them often have meaning, referring to some system of physical or abstract things, but information theory does not care what these symbols actually mean. This is not to say that information has no relevance to meaning; messages tend both to communicate meaning and to carry information. Information is concerned with each symbol’s probability of being observed, that is, with how much uncertainty is removed when the symbol is observed.

Illustrative Examples

Information thus depends on the distribution of the symbols, or elements, in a set of symbols. In general, the more elements a set has, the more information each observation carries. For instance, imagine observing a fair coin toss and a fair die roll. The probability of heads and of tails is 1/2 each. The probability of each side of the die is 1/6. Comparatively, there is more uncertainty in the outcome of the die roll than in the outcome of the coin flip. When the actual outcome is observed, the uncertainty in each case is reduced to zero. Since more uncertainty was removed in the die roll, the outcome of the die roll provided more information than the outcome of the coin flip.
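The coin and die comparison can be put into numbers. The following is only a minimal sketch: it assumes the base-2 logarithmic measure introduced in the next section, and the function name `self_information` is this example's own choice.

```python
import math

def self_information(p: float) -> float:
    """Information, in bits, gained by observing an outcome of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

coin = self_information(1 / 2)  # one face of a fair coin
die = self_information(1 / 6)   # one face of a fair six-sided die

print(f"coin flip: {coin:.3f} bits")
print(f"die roll:  {die:.3f} bits")
```

The die roll comes out higher than the coin flip, matching the intuition that more uncertainty was removed.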

Thus, it can be easily seen why the proverb “a picture is worth a thousand words” is true. A picture is effectively drawn from a set with an infinite number of symbols, each with an infinitesimal probability of being observed, regardless of how each symbol is defined. Words, on the other hand, are drawn from a set with a finite number of symbols. To get an empirical sense, the vocabulary of a native speaker with higher education comprises approximately 10,000 words. The word set is thus far smaller than the picture set. Therefore, observing a picture reduces uncertainty much more than observing a word does, just as a die roll reduces more uncertainty than a coin flip.

Measuring Information

Mathematically, assuming that each symbol is independent, the information of a symbol is:

$$I(m) = -\log_2 p(m)$$

where I(m) is the information of a symbol and p(m) is the probability of observing symbol m. It is a logarithmic measure for a number of reasons. Principally, a logarithm is close to the intuitive sense of a measure of information: the amount of information increases at a slower rate than the number of symbols in a set. The unit for information is typically bits-per-symbol, because computers use binary numbers (i.e., 1s and 0s) and most applications of information theory involve digital communication using computers.

The average information for a set of symbols is called entropy. Mathematically, entropy is:

$$H(M) = -\sum_{m \in M} p(m) \log_2 p(m)$$

where H(M) is the entropy measured in bits-per-symbol, M is the total set of symbols, and p(m) is the probability of observing symbol m. There are a number of ways to interpret entropy. One is as the average amount of information provided by the distribution of the set of symbols. Under this interpretation, the higher the entropy, the higher the average information for that set of symbols. It should be noted that the average information of a set is highest when all the symbols have an equal probability of being observed. In that case, the entropy of the entire set equals the information of each symbol.
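The point about equal probabilities can be checked numerically. This is a sketch only: the two four-symbol distributions are made up for illustration, and the function name `entropy` is this example's own.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits per symbol, of a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Symbols with probability 0 contribute nothing, by the usual convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # hypothetical: four equally likely symbols
skewed = [0.70, 0.15, 0.10, 0.05]   # hypothetical: one symbol dominates

h_uniform = entropy(uniform)
h_skewed = entropy(skewed)

print(f"uniform: {h_uniform:.3f} bits per symbol")
print(f"skewed:  {h_skewed:.3f} bits per symbol")
```

With four equally likely symbols the entropy equals the information of any one symbol, log2 of 4, while the skewed distribution over the same four symbols comes out lower.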

Comparing the Information of Hangul Versus Hanja

According to information theory, Hangul should have a lower amount of average information than Hanja. Hangul is a phonetic alphabet comprising only 24 letters. Hanja, in contrast, is an ideographic script comprising more than 40,000 characters, of which only about 2,000 are considered “common use” in Korea. From the start, it can be readily recognized that there is a lot more uncertainty in observing a Hanja character than in observing a Hangul letter.

To get a sense of the disparity, assume that each symbol in each respective script occurs with equal probability and is independent. That is, each letter of Hangul occurs 1/24 of the time and each character of Hanja occurs 1/2000 of the time. (The actual probability for Hangul ranges from 0.122 for ㅇ to 0.002 for ㅋ. This blogger has not yet found a complete listing for Hanja.) Under this assumption, the entropy of Hangul is only about 4.58 bits-per-symbol (log2 of 24), while the entropy of Hanja is about 10.97 bits-per-symbol (log2 of 2,000), or about 15.29 bits-per-symbol when 40,000 characters are considered. This means that each character of Hanja conveys much more information than each letter of Hangul.
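Under the equal-probability assumption, entropy reduces to log2 of the number of symbols, so the figures can be checked directly. A minimal sketch, using the inventory sizes quoted in this post; the dictionary labels are this example's own.

```python
import math

# Inventory sizes quoted in the post; the labels are this sketch's own.
inventories = {
    "Hangul letters": 24,
    "common-use Hanja": 2_000,
    "all Hanja": 40_000,
}

# For n equally likely, independent symbols, entropy is simply log2(n).
uniform_bits = {name: math.log2(n) for name, n in inventories.items()}

for name, bits in uniform_bits.items():
    print(f"{name}: {bits:.2f} bits per symbol")
```

These are upper bounds for each inventory size; real letter and character frequencies are uneven and would give lower values.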

Of course, the assumption that each symbol in each respective script occurs with equal probability is not entirely correct. Certain symbols occur more frequently than others, and therefore the actual entropy will be lower. This, however, does not detract from the finding that each character of Hanja conveys more information than each letter of Hangul: there are still far more Hanja characters than Hangul letters. The fact that Hanja conveys more information than Hangul has ramifications for the semantic meaning conveyed by each symbol.

For example, take the Hangul syllable “일.” It has three letters: ㅇ, ㅣ, and ㄹ. Even with three letters, the semantic meaning is highly ambiguous. It could mean “one,” “work,” “day,” or even a grammatical particle. Contrast this with seeing just one Hanja character. Since each character carries a lot more information, the characters 一 (one), 業 (work), and 日 (day) are less ambiguous. Consider also the following examples, comparing one Hanja character with the number of Hangul letters required to transmit the same semantic meaning:

  • 車(1) → 차(2) (“car”)
  • 天(1) → 천(3), 하늘(5) (“sky”)
  • 止(1) → 지(2), 멈추다(7) (“to stop”)
  • 褰(1) → 건(3), 옷을걷어올리다(18) (“to lift up one’s clothes”)
  • 蔭(1) → 음(3), 조상의 공덕에 의하여 맡은 벼슬 (33) (“A bureaucratic position attained based on merits of an ancestor”)
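Letter counts like these can be tallied mechanically. The sketch below is one possible approach, not the method used for the list above: it assumes that Unicode canonical decomposition (NFD) is an acceptable way to split syllable blocks, and the helper name `count_jamo` is hypothetical. NFD treats a compound vowel such as ㅢ or a doubled consonant as a single unit, so its totals can differ from a hand tally over the 24 basic letters.

```python
import unicodedata

def count_jamo(text: str) -> int:
    """Count conjoining jamo after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only code points in the Hangul Jamo block (U+1100..U+11FF);
    # spaces and non-Hangul characters are ignored.
    return sum(1 for ch in decomposed if "\u1100" <= ch <= "\u11ff")

for word in ["차", "천", "하늘", "지", "멈추다"]:
    print(word, count_jamo(word))
```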

This finding should not be surprising. In none of these instances is the representation in Hangul more compact than the representation in Hanja. Since Hanja characters carry a higher amount of information, more Hangul letters are necessary to convey the same amount of information, and incidentally the same meaning. (One can see the same with Morse code versus the English alphabet.)

This is more apparent with prose text. Compare the original Classical Chinese text of the Pater Noster (天主經, 천주경) versus the Hangul-only Korean translation (both are Catholic translations):

[Image: the Pater Noster in Classical Chinese alongside the Hangul-only Korean translation]

Notice how few symbols the Classical Chinese text contains compared with the number of Hangul letters in the Hangul-only Korean translation. Both are roughly the same symbolic representations of the underlying semantic meaning. Hangul only appears more compact because of its arrangement into syllable blocks. Other comparisons of Classical Chinese text and mixed script versus Hangul-only representations will show the same result, without fail. Hangul is vastly inferior from an information theory perspective.

Conclusion

This blogger conceived of this argument to introduce much-needed objectivity into the debate between Hangul exclusivity and mixed script. In the end, subjective arguments, such as appeals to nationalism, history, and aesthetics, amount to mere sentiment. The “superiority” of a script is not one-dimensional; there are a number of measures. Ease of learning is one measure, albeit a very subjective one. The amount of information conveyed is another. The latter is perhaps the more significant and objective measure, because the most important function of any script is to convey meaning. There is also the issue of transcribing meanings that are not representable in Hanja versus Hangul. A mixed script with an optimal distribution of probability between the two scripts would actually increase the information content, reducing the number of symbols needed to convey the same meaning while maintaining ease of learning.

Disclosure: This blogger has had his acquaintances, who are far better versed in information theory than he is, verify the argument in this blog post.

5 comments
  1. Very interesting post! I’m not in complete agreement, but the following is meant only in the spirit of enthusiastic discussion.

    I would argue that the “arrangement of hangul into syllable blocks” is a crucial characteristic: hangul functions more like an abugida than an alphabet. It is not the number of letters but the number of syllabic combinations which should be compared to hanja. Comparing the individual letters is more like comparing the strokes or radicals of hanja characters.

    I would also question the semantic clearness of hanja characters versus hangul syllables, because many of the more common characters do not have one-to-one semantic correlations; hence literary Chinese is so difficult to translate (though that is perhaps only because we cannot be certain of specific usage in earlier periods). The meaning of hanja characters depends as much on context as hangul (although, again, this is more the case in literary Chinese than mixed-script Korean where the hanja have generally been borrowed with a single meaning or usage in a particular word).

    Whilst many hanja do carry more semantic meaning than hangul syllables, they carry significantly less phonetic information than hangul. So I would ask the question: is the most important function of a script to convey semantic or phonetic value?

    In terms of comparing the spatial efficiency of hanja against hangul, we should also consider time efficiency in writing. Here hangul probably has the edge as, on average, it requires fewer strokes to write hangul syllables than hanja characters.

    If you are advocating mixed script (which I actually support!) it will make no spatial difference because hangul syllables take up the same space as hanja characters and morphological components are still to be written in hangul. Spatial efficiency would only be improved by writing in Chinese language, as per your example!

    I’d finally also point out that hanja are not strictly pictograms. For the most part they, at best, indicate only the broad semantic category and often their current usage has evolved away from the original meaning, though I’m not sure if this is a relevant argument because they are still symbols representing meaning (though for fluent readers of any language that is what spelt words become too).

    • 歸源 said:

      It is alright. The purpose of these posts is to generate discussion.

      “I would argue that the “arrangement of hangul into syllable blocks” is a crucial characteristic: hangul functions more like an abugida than an alphabet. It is not the number of letters but the number of syllabic combinations which should be compared to hanja. Comparing the individual letters is more like comparing the strokes or radicals of hanja characters.”

      Although I didn’t mention it in the post, I did consider whether I should treat a whole syllable block of Hangul as one symbol and whether I should decompose a Chinese character into its constituent elements. I chose the analysis as it is presented in the blog post because, in the end, no matter how it’s sliced, characters contain much more information. Plus, it is pretty common to see Hangul arranged much like English (e.g., ㅎ ㅏ ㄴ ㄱ ㅡ ㄹ) in colloquial settings.

      “I would also question the semantic clearness of hanja characters versus hangul syllables, because many of the more common characters do not have one-to-one semantic correlations; hence literary Chinese is so difficult to translate (though that is perhaps only because we cannot be certain of specific usage in earlier periods). The meaning of hanja characters depends as much on context as hangul (although, again, this is more the case in literary Chinese than mixed-script Korean where the hanja have generally been borrowed with a single meaning or usage in a particular word).”

      The specific link between information and semantic meaning is unclear (as I have read in a few papers), other than the fact that there’s some relation. This is why the comparisons I make are not as clear-cut as a comparison of Morse code with the English alphabet would be.

      “Whilst many hanja do carry more semantic meaning than hangul syllables, they carry significantly less phonetic information than hangul. So I would ask the question: is the most important function of a script to convey semantic or phonetic value?”

      I would say that Hanja and Hangul convey different forms of semantic meaning, which would include the pronunciation from an information theory perspective. Recognizing this, I do value semantics over the phonetic value. This is why I support the return to mixed script.

  2. JW said:

    I agree with the main point of this post – that Hanja encodes more information and is more concise than Hangul.

    After all, inside a single Hanja we have considerable complexity. This is the case even for simplified hanzi – as you said, they’re still in a sense basically highly stylized descriptive pictures (even if the picture has changed so much that it’s no longer obvious what it describes, and even though the characters also encode phonetic information).

    However, I think that a big part of the conciseness in your example with prose text is due to the inherent conciseness of the Classical Chinese language. (It’s a language in its own right, with its own grammar that’s different not only from Korean but also different from Cantonese or Mandarin.) Although descended from a spoken language, it probably hasn’t been spoken since Confucius’s time. Classical Chinese achieves this by partly giving up the ability to be understood when spoken aloud by a simple reading of the names of the hanja. (Case in point – there was a famous poem about the Lion Eating Poet in the Stone Den by Zhao Yuanren that was essentially one syllable repeated over and over. It makes sense when read from the text but try to understand it by listening to someone else reading it out loud. This is admittedly an extreme case.)

    You could probably save some space in the Korean version by using mixed script, and then replacing stuff like the two syllable word for sky with the one character hanja for sky. But the mixed script version would still include the native Korean particles, which themselves take up a lot of space. Even if you swapped out Hangul entirely, and found a way to use hanja to represent these particles, they’d still be there taking up space. Something similar could occur in Mandarin, which also has a two syllable word for ‘sky’: 天空. Mandarin uses words with multiple syllables, more so than Classical Chinese does. The reason for this is to make it intelligible when spoken aloud.

    A case in point – the folktale about the fox borrowing the tiger’s might. Look how short the Classical Chinese version is:

    虎求百獸而食之,得狐。狐曰:子無敢食我也。天帝使我長百獸,今子食我,是逆天帝命
    也。子以我為不信,吾為子先行,子隨我後,觀百獸之見我而敢不走乎?虎以為然,故遂
    與之行。獸見之皆走。虎不知獸畏己而走也,以為畏狐也。

    Look how much longer the still-all-hanja Mandarin version is:

    老虎寻找各种野兽吃掉他们,抓到(一只)狐狸。狐狸说:“您不敢吃我!上帝派遣我来做
    各种野兽的首领,现在你吃掉我,是违背上帝的命令。你认为我的(话)不诚实,我在你
    前面行走,你跟随在我后面,观看各种野兽看见我有敢不逃跑的吗?”老虎认为(狐狸的话
    )是有道理的,所以就和它(一起)走。野兽看见它们都逃跑了。老虎不知道野兽是害怕
    自己而逃跑的,认为(它们)是害怕狐狸。

    A Korean (all hangul or mixed script) version may be even longer still, but of course Mandarin doesn’t use as many particles as Korean does.

    • To me one problem with your approach (which I find interesting and worthwhile) is that Chinese characters tend to have a higher density of strokes per square when compared to Hangul or Latin letters. This means that comparing the surface area of ‘equivalent’ texts (of which there are several kinds: translation, transcription and so on) will neglect the ratio actually taken up by the characters and the respective optical size needed so a reader can readily discern all the pertinent details.

      Another question is what ‘granularity’ we should choose for comparing two scripts; you chose Hanja and Hangeul Jamo, but it could be argued that, as for the Chinese characters, their constituent parts should be chosen (although there does not exist an exhaustive list of these; they should number several hundreds up to a thousand or so), whereas for Hangeul, not the single Jamo, but the composite syllables should be chosen (of which there are roughly 11000 I think).
