A Quick, Informal Statistical Analysis on Analects of Confucius



As I have been working on the Classical Chinese primer, one question I have been pondering is how many Chinese characters (漢字, 한자) one should memorize before embarking on learning Classical Chinese (漢文, 한문). The Korean Hanja Proficiency Exam (漢字能力檢定試驗, 한자능력검정시험) specifies that people should learn at least up to the first rank (一級, 1급), or 3,500 characters, to read Classical Chinese “without difficulty.” This is a somewhat subjective judgment, and depends on how willing a reader is to look up characters that he does not know while reading Classical Chinese texts. I like to conceptualize this in more mathematical terms:


where p(x) is the probability that the reader does not know character x, C(x) is the cost (e.g., time and effort) the reader is willing to spend on looking up each character x, and T is the threshold at which the reader will “give up” on finding all the characters. The equation as a whole states that a reader will be willing to look up the character as long as the cost and probability of doing so do not exceed the threshold. C(x) and especially T are highly subjective, and depend on the individual reader. p(x), on the other hand, is not. I was interested in seeing what p(x) looked like, and how I could interpret it.
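One way to write the relation just described, using only the symbols p(x), C(x), and T defined above, is sketched here. How probability and cost combine is an assumption; a product is used because it reads as the expected lookup cost for character x.

```latex
p(x) \cdot C(x) \le T
```

Under this reading, a character the reader almost certainly knows (small p(x)) stays under the threshold even when looking it up would be expensive.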


I had some downtime over Easter, and decided to code a very short script to determine this. The pseudo-code is very simple, and is as follows:

  1. Load file with Classical Chinese source text.
  2. Remove all the punctuation, spaces, new lines, et cetera.
  3. Count the number of times a particular character occurs in the source text.
  4. Output data.
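The four steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the original script: the function names and the sample string are hypothetical, and filtering by Unicode letter category is one assumed way of removing punctuation and whitespace.

```python
import unicodedata
from collections import Counter


def count_characters(text: str) -> Counter:
    """Count each character, skipping punctuation, whitespace, digits, and symbols."""
    kept = [
        ch for ch in text
        # Unicode categories starting with "L" are letters; Han characters are "Lo".
        # Assumption: the source text contains no non-Chinese letters to exclude.
        if unicodedata.category(ch).startswith("L")
    ]
    return Counter(kept)


def load_and_count(path: str, encoding: str = "utf-8") -> Counter:
    """Step 1 plus steps 2-3: read a text file and count its characters."""
    with open(path, encoding=encoding) as f:
        return count_characters(f.read())


# Step 4: output the data for a short sample (the opening of the Analects).
sample = "學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？"
counts = count_characters(sample)
print("total characters:", sum(counts.values()))
print("different characters:", len(counts))
for ch, n in counts.most_common(3):
    print(ch, n)
```

Calling `load_and_count` on a file holding the full text would give the total and distinct counts in the same way.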

The Classical Chinese source text chosen was Analects of Confucius, Annotated by Zhu Xi (論語集註, 논어집주). I believed that this was very representative of Classical Chinese texts, as many people learning Classical Chinese at the very least read Analects unannotated.

Data & Analysis

The total number of characters in the Analects is 80,964. There are 2,373 different characters. Sorting from the most frequent to the least, the top 20 most frequent characters are:

Table 1 – Top 20 Most Frequent Characters in Analects

This result should not surprise anyone, as most of these characters serve common grammatical functions and thus are likely to appear quite frequently. For instance, from experience, 也(야) occurs at the end of sentences very often. It is no surprise that Table 1 reflects this. One curious result was how quickly the frequency dropped: from 3756 with 之(지) to 644 with 矣(의). This is further illustrated in the table below:

Table 2 – Assorted Characters

Table 2 shows a few characters in order of their frequency in the text. By the 150th most common character, the frequency has dropped into two digits. By the 850th, it has dropped to a single digit. By the 1,900th, it has dropped to 1.

Table 3 – Frequency & Percentage

In Table 3, characters that appear fewer than 1,000 times in the text account for 74.8% of it, those that appear fewer than 250 times account for 50%, and those that appear fewer than 50 times account for 22.2%.
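Percentages of this kind can be derived from the per-character counts. The sketch below assumes the counts are held in a `Counter`; the function name `share_below` and the toy numbers are hypothetical and are not the Analects data.

```python
from collections import Counter


def share_below(counts: Counter, threshold: int) -> float:
    """Fraction of the whole text made up of characters that occur
    fewer than `threshold` times."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    rare = sum(n for n in counts.values() if n < threshold)
    return rare / total


# Toy counts for illustration only.
toy = Counter({"之": 6, "也": 3, "雨": 1})
print(share_below(toy, 5))  # (3 + 1) / 10
```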


First, a caveat. This quick, informal analysis does have some weaknesses, particularly with the sample source text chosen. Some characters that are considered “easy” in resources for learning Chinese characters surprisingly showed up very few times in Analects. For instance, 雨(우) (“rain”) occurred only three times in the text. Perhaps the most fatal weakness is that there were not even 3,500 different characters in Analects. For future analysis, I would like to add other texts to increase the sample size.

As for the relationship between learning Classical Chinese and memorizing Chinese characters, the data suggest that readers should have an expansive knowledge of Chinese characters. Most notably, although characters that appear 10 times or fewer make up 6% of the text, those that appear 100 times or fewer make up 33%. In general, the less frequently a character occurs, the more it is considered “difficult.” I would presume that most readers who have not yet memorized less frequent characters would not want to be flipping through their dictionaries one-third of the time while reading through Analects, as this would exceed the threshold cost they are willing to endure. This data, although not perfect, may give an idea of where this threshold may be.
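To make the “one-third of the time” figure concrete, the same counts can be used to estimate how often a reader would meet an unknown character. The sketch below rests on a strong simplifying assumption: the reader knows exactly the N most frequent characters and nothing else. The name `lookup_rate` and the toy counts are hypothetical.

```python
from collections import Counter


def lookup_rate(counts: Counter, known: int) -> float:
    """Fraction of character occurrences that fall outside the `known`
    most frequent characters (assumed to be the ones the reader knows)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(known))
    return (total - covered) / total


# Toy counts for illustration only.
toy_counts = Counter({"之": 6, "也": 3, "雨": 1})
for n_known in range(4):
    print(n_known, lookup_rate(toy_counts, n_known))
```

Running this over the real counts for increasing values of `known` would show how quickly the lookup rate falls as more characters are memorized.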

  1. gbevers said:

    Apparently you had quite a lot of down time over Easter.

    Anyway, I am more interested in the graphic you posted than the statistics. The graphic shows the first page of a text explaining the “Analects” (論語) in Chinese. The annotations show that the writer did not expect all readers to understand the Confucian quotes. After the first quote, for example, the writer wrote 說悅同, which means “說 and 悅 are the same.” In other words, the writer assumed there would be at least some people who would not know the meaning of 說 in the context of the quote. He probably assumed that there would be at least some people who would see 說 and immediately think “to speak” (설) instead of “to be happy” (열).

    So, the Analects was even difficult for the Chinese to understand. That makes me feel better about my lack of understanding.

    • 歸源 said:

      In earlier Classical Chinese texts, they did not differentiate meanings of characters with radicals as much. 說 was pronounced 설 or 열, depending on the meaning. That said, I would highly recommend reading Confucian Classics (especially 詩經) with annotations. Annotations are easier to read, and give a more in depth interpretation.

      • Kuiwon, I have read the Analects with modern Korean annotations, but not with old Chinese. I had never seen this text before you posted it, or even heard of it, which is why I am so interested in it. Yes, I am fairly ignorant of 漢學. I need to take some time and read the “한학입문” text I bought in Korea before returning to the US to get a general idea of what is available. It has just been sitting on my bookshelf here because I have been more focused on learning to read 漢文 than on studying the historical books available.

        I cannot say the annotations are easier to read, but they are more fun and interesting for me to read because they not only help me practice my reading skills, but also show what the Chinese at the time considered to be difficult about the Analects. I had assumed, for example, that the average Chinese scholar would have known the meaning of 說 from the context without it being explained to him. For me that kind of stuff is interesting.

        I have already found a site, HERE to study the text you posted. Thanks for your post.

      • gbevers said:

        Thanks for the link, Kuiwon. I saw that post before, but only glanced at it, thinking that it was just a review of a book that I probably wouldn’t be interested in. However, today I read it and found it pretty interesting.

        By the way, I have recently received a book entitled 한문해석사전, by 김원준. It is supposedly a 虛辭 (허사) dictionary that defines 900 so-called 허사 and uses lots of example sentences. It is a big, expensive book with over 1500 pages. It looks promising, but I have not really got into it yet, so I cannot yet recommend it at the 61,750 won price Kyobo is asking. Have you heard anything about it?


      • 歸源 said:

        I actually do have that book. I was planning on doing a Book Review post on it, but haven’t got around to it.

      • gbevers said:

        Good. I am looking forward to reading that book review.

        One of the reasons I bought the book was that I didn’t know there were as many as 900 虛辭 and was curious to learn what they were.

  2. Gari Ledyard said:

    Hi Kuiwon, In your counting, did you include all the characters in both the main text and Chu Xi’s commentary? If so can you provide separate character counts for the 론어 and the 주희 commentary?

    • 歸源 said:

      Hello, in my counting, I included both the main text and commentary. The character count for the main text was 16,000. Subtracting this from the total, the character count for the commentary comes out to 64,964. I did not do any “cleaning” of the text, besides removing punctuation and the like, so these numbers are approximations.

  3. gbevers said:

    Kuiwon, I am getting impatient waiting for your next post, so I thought I might try to stimulate a little discussion. The second sentence of the notes to the first saying was as follows:

    學之爲言效也. (학지위언효야.)

    How should the above sentence be translated, specifically 言效? I have seen it translated as “to emulate” and “to imitate,” but then I wonder who is being imitated. There was no mention of a teacher. If one is studying by oneself, then there is no one to imitate, but if one is studying as young children used to study the “1,000 Character Classic,” by repeating after their teacher, then it makes more sense. Without dictionaries back during the days of Confucius, I guess the only way to learn to pronounce the characters, even for adults, was to repeat them after a teacher. Have you seen another translation for 言效?

    Next, should 時 be translated as “constantly” or “occasionally”? Korean translators seem to prefer “occasionally,” but if people really were repeating after their teachers back in the day, then I think that would have been a “constant” exercise.

    • 歸源 said:

      I apologize for the lack of updates. I have been fairly busy lately. I would parse the sentence as follows:

      學之爲 / 言 / 效也.
      Noun / Verb / Predicate

      The entire phrase should be translated as “The pursuit (爲, 위) of (之, 지) studying (學, 학) is to say (言, 언) to emulate (效, 효).” I would view 言 as separate from 效. I would think in this context “to emulate” means to model oneself after the Confucian sages.

      I would translate 時 as “at proper times” or “from time to time.” “Occasionally” doesn’t quite capture it.

  4. gbevers said:

    No, I think you mistranslated it, Kuiwon.

    Remember that Zhu Xi was interpreting the Confucian saying: 學而習之, 不亦說乎, so the 學之 (to learn it) was referring to the 學之 in the saying, and the 爲 is the verb “to be,” which can be translated here as “to mean,” resulting in the translation, “To learn it (學之) means (爲) to verbally (言) imitate (效也) [the teacher].”

    學而習之 is just an abbreviation of 學之而習之, which means “to learn (學) it (之) and (而) to practice (習) it (之).” In other words, Zhu Xi was explaining what Confucius meant by the phrase “to learn it” (學之), and the way he interpreted it was that “‘To learn it’ means to verbally imitate [the teacher].”

    Apparently, the way they learned back during the time of Confucius was to have the students repeat the pronunciation of the characters after the teacher had read them, a method of teaching we still use today. Then, after the students had learned to pronounce the characters, they were expected to practice reading them on their own while the teacher looked on, whacking with his stick anyone who made a mistake. That process of learning was apparently something Confucius thought was very enjoyable.

    • 歸源 said:

      Although your translation isn’t grammatically incorrect, I wouldn’t interpret it that way. In Neo-Confucian annotations, the construction A言B也 (A is to say B) is fairly common. Other common patterns include A者B也 (A is B) and A猶B也 (A is similar to B). My first intuition would be to read it the way I translated it.

      • gbevers said:

        Sorry, Kuiwon, I misread what you wrote. I thought you had written “isn’t grammatically correct.” You confused me with that double negative. 🙂

      • 歸源 said:

        No problem. It’s just that although your interpretation is grammatically correct, it’s not common to see 爲 as the copula (to be) in annotations. So my first instinct would be to see 言 as the copula verb. The Korean translation I found online is “학이란 본받는 것을 말한다” which is closer to my interpretation.

        Also, I would think of 也 as just a sentence termination marker. It can be translated -인가? (interrogative), -이다 (copula), or sometimes 때문이다 (causal).

      • gbevers said:

        Kuiwon, on page 73 of Pulleyblank’s book “Outline of Classical Chinese Grammar,” under the topic heading “Other Particles Marking Topicalization or Contrastive Exposure,” 也 is the first particle listed. It says, “The use of 也 in these constructions is illustrated in such examples as 249 above. It is found especially, as there, when the topic phrase is a nominalized verbal phrase.”

        I think the Korean translation also fits my translation. The 학이란 is 學之[也]. The 이란 would be the translation of the omitted 也. The 말하다 is not a translation of the 言, but a translation of the 爲. As I wrote, you can translate 爲 as “to be” or “to mean.” 말한다 translates it as “to mean.” In your translation, you had trouble figuring out how to explain 爲, didn’t you? The 言 (verbally) acts as the adverb for the verb 效 (imitate), meaning that one should “imitate verbally,” not physically. The Korean translation just assumed people would understand it to be verbal imitation, which is why it was not translated. A better translation in Korean would be

        Below is a good translation of the first saying and its annotation. The only thing they forgot to translate was 說悅同, which means “說 and 悅 (열) are the same.” This is saying that 說 should be interpreted as “pleasure” (悅), not “to speak.” As you know 說 can mean both “pleasure” (열) and “to speak” (설). 悅 (열) is the more common word for “pleasure,” which is why it was used to define 說 (설/열). A more precise Korean translation would be “학지란 말로 흉내내는 것이다.”

        1.1 1. The Master said, “To continually practice what you have learned – is this not a pleasure?

        “To learn” means to imitate. People’s natures are all good, but in becoming aware there are the earlier and the later. Those who become aware later must imitate what those who become aware earlier do. Then they can become enlightened about goodness and return to their start. “Practice” is like the repeated flapping of a bird’s wings. Learning never ceases like a bird repeatedly flaps its wings. When one has learned and then continually practices it, then what one has learned becomes familiar and there is pleasure in one’s heart. One’s progress naturally could not stop! Cheng said, “One repeatedly reflects on it and immerses oneself in it, hence one is pleased.” He also said, “Learners must practice it. If one ‘continually practices’ it, what you have learned will really be in you. Hence, you will be pleased.” Xie Liangzuo said, “To ‘continually practice’ is for there to be no time that one does not practice. ‘To sit gravely’ is to practice it while sitting. ‘To stand at attention’ is to practice it while standing.”

        PDF file

      • gbevers said:

        I meant to write “학지란 말로 흉내내는 것이다.” in the paragraph above the one where it now appears. I got confused.

      • 歸源 said:

        Actually, we might both be wrong. I have a set of books with very literal Korean translations of the Confucian Classics. That set and other Korean translations online parse it like this:

        學之爲言 / 效也.
        NP / VP

        言 is being modified by 學之爲, and is part of the noun phrase 學之爲言. The translation it gives is: “배운다는 말은 본받는 것이다.” That book also explains that 言 (word / to say) should be viewed as 字 (character), and 學之爲言 should be viewed as 學字. 爲 merely means “to pursue” or “to do.” A之B constructions, as you know, can be translated as B of A or A’s B.

        Although grammatically correct, I would still hesitate on using 言 adverbially (i.e., “verbally”). Both the Korean and the English translations you cited don’t do so. I’m assuming this is the first exposure you have to Confucian annotations. If you do a search for the use of “言” in annotations, you will find plenty of examples of A言B也. In addition, in Neo-Confucianism, in sum, every person’s goal is to attain sagehood. It wouldn’t make sense to just stop at emulating the sages verbally. For instance, they followed the three year mourning period mentioned in the Analects.

        Moreover, I’m only a hobbyist, and am learning as you are. I’m not an academic in these matters. I think there might be textual critical analyses of Confucian classics. I have another Korean book that includes some of them. If you want to get into the weeds of this, go there.

      • gbevers said:

        He was not talking about imitating the sages verbally; he was talking about repeating after one’s teacher, explaining what “learning” meant. He was defining the vocabulary in the sentence, as you and I are trying to do here, and trying to explain exactly what Confucius was saying. And what Confucius was essentially saying was that learning is fun. So, Zhu Xi was explaining what Confucius meant by “learning” and “practicing.”

      • gbevers said:

        By the way, I like getting into the weeds, but I think we have pretty much beat this horse to death, in regard to this particular Confucian saying. The annotation is another question. I copied and pasted the translation of the annotation, but there are parts of that translation I disagree with and parts that seem to have been left out. We can still discuss the annotation if you are interested.

        Also, I do not consider myself a hobbyist because I have been spending quite a bit of time trying to teach myself literary Chinese over the past few years. Therefore, I would call myself “a self-taught student.” I would love to study under a teacher, but there are no teachers I know of in this part of Texas, especially teachers who teach with the Korean pronunciation of the characters.

        In the next few weeks I hope to be self-publishing a book that I hope will help a lot of Korean language students learn to read classical and literary Chinese.

        The problem I have found trying to learn literary Chinese from Korean texts is that it is not very efficient since the Chinese language is more like English than Korean in many ways. For example, when you translate Chinese into English, you can pretty much translate from left to right and get a good sense of the sentence, but since Korean puts the verb at the end of the sentence, it slows down interpretation, though there are parts of Chinese grammar that can be better explained with a Korean expression, such as the relative clause.

        Also, most of the English texts that teach literary Chinese teach it with Chinese pronunciations, something I do not feel like learning since I will be reading the Chinese, not speaking it. Therefore, Korean pronunciations work just fine for me, especially since my ultimate goal is to read and translate old Korean texts, not Chinese.

        Another problem with both the English and Korean texts is that they do not seem to be very systematic in teaching beginners to read and build vocabulary. For example, I have Paul Rouzer’s book, “A New Practical Primer of Literary Chinese,” which, as implied in the title, claims to be a primer, but by Lesson 11, he assumes you have enough understanding to translate the Chinese passages on your own and also no longer need practice exercises. He also gets lazy and provides what he calls “Vocabulary Hints,” which are just characters with reference numbers next to them. That means you have to thumb through his bulky book to find the definitions of those characters. Also, his readings are too full of the names of people and places, which means you are being introduced to too many relatively obscure characters and not enough of the more functional characters. Finally, the English and Korean texts do not seem to have enough repetition and review, which a dummy like me needs to learn characters and build vocabulary.

        So, since none of the texts on the market completely suited my needs, I decided to make my own textbook based on an old, out-of-copyright text. If I could steal the title, I would call my book, “Literary Chinese for Dummies” since I only assume the reader knows how to read the Hangeul pronunciations and knows the basics of writing Chinese characters.

        I am hoping that my book will be the book that Korean-Americans and English-speaking, Korean Language Learners will use to teach themselves to read classical and literary Chinese. I am also hoping that you get the book, Kuiwon, and tell me and the world what you think of it.

      • 歸源 said:

        Yes, I would be very interested in such a work. I was actually thinking about compiling a work like that of my own into an e-book. I already have a few posts on Classical Chinese grammar. I have found English works on Classical Chinese generally much lacking compared to Korean works. I have noticed quite a few Korean works compare Classical Chinese grammar to English — and even German — because of word order.

      • gbevers said:

        I hope you do publish something, Kuiwon. We definitely need more English books that discuss it with Korean pronunciations. One good thing about the Rouzer book is that he does provide Korean translations for the character list, along with the Chinese.

  5. gbevers said:

    Kuiwon, I don’t understand why you think my translation is grammatically incorrect. “A爲B也” is also a sentence structure. It is the same as an AB也 structure, but 爲 was most likely used here to clarify that the A is 學之, not 學之言. The A here is “to learn it” (學之), and the B is “to verbally (言) imitate (效),” or you could translate it as “to speak (言) [and] imitate (效).” If I added the topic marker 也 to 學之, it might be clearer to you.

    學之也爲言效也 is the sentence with 也 added: “To learn (學) it (之也), is (爲) to verbally (言) imitate (效也).”

    The 也 works like the topic marker 는 in Korean.
