The Future for Chinese/English Dictionaries

RIDICULOUSLY LONG ARTICLE AHEAD: I’ve had this stuff in mind for about 6 years and I finally just needed to get it off my chest. I don’t imagine future posts will ever be as long as this.

 The Future of Chinese/English Dictionaries

Albert Wolfe, October 2011

Intro

This blog is usually for us “Old Hundred Names” (common folk) learning Chinese, but this article is different because I’m appealing for help and trying to cast a vision for the future. Unlike my usual posts that have something immediately applicable for learners of Chinese, this article is one step removed and talks about a gap in the materials, something we learners of Chinese need but can’t get for ourselves: a complete and useful Chinese-English/English-Chinese (C/E) dictionary.

In this article I’m going to describe the ideal “Super Dictionary” of the future in hopes that people who can bring it about (which includes us Old Hundred Names, as you’ll see) can make it so.

This article is inspired by:

  • Six years of frustration with the currently available C/E dictionaries.
  • MDBG: the closest thing I’ve seen to the community project required to solve the problem.
  • The Longman English Dictionary: a model for the sort of information a foreign language learner needs in a dictionary.
  • The advent of CreateSpace: high-quality, super cheap, print-on-demand self publishing.
  • Skills 4, 9, and especially 10 from The Institute for the Future’s “Future Work Skills 2020 (pdf)” report.

Outline

The Problems

Not Complete

Missing Entries and Definitions (E→C)

Missing Entries and Definitions (C→E)

Missing corresponding words (C↔E)

Not Useful

E→C: Can’t find the word everyone uses

E→C: Too Many Choices (and Different Dictionaries Don’t Agree)

C→E: Too Many Definitions

The Problem of Pages

The Solutions

What We Need

How We Can Get It

Challenges

Action

The Problems

We learners of Chinese turn to dictionaries to answer the following two questions:

  1. What does  (Chinese word)  mean in English?
  2. How do you say  (English word)  in Chinese?

Problems trying to find the answer to question number one, going from Chinese to English (C→E), are less common than problems with question number two. But there are still a disturbing number of times when the dictionaries we consult aren’t complete enough to give either the appropriate definition of a Chinese word or the word itself. There are also some English words that do not appear as headwords in the dictionaries, making an E→C search frustrating.

The majority of our problems with C/E dictionaries come from trying to answer question number two, going from English to Chinese (E→C). We cannot trust that the Chinese definitions given in our dictionaries are the appropriate ones to use. Furthermore, cross-checking multiple dictionaries often yields different words rather than confirmation of certain words. So we are at best insecure about our word choices and at worst secure but wrong. We need a more complete and more useful dictionary.

Not Complete

Missing Entries and Definitions (E→C)

  • Common words and phrases. “Velcro” and “grade on a curve” are missing from every dictionary I’ve ever seen. (Velcro has been added to MDBG but I still don’t know how to say “grade on a curve”, and I’m even required to do it at my college here in China. They just explain it in terms of the number of As, Bs, etc. I’m allowed to give.)
  • Special / technical terms. For example, medical terms seem to be getting slowly added to the online dictionaries. I needed a new Albuterol inhaler and had to figure out how to say it without the aid of a dictionary. (Here’s the story of how I found out and then added it to MDBG).

Missing Entries and Definitions (C→E)

  • Common words or definitions. “Yàobù” 要不 can mean: “How about (we do something)…?” That particular meaning doesn’t appear in any of my paper dictionaries and only appears in sentence examples at Nciku but not in the definitions. (It has since been added to MDBG.)
  • Proverbs / idioms. One of my English students asked me how to say “yīn ài chéng hèn 因爱成恨” in English (so we know it’s commonly known and used, not just some ancient literary phrase). I didn’t know so we looked it up. None of the online dictionaries nor my paper dictionaries, including a specialty proverbs/idioms dictionary, had it, yet it got over 2 million hits on Google. (I’ve since added it to MDBG.)
  • New slang. Of course, every language is evolving. That’s why adding new slang like “gěilì” 给力 is essential for us to keep up with the modern usage of Chinese.

Missing corresponding words (C↔E)

Paper dictionaries have a special problem that online dictionaries don’t have to deal with because of the nature of online searches.

I’m going to pick on the Oxford Minidictionary (affectionately known as “Chubby”) for a moment. Here’s the entry under “shower” in the E→C side:

“n (for washing) línyù 淋浴; a shower yí gè línyù 一个淋浴;to have a shower xǐ línyù 洗淋浴(rain)…”

And then it moves on to rain, which we’re not interested in right now.

Now the problem is that in the C→E side you can find this entry:

“洗澡 xǐzǎo vb to have a bath/shower”

So, why isn’t “洗澡 xǐzǎo” listed in the E→C side? It’s an oversight, that’s all. This is just one of many examples of inconsistent internal cross-references that occur in every paper C/E dictionary I’ve ever used (see another example).

Not Useful

Even though incomplete dictionaries are frustrating because we can’t find a word we’re looking for, the problem of usefulness is much more urgent. When we want to find how to say an English word in Chinese, even if the dictionary contains an entry for the English and Chinese, we cannot trust that the word we get is the right one.

E→C: Can’t find the word everyone uses

The Problem of “Shower”

First of all, “xǐ zǎo” 洗澡 is definitely the most-used word for “to take a shower / to bathe” all over the country. I can even remember a joke (one of the few I understood) from the Spring Festival Variety Show (春节晚会) a few years ago when one of the actors used it. There are other ways to say it, and the noun and verb form are different (as confirmed anecdotally by this post), but I’m convinced that “xǐ zǎo” 洗澡 is the word to use.

So we need that to be indicated in the dictionaries. Remember, the chubby little Oxford Minidictionary only gives “xǐ línyù” 洗淋浴 and I’ve never heard that used once. My shower post on this blog confirms that “xǐ zǎo” 洗澡 is missing from other dictionaries. Yet it is the word everyone seems to use for something they do every day. That makes the dictionaries useless for answering the question, “How should I say ‘shower’ in Chinese?” And there are many other examples just like “shower.”

The Problem of “Go”

My friend Brad asked me a question his first month in China that perfectly illustrates another aspect of the learner-unfriendliness of our dictionaries.

“Hey, how do you say ‘go’? You know like, ‘I’m going now.’”

I explained that “go” in English can be translated into many different words in Chinese, but in this situation “zǒu” 走 is the best.

He couldn’t find the answer himself in the dictionaries. I checked the two little dictionaries I always recommend, “Chubby” and “Lenny” (the little Langenscheidt), and neither helped.

Lenny gives “qù” 去 first (which would be used in “I’m going to China”). Next is “líkāi” 离开 which would work for Brad’s example, but isn’t as common as “zǒu” 走, which appeared halfway down the list under “I must be ~ing”. The translation is good: “wǒ děi zǒu le” 我得走了. But there is nothing to indicate which of the four words is “go.”

Chubby gives almost a full two pages to “go” and various collocations of the word, arranged in alphabetical order rather than by usefulness or frequency. First is “go across” chuānguò 穿过 and then “go after” (physically) zhuī 追. “Zǒu” 走 makes a few appearances, but is buried among such nuggets as “go off (when talking about food becoming bad)” biànzhì 变质.

Online dictionaries don’t help much either. A search at MDBG for “go” gives 100 results (on the first page). “Qù” 去 is number 9 and “zǒu” 走 is number 15. A search at nciku shows 去 first (no pinyin until you hover your mouse) and then “zǒu” 走 after about 20 other entries.

Poor Brad. He was deluged with information and the dictionaries gave him no guidance as to what information is more or less important. The dictionaries need to be sorted for learners in terms of usefulness rather than just alphabetically. Since they aren’t, he really couldn’t find the word he needed without asking for help.

E→C: Too Many Choices (and Different Dictionaries Don’t Agree)

The Problem of “Spoon”

Finding out how to say “spoon” is hard too. Just to clarify, this is not a Western invention that is rare in Chinese restaurants (like the fork). I’m not demanding the Chinese come up with and agree on a word for a strange foreign object. I’m talking about the sort of spoon that every single restaurant in China brings every customer automatically with soup or fried rice, and has been doing so for centuries.

Let me show you the entries for “spoon” in the various dictionaries I’ve got lying around:

The little “zi” and “r” endings may not be as important as the other differences. But still, how should I, a learner of Chinese, go ask the waitress to bring me a second spoon at the restaurant tonight? And what’s the difference between all of the choices? Are some more “correct” than others? Judging from the comments, the differences seem to be largely regional. So which word is most likely to be understood no matter where I go in China? These are questions that I can’t find the answers to.

The Problem of “Bus”

Sometimes, the differences between the words are more about usage than region. Take “bus” for example. I’ve heard all the following words used for “bus” in the same general area:

The word you choose depends on the situation you’re talking about. For example, “bus station” uses qìchē 汽车 but “city bus” is usually gōngjiāo chē 公交车. To be truly useful, the dictionaries need to explain the usage differences and give example phrases or sentences to illustrate the differences.

C→E: Too Many Definitions

Especially in online dictionaries, where no editing choices need to be made to save pages, there are often just too many definitions for a single Chinese headword. As a result, the guys at Skritter have started trimming down MDBG’s CC-CEDICT database, which Skritter uses for their excellent writing training site. They’ve ended up creating their own version of the MDBG dictionary—one that they feel is more useful to learners.

For example, here is MDBG’s definition for “jiǎ” 甲:

jiǎ 甲: first of the ten heavenly stems 十天干[shi2 tian1 gan1] / (used for an unspecified person or thing) / first (in a list, as a party to a contract etc) / armor plating / shell or carapace / (of the fingers or toes) nail / bladed leather or metal armor (old) / ranking system used in the Imperial examinations (old) / civil administration unit (old)

And here’s Skritter’s definition:

jiǎ 甲: one; armor (1st Heavenly Stem)

For beginners, I think Skritter’s is much more useful. I would suggest adding “nail (finger or toe)” to Skritter’s as well. But the point is: Skritter has helped the learner sort through the huge volume of information by simply removing what they feel is less important.

Just to give MDBG a break, the goal of MDBG and the CC-CEDICT database behind it is to provide a one-way, Chinese-English translation tool with the most complete English definition list possible (including all the ancient meanings). But for learners of Chinese, it’s not as useful as we’d like.

There is a solution that can meet both goals: weight the definitions for usefulness rather than removing them. I’ll explain how in the next section.

But sometimes the definitions are so out of date they’re laughable and need to be edited rather than just weighted. For example, compare the following:

(MDBG) tāo tāo bù jué 滔滔不绝: unceasing torrent (idiom) / talking non-stop / gabbling forty to the dozen

and

(Skritter) tāo tāo bù jué 滔滔不绝: (saying) talking non-stop; gushing; torrential

I think it’s obvious that the Skritter dictionary’s definitions are more appropriate for this century. In this case I would recommend changing the MDBG definitions rather than just weighting them.

The Problem of Pages

It would be impractical to include every Chinese and English word in a printed dictionary (see what happened when a UK student printed some of Wikipedia). Editors have to be selective. But before that selection can happen, all the data and definitions need to be compiled.

I understand that printers of dictionaries such as my old favorite the Oxford Minidictionary have to pick and choose what to add. But I have the distinct feeling that those choices are not made very scientifically. For example, consider this entry that made it into the 633 pages of the little dictionary:

showjumping n qímá yuè zhàng yùndòng 骑马越障运动

And yet headwords such as “similar” are missing.

How are the decisions made about what to include in paper dictionaries? Could it be that one of the editors, Boping Yuan or Sally Church, was interested in showjumping? I’m interested in a more scientific approach involving sorting huge amounts of data for frequency, popularity, and usefulness to inform the choices of what to include or not.

To do that, we need to make the core database for the Super Dictionary an online resource. Online dictionaries have the advantage of virtually limitless space that anyone on the planet can access at any time. At the time of this writing, there is still no single database that contains all known words and phrases.

Of course, there’s no way to ever have a 100% complete Chinese-English dictionary because of the nature of language change. There will always be new slang and new terms coming out. But we haven’t even got the old ones all compiled into one place yet.

Once the Super Dictionary is reasonably complete, the task becomes sorting and arranging the information and definitions into the most useful, learner-friendly format possible. Then various printed books can be produced if there’s interest.

So how can we 1) get that information, 2) arrange it in order of usefulness? I’ve got a few ideas.

The Solutions

Disclaimers:

  • I’m a horrible business man. (As proof: there are no advertisements on this blog and I give all my music away for free.) I don’t have any plan for how I, personally (nor anyone else, for that matter), can make money off the community project described below. I don’t claim any ownership of the ideas, data, information structure, or processes described below. All I care about is getting the information to the masses (and going on record as being the one to propose this project). If I can participate in the project in some way, I’d be delighted.
  • I’m a volunteer editor for the MDBG dictionary (although I’ve been pretty uninvolved recently).
  • I’ve recently published a book with CreateSpace at my own expense (coming out soon).
  • I have no other affiliations with any of the other individuals or companies mentioned below. None of the individuals or companies mentioned (including MDBG and CreateSpace) have agreed to participate in a project like the one described below, nor have any individuals or companies expressed any approval or endorsement of the ideas presented in this article. They are simply cited as examples of popular online services that could be included in a community project in the future if they agreed to do so.

What We Need

Look at all the info the Longman Dictionary gives English learners who look up “drink”.

There are more definitions and sentence examples that I cropped off.

This serves as an excellent model of a very useful learner’s dictionary. We need something equally useful for our Chinese/English Super Dictionary.

1. Pronunciation

We need pinyin (and variations for the pinyin) for every word and sentence example. Many dictionaries neglect the pinyin for sentence examples. I’ve never quite been able to figure out why they give pinyin for the headword, but leave it out for the usage examples.

2. Part of Speech

“Chinese words don’t have parts of speech” is a common myth. Sure, some words function as verbs, nouns, and adjectives, but that’s true in English as well. And sometimes there is a clear difference in Chinese. For example, “héshì” 合适 (meaning “suitable”) usually functions as an adjective and “shìhé” 适合 (meaning “to be suitable for”) as a verb.

Chinese parts of speech may not directly correspond to English ones but we need them labeled anyway. For example, Chinese particles like “le” 了, “ba” 吧, “ne” 呢, etc. don’t have English equivalents. But we still need them indicated as “particle” (or something) in the dictionary.

Adsotrans seems to have the most data about parts of speech at the moment, but it still needs work and could benefit from a team of users and editors constantly improving it in the way I’m suggesting. For example, at the time of this writing the two entries for “jiǎ” 甲 (meaning “1st, finger nail,” etc.) list the parts of speech as “NOUN” and “HEAVENLY,” respectively.

3. Extra Info about Part of Speech

Learners of English need to know if a verb is transitive or intransitive, how the verb is conjugated into different tenses, and whether a noun is countable or uncountable. Learners of Chinese need to know, for example, what measure words are associated with which nouns, and which category a verb falls into (“stative, activity, achievement” are the three categories given by Claudia Ross).

4. Frequency Data

The Longman entry for “bicycle” has an icon meaning it’s in the top 3000 written words, yet “bike” has an icon indicating it is in the top 2000 spoken words. So English learners can assume that “bike” is less formal and “bicycle” is more formal. Now the decision of which synonym to use in which context has been made easier: use “bike” when talking to your friends and use “bicycle” when writing a business contract or even a note.

Something like that would certainly help us sort through the ocean of synonyms in Chinese.
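To make the idea concrete, here is a rough sketch of how a ranked word list could be turned into Longman-style bands. The function name, the “S”/“W” labels, and the ranks are all my own assumptions for illustration; I don’t have real frequency ranks for any Chinese words.

```python
def band_label(rank, channel):
    """Return a label like 'S2' or 'W3' for a 1-based frequency rank.

    channel is 'spoken' or 'written'. Ranks beyond 3000 get no label.
    """
    prefix = {"spoken": "S", "written": "W"}[channel]
    # Band 1 = top 1000, band 2 = top 2000, band 3 = top 3000.
    for band, limit in enumerate((1000, 2000, 3000), start=1):
        if rank <= limit:
            return f"{prefix}{band}"
    return None


# Made-up ranks, only to mirror the bike / bicycle example above.
print(band_label(1500, "spoken"))   # S2
print(band_label(2500, "written"))  # W3
```

A word could carry both a spoken and a written label, which is exactly the contrast a learner needs when choosing between synonyms.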

5. Sentence Examples

The Longman sentence examples are all extremely practical and useful. Compare them to the sentence examples given for “hē” 喝 (drink) at JuKuu and nciku. My favorites from the first few entries are:

7. He is fond of a dram. 他喜欢喝一点酒。(Jukuu)

1. We can’t solve these problems through arranging dinners and parties.  这些问题不是靠吃吃喝喝就能解决的。(nciku)

Not only do these examples fail to use the word “drink,” notice that pinyin is not given at JuKuu and is only available on mouse hover at nciku (see number 1 in this section).

There is a new community project called tatoeba that links sentence examples from a bunch of languages together. They seem to be much more on the right track. The first three sentence examples from this search for “drink” give excellent usages of the word “hē” and provide pinyin and hanzi.

6. Other Info (not pictured)

In addition to the examples in the above image from Longman, we learners of Chinese need the following information in our Super Dictionary:

  • Regional differences in usage and pronunciation (蜗牛 wōniú / guāniú [Taiwan] = snail).
  • Category and meta tags (chuānghu 窗户 = window [building, vehicle]; chēchuāng 车窗 = window [vehicle]; shìchuāng 视窗 = Windows [computer operating system]).
  • Formality tags (qīzi 妻子 = wife [formal]; lǎopó 老婆 = wife [informal]).
  • Traditional / Simplified variants for characters (中国 [simplified] / 中國 [traditional]).
  • R-hua (érhuà) 儿化 variants and whether that changes the meaning (tóu 头 = head / hair; tóur 头儿 = leader).
  • HSK levels and TOP levels for learners who are preparing for those tests.
  • Literal breakdown of hanzi (fēijī 飞机 = airplane [fly machine]).
  • Radical breakdown of characters ( = + ) and what each radical means (if anything).

MDBG and nciku have many of those things. But regional differences, frequency (or even popularity) ratings, and all the parts of speech and category meta tags are still either completely absent or incomplete.

It’s too much to ask any one individual, or even any one company, to do all this work. That’s why I propose a system of uniting everyone to collect and constantly improve the data we all need.
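For the programmers reading, here is one way a single Super Dictionary entry might hold the kinds of information listed above. This is only a sketch: every class and field name is something I made up, not the format of MDBG, nciku, or any other existing dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    """One English meaning of a headword (all field names are hypothetical)."""
    gloss: str
    part_of_speech: str = ""   # e.g. "noun", "verb", "particle"
    register: str = ""         # e.g. "formal", "informal"
    categories: list[str] = field(default_factory=list)  # e.g. ["vehicle"]


@dataclass
class Entry:
    """One headword with the extra information learners need."""
    simplified: str
    traditional: str
    pinyin: str
    definitions: list[Definition] = field(default_factory=list)
    regional_pinyin: dict[str, str] = field(default_factory=dict)  # region -> pinyin
    literal: str = ""          # character-by-character gloss


snail = Entry(
    simplified="蜗牛",
    traditional="蝸牛",
    pinyin="wōniú",
    definitions=[Definition("snail", part_of_speech="noun")],
    regional_pinyin={"Taiwan": "guāniú"},
)
wife = Entry(
    simplified="老婆",
    traditional="老婆",
    pinyin="lǎopó",
    definitions=[Definition("wife", part_of_speech="noun", register="informal")],
)

print(snail.regional_pinyin["Taiwan"])  # guāniú
print(wife.definitions[0].register)     # informal
```

Frequency data, sentence examples, and HSK levels could be added as further fields in the same way.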

How We Can Get It

Because of the complexity of the project I’m envisioning, I’ve created a little diagram to use as a guide for discussing the various points.

1. Users register for Super Dictionary

Currently, the main online dictionaries don’t require registration for use. I think that’s fine. But there should be an opt-in system for allowing users to help improve the dictionary. As long as users understand the reasoning behind the registration and have assurances that private data will be protected, I think people would be willing to pitch in to improve the world of Chinese learning.

NOTE: This diagram is only showing what Chinese learners would do. But getting native Chinese speakers (learners of English) involved is essential too. Maybe each user’s first language (and region) should be part of the registration process to round out the data. I haven’t thought that all through yet.

2. They are associated with a region

The regions on the map (image credit) are simply an example. Rather than choose pre-determined regions and ask users to “pick the kind of Chinese you want to learn,” I think it would be better for users to agree to say where they are learning Chinese and then let the computer extrapolate regions from the data later. For example, if users say which cities they’re in, the computer can then look for patterns and see which cities (and then larger regions) use which words.

Also, users who are not in a Chinese-speaking environment should indicate which area they are most exposed to. For example, if my wife were from Beijing, I’d list my region as “Beijing” even though I might be living in Denmark. If my Chinese teacher at California University is from Guangzhou, I’d put “Guangzhou” as my region.

It wouldn’t be completely useful all the time, but the important thing would be to start collecting the information and then let the computer start looking for patterns.
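Here is a very small sketch of what “let the computer look for patterns” could mean at its simplest: count which word users in each city report, and see which one wins. The function names and the reports are invented for illustration; a real system would need far more data and smarter statistics.

```python
from collections import Counter, defaultdict


def tally_reports(reports):
    """Count how often each Chinese word was reported per (city, English word)."""
    counts = defaultdict(Counter)
    for city, english, chinese in reports:
        counts[(city, english)][chinese] += 1
    return counts


def most_reported(counts, city, english):
    """Return the word reported most often in a city, or None if there are no reports."""
    tally = counts.get((city, english))
    if not tally:
        return None
    return tally.most_common(1)[0][0]


# Invented reports, for illustration only.
reports = [
    ("Taipei", "snail", "guāniú"),
    ("Taipei", "snail", "guāniú"),
    ("Taipei", "snail", "wōniú"),
    ("Beijing", "snail", "wōniú"),
    ("Beijing", "snail", "wōniú"),
]
counts = tally_reports(reports)
print(most_reported(counts, "Taipei", "snail"))   # guāniú
print(most_reported(counts, "Beijing", "snail"))  # wōniú
```

Cities whose winners agree across many words could later be grouped into regions, instead of choosing the regions ahead of time.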

3. Their usage data goes into the database

Raw Popularity Weighting

The system should watch search queries and the user won’t have to do anything different for the computer to start learning. For example, I’m in Guangzhou and I search the dictionary for “pāituō” 拍拖. The computer should remember that someone searched for “pāituō” 拍拖 and also that the person was in Guangzhou.

Then, the user can help even more. When I see the results for “pāituō” 拍拖, I realize that the context that I heard it in was “to court” or “to date”. So I can click a little link on that definition that tells the computer: “This is the meaning I heard for that word.” So the computer can start to figure out that “to court” is a popular definition for “pāituō” 拍拖.

This could also help show where new phrases start from. For example, I’ve heard that “pāituō” 拍拖 is being used more and more all over the country. A system of regional tracking could corroborate the hypothesis that it originated in Cantonese-speaking Guangdong province.
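As a rough sketch of the bookkeeping involved, the class below remembers each search along with the user’s city, counts clicks on “this is the meaning I heard,” and then orders definitions by those clicks. The class and method names are hypothetical, and the counts are made up.

```python
from collections import Counter


class UsageLog:
    """Hypothetical store for passive search data and active confirmations."""

    def __init__(self):
        self.searches = Counter()       # (query, city) -> number of searches
        self.confirmations = Counter()  # (headword, gloss) -> number of clicks

    def record_search(self, query, city):
        # Passive data: the user just searches as usual.
        self.searches[(query, city)] += 1

    def confirm_meaning(self, headword, gloss):
        # Active data: the user clicks "this is the meaning I heard".
        self.confirmations[(headword, gloss)] += 1

    def ranked_glosses(self, headword, glosses):
        # Most-confirmed first; ties keep their original order.
        return sorted(glosses, key=lambda g: -self.confirmations[(headword, g)])


log = UsageLog()
log.record_search("拍拖", "Guangzhou")
log.confirm_meaning("拍拖", "to date")
log.confirm_meaning("拍拖", "to date")
log.confirm_meaning("拍拖", "to court")
print(log.ranked_glosses("拍拖", ["to court", "to date"]))  # ['to date', 'to court']
```

Notice that nothing gets deleted: less popular definitions simply sink lower in the list.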

Spoken vs. Written Popularity

It can go one step further by offering two icons, let’s say an ear and an eye. Now, if I heard “pāituō” 拍拖 in a spoken context (if it came up in conversation, for example), I click the ear icon above “to date”. So the computer learns that it should give that definition more of an oral weighting.

Conversely, when I look up “sell” in English, the computer makes note that “sell” has been searched for. Then I see the definitions and click the ear for “mài” 卖 and the eye for “shòu” 售. That means that “mài” 卖 is used more for oral Chinese and “shòu” 售 is more for written Chinese.

Of course, many times, especially when going from English to Chinese, users won’t know which word is used in more spoken or written contexts. That’s fine. You don’t have to click anything.
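The ear and eye clicks could be boiled down to a single number per word, something like “what share of the clicks were ears?” Here’s a minimal sketch of that, with a hypothetical class name and invented click counts.

```python
from collections import Counter


class ChannelVotes:
    """Hypothetical tally of ear (spoken) and eye (written) clicks per word."""

    def __init__(self):
        self.clicks = Counter()  # (word, icon) -> number of clicks

    def click(self, word, icon):
        if icon not in ("ear", "eye"):
            raise ValueError("icon must be 'ear' or 'eye'")
        self.clicks[(word, icon)] += 1

    def spoken_share(self, word):
        """Fraction of clicks that were 'ear', or None if nobody has clicked yet."""
        ear = self.clicks[(word, "ear")]
        eye = self.clicks[(word, "eye")]
        total = ear + eye
        return None if total == 0 else ear / total


votes = ChannelVotes()
for _ in range(3):
    votes.click("卖", "ear")
votes.click("卖", "eye")
votes.click("售", "eye")
votes.click("售", "eye")
print(votes.spoken_share("卖"))  # 0.75
print(votes.spoken_share("售"))  # 0.0
```

Words with no clicks return None rather than a guess, which fits the “you don’t have to click anything” rule.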

Missing Entries

If a user searches for something that’s not in the dictionary (for example “velcro” or “gěilì” 给力), the computer will show “no results found” but should send a note to the experts so they can determine whether it should be added or not. According to this, nciku already has some of this sort of system in place but not as sophisticated as the one I’m proposing.
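A sketch of the simplest version of this: every failed search gets counted, so the experts can see which missing words people want most. The function name and the tiny one-word dictionary are assumptions for illustration, not how nciku or anyone else actually does it.

```python
from collections import Counter


def look_up(dictionary, query, review_queue):
    """Return definitions for query; on a miss, count the query for expert review."""
    definitions = dictionary.get(query)
    if definitions is None:
        review_queue[query] += 1
        return []  # the user still sees "no results found"
    return definitions


dictionary = {"shower": ["xǐzǎo 洗澡"]}  # stand-in for the real database
review_queue = Counter()

look_up(dictionary, "velcro", review_queue)
look_up(dictionary, "velcro", review_queue)
look_up(dictionary, "给力", review_queue)
print(review_queue.most_common(1))  # [('velcro', 2)]
```

Sorting the queue by count means the experts see the most-requested missing words first.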

4. Experts add data common users can’t

Finding the experts will be the biggest problem. I think academic institutions and volunteers could be enlisted, but that’s its own issue.

Assuming for a moment there is a team of experts, they will serve mainly to correct and enhance the data that are coming from all the common users. The experts could be associated with a region as well to monitor how geography affects their decisions.

They would be able to add (or approve) tags to entries regarding any of the following:

  • Written variations (traditional / simplified, alternative characters, etc.)
  • Spoken variations (zhè / zhèi for 这; wōniú / guāniú for 蜗牛)
  • Formal / informal register tags (qīzi 妻子 [formal]; lǎopó 老婆 [informal])
  • Part of speech tags (héshì 合适 [adjective] ; shìhé 适合 [verb])
  • Sentence examples that are useful, common, and not too long.

BONUS: Corpus Data

What we really need is some objective data for written and spoken frequency. Longman put together their own corpus to get frequency data.

Frequency of single Chinese characters (zì 字) in written material (like newspapers) has been done for a long time (at least since 1993). But we at least need research on the top 1000, 2000, 3000 multi-character words (cí 词) both spoken and written.

I’ve recently become aware of Jun Da’s corpus data that does take into account multi-character words. Perhaps someone like Professor Hongyin Tao at UCLA or Dr. Richard Xiao at the University of Central Lancashire could be persuaded to join the team. Xiao compiled The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC), a collection of natural and scripted Chinese conversations and transcripts. Professor Tao was also heading up work on the UCLA Chinese Corpus. Work on both projects seems to have stopped in 2008. However, in 2009 Xiao and a few other editors published a Chinese dictionary with frequency data! I haven’t seen it myself, but according to the reviews on Amazon, it’s not as useful as it could have been (for example, it’s missing pinyin for sentence examples). But it’s a step in the right direction.

If Xiao’s, Tao’s and Jun Da’s data were combined (just to name a few) we could have a huge, useful bank of info that could provide guidance to the users of the Super Dictionary. Would they allow that? What’s in it for them to team up? I’m not sure what their goals are so I can’t answer that.

Also, the challenge of getting a spoken corpus will be great. I think QQ chat transcripts might be useful for that, but who’s going to opt in for sharing those?

5. Online / Mobile Services Use the Database

The Skritter guys had to do a lot of work trimming down the dictionary because the MDBG database they imported wasn’t exactly what they wanted. Now, if I want to start my own website I have to “reinvent the wheel” and start from scratch. Why not let the work that the Skritter guys have done benefit everyone in the future as well? If they’re willing to make their data available, they could provide weighting information (rather than removing things that someone might want to add back in later) so that the Super Dictionary is smarter about which definitions are more useful or important to them.

But it’s not just for services who’ve created their own dictionary (like Skritter has) to contribute back to the Super Dictionary. Let’s take Pleco and their iPhone dictionary as an example. The newest version of Pleco allows you to point your iPhone’s camera at any hanzi text and it’ll translate it for you (see cool video demo). So how about letting the Pleco users opt in to send that info back to the Super Dictionary? Then we’d start to see data about how many users scanned which characters and we’d start to know that those characters are at least in common use.

The Super Dictionary database would also benefit from knowing that there are 2,000 Skritter users who have such-and-such character in their vocab list, but only 20 users who have this other character.

It would need to be determined exactly how to use the data, but it could only help.

I just had a conversation with Ben Whately, co-founder of Memrise, and he said he was getting ready to import a dictionary database because they have more important things to focus their energy on than making a dictionary (they’re compiling some very exciting data that I hope to discuss in a future post). The problem he’s facing is whether to use Adsotrans (which is maintained by David Lancashire of Popup Chinese) or MDBG/CC-CEDICT. I told him I didn’t really think either was good enough on its own.

What he really needs is the Super Dictionary that includes both plus all the work that the Skritter guys and other people have done. And wouldn’t it be great if Popup Chinese also shared the data about which words they used in their learning materials? They wouldn’t have to make the learning materials public, just the data.

But would companies see that as helping their competitors? I don’t know. My hope is that if the Super Dictionary were available for free, and everyone were contributing to making it better and better, it would make new companies focus on offering new services and new tools rather than making their own dictionaries. Instead of a lot of different wheels being invented, we’d start to see a lot of new vehicles (and maybe some toys!) using the same wheels.

6. Paper Books Can Be Printed

CreateSpace has the following characteristics that make self-publishing a paper dictionary extremely desirable:

  • Very high quality printing and many paperback book size options available.
  • All you have to do is upload a PDF and they make the book. You buy the first book (called a proof) for about $5-7 and then the book is available on Amazon for the world.
  • It’s completely free for the author to set up a book. Various add-on options are available for a price, but CreateSpace make their money each time the book sells rather than when it’s set up. Even the ISBN is provided free by CreateSpace. You only pay for the book itself when it’s printed.
  • It takes about one week from the time you upload the PDF to the time the book is ready to print. That means if you want to update the book, you upload a new PDF and a week later, the new version of the book is ready. You can do that as many times as you want.
  • The book goes directly on to Amazon so that others can benefit from it as well. You set the price of the book (above a certain minimum price based on the cost of production). CreateSpace prints and ships the book within 24 hours of an order made on Amazon.
  • An infinite number of dictionaries with various options can be printed as long as there’s a PDF for each one.

That means that as long as there’s something built into the Super Dictionary that allows users to select various options and output a PDF, the information doesn’t have to stay locked into an online format. The options could include the following:

Total Words

You would be able to choose how many headwords you want in the book. You would also be able to choose how many definitions you want for each word. For example, if you just want a little travel dictionary, you might go with only the most frequently used and popular words and definitions. In this case you’d have the definition for “jiǎ” 甲 be only “one; armor; nail (finger or toe)”. But if you’re going to be doing some sort of scholarly project, you might want the full list of definitions.
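Once the headwords and definitions carry weights, this kind of trimming is easy to automate. Below is a minimal sketch; the function name and all of the weights are invented, and a real version would hand the result to whatever produces the PDF.

```python
def select_for_print(entries, max_headwords, max_definitions):
    """Pick the heaviest headwords and, for each, the heaviest definitions.

    entries maps headword -> (headword_weight, [(gloss, gloss_weight), ...]).
    """
    top = sorted(entries.items(), key=lambda item: item[1][0], reverse=True)
    selected = {}
    for headword, (_, glosses) in top[:max_headwords]:
        ranked = sorted(glosses, key=lambda g: g[1], reverse=True)
        selected[headword] = [gloss for gloss, _ in ranked[:max_definitions]]
    return selected


# Invented weights, for illustration only.
entries = {
    "甲": (50, [
        ("first of the ten heavenly stems", 5),
        ("one", 40),
        ("armor", 30),
        ("nail (finger or toe)", 20),
        ("civil administration unit (old)", 1),
    ]),
    "滔滔不绝": (10, [("talking non-stop", 15)]),
}

travel = select_for_print(entries, max_headwords=1, max_definitions=3)
print(travel)  # {'甲': ['one', 'armor', 'nail (finger or toe)']}
```

The scholarly edition would use the same data with much larger limits, so nothing has to be thrown away to make the travel edition.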

Region

You could also produce a special dictionary for certain regions. In other words, if you’re going to Taiwan, you could have it just use the words that are commonly spoken in Taiwan. It would save on pages, and you’d be reasonably certain that you could say what you want to say in your area. You could also choose to have traditional characters, simplified, or both.

Sentence Examples

If you want to save more pages, you could opt out of having sentence examples included. If you did want some, you could choose how many sentence examples to include and only use ones that have been marked as popular or useful.

Pronunciation

If you’re a learner of Chinese, you’d want to have pinyin for everything (including sentence examples). But you could choose to have the Chinese side ordered by hanzi or pinyin (most dictionaries order by hanzi but that’s not necessarily the easiest thing for learners).

If you’re a learner of English, you might want to save pages by eliminating pinyin entirely. But you might appreciate English IPA included. That could be one of the options.
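To make those options a little more concrete, here’s a rough sketch in Python of how a set of print options could filter the master data before anything gets laid out as a PDF. Everything in it is made up for illustration: the `Entry`, `Definition`, and `PrintOptions` names, the region labels, and the popularity numbers are my assumptions, not part of any existing dictionary’s data. It also stops at plain text lines; turning those into a PDF would be a separate step.

```python
from dataclasses import dataclass, field


@dataclass
class Definition:
    text: str
    popularity: int  # hypothetical score from searches and votes
    regions: set = field(default_factory=lambda: {"mainland", "taiwan"})
    examples: list = field(default_factory=list)


@dataclass
class Entry:
    hanzi: str
    pinyin: str
    popularity: int  # hypothetical score for the headword itself
    definitions: list


@dataclass
class PrintOptions:
    max_headwords: int = 1000
    max_definitions: int = 3
    region: str = "mainland"
    max_examples: int = 0       # 0 means no sentence examples
    include_pinyin: bool = True
    order_by: str = "pinyin"    # or "hanzi"


def build_print_entries(entries, opts):
    """Return plain-text lines for one print edition, filtered by the options."""
    # Keep only the most popular headwords.
    top = sorted(entries, key=lambda e: e.popularity, reverse=True)
    top = top[:opts.max_headwords]

    # Naive ordering by raw string; real pinyin or stroke ordering needs more care.
    if opts.order_by == "pinyin":
        top.sort(key=lambda e: e.pinyin)
    else:
        top.sort(key=lambda e: e.hanzi)

    lines = []
    for entry in top:
        # Keep definitions used in the chosen region, most popular first.
        defs = [d for d in entry.definitions if opts.region in d.regions]
        defs.sort(key=lambda d: d.popularity, reverse=True)
        defs = defs[:opts.max_definitions]
        if not defs:
            continue
        head = f"{entry.hanzi} {entry.pinyin}" if opts.include_pinyin else entry.hanzi
        lines.append(f"{head}: " + "; ".join(d.text for d in defs))
        for d in defs:
            for example in d.examples[:opts.max_examples]:
                lines.append(f"    {example}")
    return lines


# Invented popularity numbers for the "jiǎ" example from the text.
jia = Entry("甲", "jiǎ", 500, [
    Definition("one", 90),
    Definition("armor", 80),
    Definition("nail (finger or toe)", 70),
    Definition("first of the ten Heavenly Stems", 10),
    Definition("shell; carapace", 5),
])

for line in build_print_entries([jia], PrintOptions(max_definitions=3)):
    print(line)  # 甲 jiǎ: one; armor; nail (finger or toe)
```

A scholarly edition would just be the same call with a much larger `max_definitions`, and an edition for learners of English could set `include_pinyin=False`.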

Challenges

I’m not aware of anything like the proposed system for the Super Dictionary.

Just to summarize, I think the innovations with this system would be:

  • Combining dictionary data from competing, or at least separate, companies into one master database.
  • Adding regional data based on user location.
  • Weighting headwords and definitions for popularity based on user searches and direct “thumbs up” style voting.
  • Infinite, customizable print dictionaries created from the data.
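To give a feel for the second and third points, here’s a minimal sketch in Python of how searches and “thumbs up” votes might be tallied per region. The class name, the method names, and especially the rule that one vote counts as much as five searches are all assumptions I’ve made up for illustration; the sample counts are invented too. A real system would need something much smarter statistically.

```python
from collections import defaultdict

# Assumption: one explicit thumbs-up is worth as much as five searches.
VOTE_WEIGHT = 5


class PopularityTracker:
    """Tallies how often each Chinese word is chosen for an English word, per region."""

    def __init__(self):
        # (region, english_word) -> {chinese_word: score}
        self.scores = defaultdict(lambda: defaultdict(int))

    def record_search(self, region, english, chinese):
        """A user in `region` looked up `english` and picked `chinese`."""
        self.scores[(region, english)][chinese] += 1

    def record_vote(self, region, english, chinese):
        """A user in `region` gave `chinese` a thumbs-up for `english`."""
        self.scores[(region, english)][chinese] += VOTE_WEIGHT

    def ranked(self, region, english):
        """Return the Chinese words for `english` in `region`, most popular first."""
        candidates = self.scores.get((region, english), {})
        return sorted(candidates, key=lambda word: candidates[word], reverse=True)


tracker = PopularityTracker()
for _ in range(3):
    tracker.record_search("mainland", "spoon", "勺子")
tracker.record_search("mainland", "spoon", "汤匙")
tracker.record_vote("taiwan", "spoon", "汤匙")

print(tracker.ranked("mainland", "spoon"))  # ['勺子', '汤匙']
print(tracker.ranked("taiwan", "spoon"))    # ['汤匙']
```

The ranking from something like this is what would tell a learner which one word to use in most situations in a given region, and it could feed the print options too.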

Wikipedia is a good example of the power of collaboration, but there’s nothing built in to allow third parties to use and then contribute data back into the database. They’ve got the support of their own foundation (the Wikimedia Foundation), which we don’t have. Also, Wikipedia offers printed books through a company (PediPress) that handles all the printing details. We don’t have anyone like that helping us either. But we don’t really need it because CreateSpace is so easy to use that anyone can make a book! However, there are a few issues to solve before this can all come to be.

Legal

Who would “own” the Super Dictionary? There are “copy-left” licenses that could be applied, but decisions will have to be made, and some thought needs to be given to the legal side of this project. Also, participating users and companies will need to know what their rights are.

Talent

This sort of project will require a team of very smart people who are good not just at programming, website design, and user interface, but also at data management and statistical theory, and who have some savvy about China and the Chinese language. I can’t do it. I’m not sure any one person can. And even if we found one person who could do it all, would he or she work on it for free?

Money

Even if we found a team of talented, motivated people who would volunteer their time for this project, the server and bandwidth still cost money.

Nicholas Carr, former executive editor of The Harvard Business Review, said in an NPR interview about Wikipedia and user-created web content: “Pretty much the only business model in what’s called ‘web 2.0’ is to get as many people as possible to look at your site and then feed them advertisements.” Carr seems to be saying that the future of making money off information lies not in owning the information itself but in advertising revenue. I’m not sure whether that would be enough or not.

An academic institution or foundation that just has a bunch of money lying around would be a great solution to the problem. Anyone know someone like that?

I would hate to see collaboration with other companies discouraged because of licensing disputes or making the partner companies shoulder the financial burden of the project. If some companies were willing to chip in and it didn’t discourage their participation, then that would be fine. But I think some thought needs to be given to the bottom line.

Collaboration

The success of the project depends largely on how many partnering companies can be brought together to share their vocabulary usage data (point number 5 on the diagram). All businesses must ask the question “But what’s in it for me?” I’m not sure my answers are good enough for the bottom line: the quality of your product will improve as the database improves. Also, it might be good press for participating companies. I’d love to see some sort of logo that gets slapped on each participating website that shows they’re using and contributing to the Super Dictionary database. If users were educated about what it meant, it would mean the customers would have more confidence in joining a service that’s participating rather than one that’s independent.

But should all companies who want to participate be allowed? Who’s going to screen them? How will their data be used / weighted? These are all problems to solve.

Printing

If the Super Dictionary does lead to a print book (or many print books) as I hope it will, how will that come to be? How will the PDFs required to print the books be generated? Where will the revenue from the printed books go?

Multilingual

There’s no reason why only English should be used for the Super Dictionary. But making a Chinese-multilingual dictionary is a much bigger project than just Chinese/English. Still, with the right talent on board, it might make more sense to design it to accommodate other languages from the beginning so even if it starts out as only Chinese/English, it could be expanded to Chinese/you-name-it more easily.

Action

So what to do now? Any suggestions? You’re welcome to leave comments here on this blog. Or someone could start a Google Group or something. I’ll help if I can.

Comments

  1. why the hell do you want to know 11 ways of saying spoon? As a native speaker of Chinese, I don’t even know some of them.

    Here is the thing, do you want to learn a language for daily uses, or do you want to be an expert?

  2. Great post and great idea, Albert. You limit this to Chinese, but I think you’re really speaking of a problem that is universal with foreign-language dictionaries.

    To give an example, Japanese has this thing called pitch accent that’s pretty much never covered anywhere, and yet it’s key to sounding like a native. It’d be almost as bad as completely ignoring tones in Chinese, as in some cases pitch accent is all that distinguishes what are otherwise homophones (although, like Chinese, context is often sufficient to make things clear).

    @kencanau: The dictionary he proposes would cover both learners and experts. You’d want the 11 ways to say spoon so that when you come across one in Chinese, you can figure out what they mean, even if it’s obscure. However, if the frequency data is incorporated, there’d be no need to learn the 11 ways; you’d just learn the most frequent ways and ignore the rest.

  3. Now that Chinese is becoming a more and more popular language for English speakers to learn, I have a feeling that simple demand will force dictionaries to improve, thanks to folks like you who aren’t just complaining but actively trying to address these frustrations.

    One pet peeve of mine, that you mentioned, is the lack of measure words listed with each noun. Thank goodness for MDBG, because none of my paper dictionaries indicate which measure word goes with a particular noun that I’m looking up, unless it just happens to be mentioned in the sample sentence.

  4. Now that everyone has mobile phones, can’t we just wiki the shit out of this thing? I mean, crowd sourcing, right? An online dictionary, verified by Chinese speakers, linked up to Google Goggles (?) so I can snap a photo of a Chinese sign, and immediately get the translation in very good English.

    Chinese dictionaries would be awesome if we had thousands of native speakers arguing over what X really means. That’s crowd-sourcing, right?

  5. Sounds like an excellent plan. I agree any such dictionary should use the power of today’s smartphones to the full extent possible. It’d be great, for example, if you could get support for this into Plecodict. I think it’d be worth sending the people you mention a link to this blog post, see what they think.

    I have some knowledge of database design and programming, and would be happy to help a bit, but my skills are nowhere near good enough to make this project come to life. Then again, as you said, I don’t think anyone at all has all the skills that you need. You’d ideally want:

    – someone who can liaise with all parties involved, preferably also with knowledge of the legal issues involved
    – someone with knowledge of UI design
    – someone with knowledge of corpus linguistics and natural-language processing
    – someone with insight into lexicographic matters
    – someone with knowledge of database design
    – someone who can code the web interface (preferably in HTML5 so you don’t have to write separate apps to make this available on smartphones)
    – someone who can deal with the print-on-demand part (though I have no idea how; you can actually write server-side scripts to generate PDF files, and there are APIs available to send these to print-on-demand companies automatically, but you need to build in checks to make sure Amazon doesn’t get flooded with dictionaries, etc.)
    – beta testers
    – someone to coordinate all these people, a kind of visionary leader who can concentrate minds and make sure that people all over the world who are working on such a project work together effectively

    And, of course, you would need funds to pay for hosting etc. Maybe you would even want to incorporate the project legally, to make it easier to close content-sharing deals?

    I think it’d be hard to find volunteers who have enough spare time to spend on this project, no matter how great it may be. You could hire people with the necessary skills, but then the issue becomes, how do you pay them? In the beginning, you won’t be making any money, and even after a while, it’ll still be exceedingly hard to make any money at all. So that doesn’t seem to be a very viable idea, either.

    On the other hand, if any of the companies that are already active in the field (Pleco, Skritter, etc) are looking for a great way to gain market share, then I think you’ve just given them their next business idea. At least they would have the necessary skills and capital to make such a project successful – and it’d definitely be a great unique selling point for their software, as well as providing them with possible future streams of revenue, such as selling customised print-on-demand dictionaries to those who still want to stick with paper dictionaries.

    Good stuff. I’ll be sure to keep an eye on this blog and the comments to see where this idea goes.

  6. I’m a high school student, but if anybody can even begin to make this happen, I’d be more than interested in helping in any way I can! I’ve had to study Chinese without a class for a little over a year, and I’ve found that my (many) dictionaries rarely help me. The lack of good dictionaries is frustrating to English students in China, too: One of my best friends is one of them.

    Maybe the Confucius Institutes would be interested in a project like this? The hardest part would probably be getting the idea to them; after that, they may have the resources to do it: All the Confucius Institutes are branches of HanBan (http://english.hanban.org/node_7719.htm), a division of the Chinese Ministry of Education that focuses on Chinese language education worldwide. That could be a little far-fetched, but a government might be one of the few entities large enough to put forth the organization, resources, and experts required…

    Anyway, my friends and I would love to see something like this! It would be so helpful!

  7. The answers are out there on the Web if you use a search engine — it’s just not as convenient as having them in a dictionary. For example, if you google [“grade on a curve” 成绩], with the language set to Chinese in the advanced settings of Google, you will find things like the following:

    (1) grade on a curve — 按成绩分布曲线打分,是美国的一种打分机制
    http://blue-salon.cn/dv_rss.asp?s=xhtml&boardid=19&id=586&page=1&star=3&count=33

    (2) Since everyone worked hard and they thought that I was going to grade them on a curve, there was a lot frustration that they wouldn’t properly be recognized for their work. (In fact I gave half of them A’s in the end.)
    每个人确实学得很认真,他们认为我会按照分布曲线来评定成绩,担心自己的最终成绩与实际付出得不到正确匹配,让他们很有挫败感(其实,最后有一半学生我都给了优秀)。
    http://xuedehui.com/15/ben-tilly/

    So it seems that something like
    按照分布曲线来评定成绩 or
    按成绩分布曲线打分
    would be what you are looking for.

    If there is no simpler way of saying it, then that could be because it is a concept that is not very familiar in China? One of those quotes says 是美国的一种打分机制. As a matter of fact, I was unfamiliar with the term “grade on a curve” myself, and had to look it up. I’m an Australian teacher, and in my experience we refer to “standardizing the marks” rather than “grading on a curve.”

  8. Re: “Why isn’t “洗澡 xǐzǎo” listed in the E→C side? It’s an oversight, that’s all.”

    It’s not simply an oversight, in my opinion.

    The reason that E-C dictionaries say 淋浴 and not 洗澡 for “shower” is that those dictionaries are generally written for Chinese people who are learning English, *not* for English-speaking people. That sort of entry is not a problem if you are a native speaker of Chinese — you read 淋浴 and you immediately understand what “shower” means. A native speaker doesn’t need the dictionary to tell her that 洗澡 is another word she could use.

    淋浴 is used because it unambiguously means “shower”, whereas 洗澡 could be ambiguous. It’s not that they think 淋浴 is the most suitable word to use for “shower” in most situations.

  9. @Richard Warmington,

    Thanks for the “grade on a curve” info! I’ll check that out and see if anyone can understand me when I say it 🙂 Oh, and I think you’re right that “It’s not simply an oversight, in my opinion”. It’s clear that that dictionary especially is for Chinese people living in England (all the money discussed is in pounds rather than yuan).

    @kencanau,

    I cannot emphasize enough how badly I do NOT want to know 11 words for spoon 😉 That’s actually exactly my point. How can I know which 1 word for spoon is the best in most situations in most parts of the country? The Super Dictionary could tell me!

  10. Re: “It’s clear that that dictionary especially is for Chinese people living in England”

    It’s not necessarily for Chinese people in England, I would think — more likely for Chinese people in China, a much larger market! 😉 The fact that they refer to British currency is due to its being a “collaboration of scholars working in Oxford, [former British colony] Hong Kong, and mainland China … and is based on research in both the Oxford English Corpus and the LIVAC corpus from the City University of Hong Kong”

  11. Excellent idea. I’ve been thinking about this recently as well, but not in as much detail. This should be done and I think it’s possible. I really would say we should get the ball rolling as soon as possible, but it seems like someone is going to have to stump up cash up front for the system to be programmed. Crowd sourcing the content is one thing, but the whole interface and backend is going to need some hardcore coding first.

  12. Excellent idea. I especially like the idea of leveraging the information contained in the searches themselves, as an indicator of what is useful and not, and the underlying regional differences.

  13. One resource that can give you *useful* definitions is Chinesepod’s Glossary.

    For instance, if you look up “shower”, you get example sentences starting with the following six:

    他去洗澡了。(He went to take a shower.)
    我没洗澡。(I didn’t shower.)
    你要洗澡吗?(Do you want to take a shower?)
    好的,我去洗澡。(OK, I’ll go take a shower.)
    我每天都洗澡。(I take a shower every day.)
    听不见,我在洗澡!(I can’t hear. I’m taking a shower!)

    But my favourite example is
    哎,穿着航天服真不舒服。一到空间站,我就去洗澡。
    (Agh, wearing these astronaut suits is really uncomfortable. I’m taking a shower as soon as we arrive at the space station.)

    The meaning and pinyin for each Chinese word in the example sentences is given via mouseover.

    http://www1.chinesepod.com/tools/glossary/entry/shower

  14. I think that many of the problems with bilingual dictionaries are due to the fact that they have been targeting Chinese native speakers. This would help to explain why, as Albert complains, “they give pinyin for the headword, but leave it out for the usage examples.”

    The reasons for the focus on Chinese people up to now are, I assume,
    (1) China’s economic development preceded the current wave of fascination with China, and that development necessitated interaction with the world outside China
    (2) You could sell more dictionaries to Chinese people than to foreigners

    But I would think that dynamic is changing as the number of people learning Chinese burgeons in response to China’s economic growth. I think Albert has now identified a need for better dictionaries for a growing sector of the dictionary market: people learning Mandarin.

    A similar scenario unfolded for the Japanese language, as learning materials matured over the years following Japan’s growth in the ’50s, ’60s, ’70s and ’80s. I think this will happen for Mandarin as the market demands better Chinese language learning tools.

  15. Producing a dictionary involves a lot of drudgery. Dictionary pioneer Samuel Johnson famously defined a lexicographer as “a harmless drudge.”

    How does one motivate “a team of experts” to “correct and enhance the data that are coming from all the common users”? Much of that work would be uninspiring. And the sheer volume of such work should not be under-estimated. I speak as one who has spent literally thousands of volunteer hours on CC-CEDICT since 2008, and made only a very humble mark on that dictionary.

    I think that the CC-CEDICT project works by focusing on an interesting part of the task — getting the English definition of Chinese terms right. The CC-CEDICT editors see, again and again, how other dictionaries have copied each other, given incorrect definitions, and omitted commonly-used words. (One editor recently added 包袋 “bag”, which has 175 million Google hits, but is not found in other dictionaries.)

    The process of rethinking and rewriting the definitions of Chinese terms is *creative*, and it’s fun showing up other dictionaries! 😉 And in the process one provides the learning community with a freely downloadable resource.

    However, some other tasks in Albert’s project — like assigning part-of-speech tags, or screening errors from “common user” contributions, or identifying typos in a QQ-chat-transcript corpus — might be less creative and less edifying, yet require a high level of expertise. Money might need to change hands.

  16. Here’s an idea. I don’t know a lot about how Wiktionary works, but you might consider linking your project in with Wiktionary’s infrastructure. They already have Chinese definitions of English words. As an example, if you go to

    http://en.wiktionary.org/wiki/shower
    and then the “Verb” section,
    and then “Translations — To bathe using a shower” (click on the “Show” icon)

    … you will see that their first definition of “to shower” (in the Mandarin section) is, in fact, 洗澡.

    Wiktionary may not currently be so useful for finding the right word for “go”, “Velcro” or “spoon” though. But I guess there is nothing stopping contributors from improving those entries.

    Well, maybe there is — I wouldn’t have a clue how to contribute. When I have tried to correct Wikipedia articles in the past, I found that the first hurdle (working out how to contribute) was intimidating.

  17. @Richard Warmington RE: ChinesePod Glossary,

    I would LOVE to see the ChinesePod Glossary made more widely available and included in the Super Dictionary. They produce such great quality work. The problem is there is no “one stop shop” for all the answers. You have to google some stuff, go to ChinesePod for some stuff, go to Wiktionary, that dictionary, etc. I’m trying to find a solution to the deeper issue. But, as you said, it might need some money. But there are people with money in the world. The problem is: there’s no real concrete way to MAKE money off this project. Too bad that’s the guiding force behind most decisions. But it is. Oh well, a guy can dream, right?

  18. @Daan,

    Wow! That’s EXACTLY what we need…maybe. Look at this:

    “A team of computational linguists at Carnegie Mellon University led by Jacob Eisenstein and Brendan O’Connor has used geocoded tweets to build maps of regional language use across the United States.”

    There are problems, of course, with the regional thing. But as for frequency and new slang it’s great! Much more valuable as a linguistic barometer than as an emotional one, in my opinion.

  19. I’m afraid I have nothing of substance to add, but I still want to say how awesome I think you are, Albert. And your post as well, of course. When I looked at the title, I thought there would be lots of things to add/comment, but I really think you’ve captured the core problems neatly. Great!

    Slightly off topic, I would like to say that Longman should be a model not only for C/E dictionaries, it should be a model for C/C ones as well. Longman is awesome for foreigners (and natives). See more here:

    My article about Chinese-Chinese dictionaries

    Forum discussion about the lack of usable C/C dictionaries

  20. Albert,

    I like your idea and your noble spirit. Here are a few things that came to my mind after a quick read of your post and the many comments.

    1. A project of this scope and nature needs to be financed by someone with a very deep pocket and an equally noble spirit. I believe Warren Buffett or Bill Gates can be talked into it. With the funding, there will be no problem finding the needed computer and language talents, lawyers and managers. Everyone else is still welcome to contribute voluntarily.

    2. It seems no mention has been made of audio capability. This is a must for an on-line dictionary for beginners.

    3. Will the users be permitted to copy the contents at will? If so, anybody will be able to make his/her own dictionary (in eBook or print format). And if this is to be an altruistic project then why not?

  21. This could be great
    But just in my own opinion: I think the method is heading in the wrong direction. After you’ve been frustrated with, say, a Chinese/English dictionary for a certain amount of time, you have to use a Chinese dictionary for native speakers to go to the next level. I don’t think any C/E dictionary would ever be good enough; just educate yourself using a Chinese one.

  22. @Evi — “Maybe the Confucius Institutes would be interested in a project like this?”

    It would be a shame if they were involved, in my opinion. They promote only limited study of Chinese language. Hanban has provisions to allow the withdrawal of funding from any institution that would dare to teach traditional characters (which were used by Confucius himself, of course — so ironic!). Likewise, they will not admit other kinds of diverse understanding of the Chinese language. You should teach only Putonghua, and not literary Chinese (the language of the Analects), nor Hokkien, nor Cantonese. And the content of learning materials must be compatible with the views of the CCP. Certain topics are not to be discussed. The Confucius Institutes “exist for the express purpose of letting foreigners understand China on terms acceptable to official China.”
    http://www.chinaheritagequarterly.org/articles.php?searchterm=026_confucius.inc&issue=026

  23. I’ve been thinking about this idea for about 4 years now. Thanks for this post.

    I’ve gradually been gaining the experience needed to implement it, too.
    I’ve studied lexicography.
    I’ve studied corpora.
    I’ve taught myself programming.
    I’ve learned database systems.
    I have audio content for the dictionary.

    I strongly believe in creating a task based approach, similar to what you wrote about. It makes it easier to get others involved and sets clear goals for the project.

    I have been working on a project that fits much of what you want to do on the C-E side (unfortunately, I haven’t had time to work on the E-C side yet). The short of it: it’s a corpus backed dictionary with all that it entails. It’s not out yet, but part one will be ready in the next few months. I don’t plan on rolling out all the features at the same time, but gradually, I will create different features similar to what you mentioned.

    By the way, there is a google group called Chinese Dictionaries ( http://groups.google.com/group/chinese-dictionaries ).

    I do plan on making simple versions of the dictionary available to the public via a permissive license. So you’d be able to get a simple copy with entries, pronunciation, definitions, but then I’d want to sell more sophisticated versions with examples, collocations, frequency data, etc. to app developers and website developers.

    About Skritter’s dictionary: I haven’t talked to the Skritter people about this, but if they modify cc-cedict to create their dictionary, they have to make a version of their dictionary available via the same license as cc-cedict (cc-by-sa). So it should be available somewhere.

    If anyone is willing to lend their skills to my project, please contact me (stevendaniels at lingomi.com) and let me know.

  24. “LONG ARTICLE”. But well worth reading! There are some good solutions offered. As a learner at a relatively low level, I need precise C-E and E-C dictionary support more than anything, to move ahead.

    I will watch for your CreateSpace book.

  25. Such a great idea! I’ve just started my first year in computational linguistics, after having learned Chinese for about 1 year, and this is just the kind of tool I felt was missing during my studies. My knowledge of natural language processing etc. has yet to mature enough for me to be of any help in this project, but I’m still very interested to follow up on this project.

    One idea I’ve had is whether or not you could use the huge amount of fan made subtitles out there (for example, opensubtitles.org) to make a collection of actually used example sentences, since they are time coded and translated into many different languages. Not that I don’t appreciate communist propaganda in my dictionary, but I feel like there should be at least an alternative. 😉

  26. @Jimmy

    The subtitles idea is a great idea. It would be a good help in building an E-C dictionary (something I don’t have a great strategy for). One thing I’ve noticed about Chinese subtitles for TV shows is they often don’t translate colloquial phrases very accurately. Sometimes they are just guessing at the meaning.

  27. I just got the “ABC English-Chinese/Chinese-English Dictionary” (published one year ago) as part of an upgrade to Wenlin 4.0. For “spoon”, they have just one word: 勺子. However, the entry is cluttered because ABC too couldn’t resist catering to the Chinese market:
    spoon [spuːn] {BE/A} n.c. sháozi 勺子.

    A native English speaker would find the pronunciation information [spuːn] superfluous. “BE” indicates that “spoon” is part of the “Basic English” vocabulary set. That, too, is unnecessary for most Chinese language learners. The “A”, however, indicates “Grade A HSK wordlist”. That seems a bit strange, since “spoon” is not a Chinese word!

    For Albert’s example “go”, there are 41 different senses listed in Encarta English Dictionary, but their #1 sense, “depart”, is simply not in ABC’s definition of “go” as far as I can see! ABC does mention 走, but only for the sense in 我的表不走了– “My watch isn’t going.”

    The example sentences in the ABC E-C Dict all have pinyin in the print version of the dictionary, but within Wenlin this just becomes more clutter, since Wenlin provides pinyin on mouseover.

    Speaking of “print version”, I am not sure why you place emphasis on the printing of your Super Dictionary, Albert, when an electronic version would be much more easily searchable, as well as much lighter to carry around (even an unabridged edition of it).

  28. Since I have already done a lot of the “hardcore coding” required for a dictionary like this, and since I’ve wanted to do something like this for a while now, I’m going to go ahead and kick off the project on a website I’m working on called 3000Hanzi.com.

    Here’s my mission statement:
    The world needs a more humane Chinese dictionary.
    Here’s my goal:
    To create an E/C and a C/E dictionary.

    Here’s a blog post for details:
    http://3000hanzi.com/blog/the-chinese-dictionary-for-the-future-the-3000-hanzi-dictionary-project/

    Here’s a link to the mailing list:
    http://eepurl.com/gVBAL

  29. If you had a lot of power/money/influence, what you could do to solve the weighting problem is send forms to a few thousand Chinese people living in Beijing and give each household a task: rate a list of 20-30 words for their usefulness in daily life and give examples of ordinary usage.

    And then do it in lots of other regions.

    And then put it all in the database with the pinyin. You would probably need to hire a lot of people to do that.

  30. @James

    There are serious problems with asking regular people for usage examples. Most people don’t think about the idiosyncrasies of language on a regular basis. If you asked 10 people what the difference between and 觉得, you’d get lots of different answers. Some people would swear one has to be used in a certain situation, others would disagree.

    Also, at any given moment, someone doesn’t have access to the entire usage of a word. If you asked an English speaker what the word “set” means, they’d be able to offer a few definitions, but it wouldn’t be close to complete. Even aggregating answers from a lot of people is likely to miss some nuances.

    Lastly, people are poor at writing examples. If you ask someone to write a sentence, it will be less natural than if you just used a corpus to look for the usage. The Tanaka Corpus (bilingual Japanese sentences), which was created by asking students to write example sentences, displays this phenomenon: the sentences aren’t that great. (Even so, I like the Tanaka Corpus and a similar website called Tatoeba.)

    Lexicographers usually look at hundreds or thousands of sentences from a variety of sources to understand the usage of a word. It takes some training to figure out what to look for, but overall, it’s much more efficient.

  31. Bravo Steven for taking the lead to act. I think it will help some people decide to jump on the bandwagon if you could add the following to your blog post:

    1. One complete example of an E-C and a C-E entry.

    2. A better idea of what’s involved and how much time/effort is expected of the people who want to help.

    3. The time estimate for completing the 3000 word dictionary.

    If these 3000 words are properly selected, they will allow the student who has studied them all to read a Chinese newspaper, and this “humane” dictionary could become required reading for the Chinese classes offered in colleges. 🙂

  32. Yes, I can identify with the frustration of dictionaries that don’t have the words you’re looking for! (This is particularly frustrating for proper nouns).

  33. I’ve just come across this blog, and I can relate exactly to the feelings expressed in it. Now, what’s the state of affairs? Any improvements?
    I think this idea should be boosted through worldwide publicity on forums, blogs, and at interested institutions.
    Thanks in advance!

  34. @George: I’ve commented (and blogged) about this idea before. I’ve also done a lot of work towards making my similar vision a reality. At this point, I’m about 80-90% done with a project. I’m hoping to get it past the test phases and into production early 2014.

    I’ll provide news when I have some.

  35. Hi! I just came across this post and can perfectly relate to it. I’ve always thought that with the tech we have nowadays, especially the internet, it must be possible to achieve this in a reasonable period of time, once every eager blogger and forum user is involved and willing to work on the project.
    Hope to hear news soon and that this idea doesn’t vanish.
    Cheers.

  36. Hi, thanks for a fine article.

    You’ve written:
    Re: “Why isn’t “洗澡 xǐzǎo” listed in the E→C side? It’s an oversight, that’s all.”

    Well, it might be considered an oversight if it were but one or two such situations. I’d been using the Oxford Pocket Chinese Dictionary and I was always wondering why most of the words are given differently on the E/C side and on the C/E side. I’ve been making stupid mistakes because I was not able to find the correct word. Once I checked several (more than 30) definitions with my native-speaker teachers, and we found that 75% of the definitions on the E/C side were wrong according to them, whereas those on the C/E side were all correct.

    Since then I avoid using Oxford dictionaries.

    I agree with Richard Warmington when he says that “the reason that E-C dictionaries say 淋浴 and not 洗澡 for “shower” is that those dictionaries are generally written for Chinese people who are learning English, *not* for English-speaking people”, but there is yet another reason why the two sides of the same dictionary do not match. If you look at the names of the people who prepared one side and those who prepared the other side in this dictionary of mine, you’ll find that they are two completely different groups. Not a single person has worked on both sides. They would’ve needed omniscience to guess the other group’s linguistic intuitions. This is reason No. 1.

  37. @Maciej,

    Yes, you’re right. It’s a little more than an oversight. It’s an oversight in the whole system of putting the dictionary together. And Richard is right too.

    Now that I have a smart phone and Pleco, I can’t think of why I’d ever use anything else. Except that I don’t like how the battery can die on my dictionary 🙂

  38. I’ve been designing a dictionary like this for years, and I’m probably going to release it in a year or so, though my approach to the database is different, and my decisions on term addition differ. Also, the dictionary I’m developing has features not covered in this article (e.g. whether a term is archaic, whether it merits a sample sentence, etc.).
