RIDICULOUSLY LONG ARTICLE AHEAD: I’ve had this stuff in mind for about 6 years and I finally just needed to get it off my chest. I don’t imagine future posts will ever be as long as this.
The Future of Chinese/English Dictionaries
Albert Wolfe, October 2011
This blog is usually for us “Old Hundred Names” (common folk) learning Chinese, but this article is different because I’m appealing for help and trying to cast a vision for the future. Unlike my usual posts that have something immediately applicable for learners of Chinese, this article is one step removed and talks about a gap in the materials, something we learners of Chinese need but can’t get for ourselves: a complete and useful Chinese-English/English-Chinese (C/E) dictionary.
In this article I’m going to describe the ideal “Super Dictionary” of the future in hopes that people who can bring it about (which includes us Old Hundred Names, as you’ll see) can make it so.
This article is inspired by:
- Six years of frustration with the currently available C/E dictionaries.
- MDBG: the closest thing I’ve seen to the community project required to solve the problem.
- The Longman English Dictionary: a model for the sort of information a foreign language learner needs in a dictionary.
- The advent of CreateSpace: high-quality, super cheap, print-on-demand self publishing.
- Skills 4, 9, and especially 10 from The Institute for the Future’s “Future Work Skills 2020 (pdf)” report.
Missing Entries and Definitions (E→C)
Missing Entries and Definitions (C→E)
Missing corresponding words (C↔E)
E→C: Can’t find the word everyone uses
E→C: Too Many Choices (and Different Dictionaries Don’t Agree)
C→E: Too Many Definitions
The Problem of Pages
What We Need
How We Can Get It
We learners of Chinese turn to dictionaries to answer the following two questions:
- What does (Chinese word) mean in English?
- How do you say (English word) in Chinese?
Problems trying to find the answer to question number one, going from Chinese to English (C→E), are less common than problems with question number two. But there are still a disturbing number of times when the dictionaries we consult aren’t complete enough to give either the appropriate definition of a Chinese word or the word itself. There are also some English words that do not appear as headwords in the dictionaries, making an E→C search frustrating.
The majority of our problems with C/E dictionaries come from trying to answer question number two, going from English to Chinese (E→C). We cannot trust that the Chinese definitions given in our dictionaries are the appropriate ones to use. Furthermore, cross-checking multiple dictionaries often yields different words rather than confirmation of certain words. So we are at best insecure about our word choices and at worst secure but wrong. We need a more complete and more useful dictionary.
Missing Entries and Definitions (E→C)
- Common words and phrases. “Velcro” and “grade on a curve” are missing from every dictionary I’ve ever seen. (Velcro has been added to MDBG, but I still don’t know how to say “grade on a curve,” even though I’m required to do it at my college here in China. They just explain it in terms of the number of As, Bs, etc. I’m allowed to give.)
- Special / technical terms. For example, medical terms seem to be getting slowly added to the online dictionaries. I needed a new Albuterol inhaler and had to figure out how to say it without the aid of a dictionary. (Here’s the story of how I found out and then added it to MDBG).
Missing Entries and Definitions (C→E)
- Common words or definitions. “Yàobù” 要不 can mean: “How about (we do something)…?” That particular meaning doesn’t appear in any of my paper dictionaries and only appears in sentence examples at Nciku but not in the definitions. (It has since been added to MDBG.)
- Proverbs / idioms. One of my English students asked me how to say “yīn ài chéng hèn 因爱成恨” in English (so we know it’s commonly known and used, not just some ancient literary phrase). I didn’t know, so we looked it up. Neither the online dictionaries nor my paper dictionaries, including a specialty proverbs/idioms dictionary, had it, yet it got over 2 million hits on Google. (I’ve since added it to MDBG.)
- New slang. Of course, every language is evolving. That’s why adding new slang like “gěilì” 给力 is essential for us to keep up with the modern usage of Chinese.
Missing corresponding words (C↔E)
Paper dictionaries have a special problem that online dictionaries don’t have to deal with because of the nature of online searches.
I’m going to pick on the Oxford Minidictionary (affectionately known as “Chubby”) for a moment. Here’s the entry under “shower” in the E→C side:
And then it moves on to rain, which we’re not interested in right now.
Now the problem is that in the C→E side you can find this entry:
“洗澡 xǐzǎo vb to have a bath/shower”
So, why isn’t “洗澡 xǐzǎo” listed in the E→C side? It’s an oversight, that’s all. This is just one of many examples of inconsistent internal cross-references that occur in every paper C/E dictionary I’ve ever used (see another example).
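This kind of oversight is easy to catch mechanically once both halves of a dictionary are in a database. Here is a minimal sketch of such a check. The two tiny dictionaries are made-up toy data loosely based on the “shower” example, and the function name `missing_reverse_entries` is my own invention, not part of any existing tool.

```python
# Toy data (an assumption for illustration): the two halves of a paper dictionary.
e_to_c = {
    "shower": ["洗淋浴"],
}
c_to_e = {
    "洗澡": ["bath", "shower"],
    "洗淋浴": ["shower"],
}

def missing_reverse_entries(e_to_c, c_to_e):
    """Return (english, chinese) pairs that appear in C→E but not in E→C."""
    missing = []
    for chinese, english_words in c_to_e.items():
        for english in english_words:
            # The E→C side should list this Chinese word under the English headword.
            if chinese not in e_to_c.get(english, []):
                missing.append((english, chinese))
    return sorted(missing)

for english, chinese in missing_reverse_entries(e_to_c, c_to_e):
    print(f"{chinese} is defined as '{english}' but is not listed under '{english}'")
```

An online database would run a check like this on every edit, so the two directions could never drift apart the way they do in print.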
Even though incomplete dictionaries are frustrating because we can’t find a word we’re looking for, the problem of usefulness is much more urgent. When we want to find how to say an English word in Chinese, even if the dictionary contains an entry for the English and Chinese, we cannot trust that the word we get is the right one.
E→C: Can’t find the word everyone uses
The Problem of “Shower”
First of all, “xǐ zǎo” 洗澡 is definitely the most-used word for “to take a shower / to bathe” all over the country. I can even remember a joke (one of the few I understood) from the Spring Festival Variety Show (春节晚会) a few years ago when one of the actors used it. There are other ways to say it, and the noun and verb form are different (as confirmed anecdotally by this post), but I’m convinced that “xǐ zǎo” 洗澡 is the word to use.
So we need that to be indicated in the dictionaries. Remember, the chubby little Oxford Minidictionary only gives “xǐ línyù” 洗淋浴, and I’ve never heard that used once. My shower post on this blog confirms that “xǐ zǎo” 洗澡 is missing from other dictionaries. Yet it is the word everyone seems to use for something they do every day. That makes the dictionaries useless for answering the question, “How should I say ‘shower’ in Chinese?” And there are many other examples just like “shower.”
The Problem of “Go”
My friend Brad asked me a question his first month in China that perfectly illustrates another aspect of the learner-unfriendliness of our dictionaries.
“Hey, how do you say ‘go’? You know like, ‘I’m going now.’”
I explained “go” in English can be translated into many different words in Chinese, but in this situation “zǒu” 走 is the best.
Lenny gives “qù” 去 first (which would be used in “I’m going to China”). Next is “líkāi” 离开 which would work for Brad’s example, but isn’t as common as “zǒu” 走, which appeared halfway down the list under “I must be ~ing”. The translation is good: “wǒ děi zǒu le” 我得走了. But there is nothing to indicate which of the four words is “go.”
Chubby gives almost a full two pages to “go” and various collocations of the word, arranged in alphabetical order rather than by usefulness or frequency. First is “go across” chuānguò 穿过 and then “go after” (physically) zhuī 追. “Zǒu” 走 makes a few appearances, but is buried among such nuggets as “go off (when talking about food becoming bad)” biànzhì 变质.
Online dictionaries don’t help much either. A search at MDBG for “go” gives 100 results (on the first page). “Qù” 去 is number 9 and “zǒu” 走 is number 15. A search at nciku shows 去 first (no pinyin until you hover your mouse) and then “zǒu” 走 after about 20 other entries.
E→C: Too Many Choices (and Different Dictionaries Don’t Agree)
The Problem of “Spoon”
Finding out how to say “spoon” is hard too. Just to clarify, this is not a Western invention that is rare in Chinese restaurants (like the fork). I’m not demanding that the Chinese come up with and agree on a word for a strange foreign object. I’m talking about the sort of spoon that every single restaurant in China brings every customer automatically with soup or fried rice, and has been doing so for centuries.
Let me show you the entries for “spoon” in the various dictionaries I’ve got lying around:
- Oxford Minidictionary: sháozi 勺子 / chízi 匙子
- Langenscheidt: sháo 勺
- Commercial Press: chí 匙 / tiáogēng 调羹
- Readers of this blog contributed 6 more: tāngchí 汤匙 / piáogēn 瓢根 / piáozi 瓢子 / sháor 勺儿 / chígēng 匙羹 / qǐgēng 匙羹
- Total ways to say spoon: 11
The little “zi” 子 and “r” 儿 endings may not be as important as the other differences. But still, how should I, a learner of Chinese, go ask the waitress to bring me a second spoon at the restaurant tonight? And what’s the difference between all of the choices? Are some more “correct” than others? Judging from the comments, the differences seem to be largely regional. So which word is most likely to be understood no matter where I go in China? These are questions that I can’t find the answers to.
The Problem of “Bus”
Sometimes, the differences between the words are more about usage than region. Take “bus” for example. I’ve heard all the following words used for “bus” in the same general area:
The word you choose depends on the situation you’re talking about. For example, “bus station” uses qìchē 汽车 but “city bus” is usually gōngjiāo chē 公交车. To be truly useful, the dictionaries need to explain the usage differences and give example phrases or sentences to illustrate the differences.
C→E: Too Many Definitions
Especially in online dictionaries, where no editing choices need to be made to save pages, there are often just too many definitions for a single Chinese headword. As a result, the guys at Skritter have started trimming down MDBG’s CC-CEDICT database, which Skritter uses for their excellent writing training site. They’ve ended up creating their own version of the MDBG dictionary—one that they feel is more useful to learners.
Here’s MDBG’s definition:
jiǎ 甲: first of the ten heavenly stems 十天干[shi2 tian1 gan1] / (used for an unspecified person or thing) / first (in a list, as a party to a contract etc) / armor plating / shell or carapace / (of the fingers or toes) nail / bladed leather or metal armor (old) / ranking system used in the Imperial examinations (old) / civil administration unit (old)
And here’s Skritter’s definition:
jiǎ 甲: one; armor (1st Heavenly Stem)
For beginners, I think Skritter’s is much more useful. I would suggest adding “nail (finger or toe)” to Skritter’s as well. But the point is: Skritter has helped the learner sort through the huge volume of information by simply removing what they feel is less important.
Just to give MDBG a break, the goal of MDBG and the CC-CEDICT database behind it is to provide a one way, Chinese-English translation tool that provides the most complete English definition list possible (including all the ancient meanings). But for learners of Chinese, it’s not as useful as we’d like.
There is a solution that can meet both goals: weight the definitions for usefulness rather than removing them. I’ll explain how in the next section.
But sometimes the definitions are so out of date they’re laughable and need to be edited rather than just weighted. For example, compare the following:
(Skritter) tāo tāo bù jué 滔滔不绝: (saying) talking non-stop; gushing; torrential
I think it’s obvious that the Skritter dictionary’s definitions are more appropriate for this century. In this case I would recommend changing the MDBG definitions rather than just weighting them.
The Problem of Pages
It would be impractical to include every Chinese and English word in a printed dictionary (see what happened when a UK student printed some of Wikipedia). Editors have to be selective. But before that selection can happen, all the data and definitions need to be compiled.
I understand that printers of dictionaries such as my old favorite the Oxford Minidictionary have to pick and choose what to add. But I have the distinct feeling that those choices are not made very scientifically. For example, consider this entry that made it into the 633 pages of the little dictionary:
showjumping n qímá yuè zhàng yùndòng 骑马越障运动
And yet headwords such as “similar” are missing.
How are the decisions made about what to include in paper dictionaries? Could it be that one of the editors, Boping Yuan or Sally Church, was interested in showjumping? I’m interested in a more scientific approach involving sorting huge amounts of data for frequency, popularity, and usefulness to inform the choices of what to include or not.
To do that, we need to make the core database for the Super Dictionary an online resource. Online dictionaries have the advantage of virtually limitless space that anyone on the planet can access at any time. At the time of this writing, there is still no single database that contains all known words and phrases.
Of course, there’s no way to ever have a 100% complete Chinese-English dictionary because of the nature of language change. There will always be new slang and new terms coming out. But we haven’t even got the old ones all compiled into one place yet.
Once the Super Dictionary is reasonably complete, the task becomes sorting and arranging the information and definitions into the most useful, learner-friendly format possible. Then various printed books can be produced if there’s interest.
So how can we 1) get that information, 2) arrange it in order of usefulness? I’ve got a few ideas.
- I’m a horrible businessman. (As proof: there are no advertisements on this blog and I give all my music away for free.) I don’t have any plan for how I, personally (nor anyone else, for that matter), can make money off the community project described below. I don’t claim any ownership of the ideas, data, information structure, or processes described below. All I care about is getting the information to the masses (and going on record as being the one to propose this project). If I can participate in the project in some way, I’d be delighted.
- I’m a volunteer editor for the MDBG dictionary (although I’ve been pretty uninvolved recently).
- I’ve recently published a book with CreateSpace at my own expense (coming out soon).
- I have no other affiliations with any of the other individuals or companies mentioned below. None of the individuals or companies mentioned (including MDBG and CreateSpace) have agreed to participate in a project like the one described below, nor have any individuals or companies expressed any approval or endorsement of the ideas presented in this article. They are simply cited as examples of popular online services that could be included in a community project in the future if they agreed to do so.
What We Need
Look at all the info the Longman Dictionary gives English learners who look up “drink”.
There are more definitions and sentence examples that I cropped off.
This serves as an excellent model of a very useful learner’s dictionary. We need something equally useful for our Chinese/English Super Dictionary.
1. Pinyin
We need pinyin (and variations for the pinyin) for every word and sentence example. Many dictionaries neglect the pinyin for sentence examples. I’ve never quite been able to figure out why they give pinyin for the headword, but leave it out for the usage examples.
2. Part of Speech
“Chinese words don’t have parts of speech” is a common myth. Sure, some words function as verbs, nouns, and adjectives, but that’s true in English as well. And sometimes there is a clear difference in Chinese. For example, “héshì” 合适 (meaning “suitable”) usually functions as an adjective and “shìhé” 适合 (meaning “to be suitable for”) as a verb.
Chinese parts of speech may not directly correspond to English ones but we need them labeled anyway. For example, Chinese particles like “le” 了, “ba” 吧, “ne” 呢, etc. don’t have English equivalents. But we still need them indicated as “particle” (or something) in the dictionary.
Adsotrans seems to have the most data about parts of speech at the moment, but it still needs work and could benefit from a team of users and editors constantly improving it in the way I’m suggesting. For example, at the time of this writing the two entries for “jiǎ” 甲 (meaning “1st, finger nail,” etc.) list the parts of speech as “NOUN” and “HEAVENLY,” respectively.
3. Extra Info about Part of Speech
Learners of English need to know if a verb is transitive or intransitive, how the verb is conjugated into different tenses, and whether a noun is countable or uncountable. Learners of Chinese need to know, for example, what measure words are associated with which nouns, and which category a verb falls into (“stative, activity, achievement” are the three categories given by Claudia Ross).
4. Frequency Data
The Longman entry for “Bicycle” has this icon meaning it’s in the top 3000 written words yet “bike” has this icon indicating it is in the top 2000 spoken words. So English learners can assume that “bike” is less formal and “bicycle” is more formal. Now the decision of which synonym to use in which context has been made easier: use “bike” when talking to your friends and use “bicycle” when writing a business contract or even a note.
Something like that would certainly help us sort through the ocean of synonyms in Chinese.
5. Sentence Examples
7. He is fond of a dram. 他喜欢喝一点酒。(Jukuu)
1. We can’t solve these problems through arranging dinners and parties. 这些问题不是靠吃吃喝喝就能解决的。(nciku)
Not only do these examples fail to use the word “drink,” notice that pinyin is not given at JuKuu and is only available on mouse hover at nciku (see number 1 in this section).
There is a new community project called tatoeba that links sentence examples from a bunch of languages together. They seem to be much more on the right track. The first three sentence examples from this search for “drink” give excellent usages of the word “hē” 喝 and provide pinyin and hanzi.
6. Other Info (not pictured)
In addition to the examples in the above image from Longman, we learners of Chinese need the following information in our Super Dictionary:
- Regional differences in usage and pronunciation (蜗牛 wōniú / guāniú [Taiwan] = snail).
- Category and meta tags (chuānghu 窗户 = window [building, vehicle]; chēchuāng 车窗 = window [vehicle]; shìchuāng 视窗 = Windows [computer operating system]).
- Formality tags (qīzi 妻子 = wife [formal]; lǎopó 老婆 = wife [informal]).
- Traditional / Simplified variants for characters (中国 [simplified] / 中國 [traditional]).
- R-hua (érhuà) 儿化 variants and whether that changes the meaning (tóu 头 = head / hair; tóur 头儿 = leader).
- HSK levels and TOP levels for learners who are preparing for those tests.
- Literal breakdown of hanzi (fēijī 飞机 = airplane [fly machine]).
- Radical breakdown of characters (机 = 木 + 几) and what each radical means (if anything).
MDBG and nciku have many of those things. But regional differences, frequency (or even popularity) ratings, and all the parts of speech and category meta tags are still either completely absent or incomplete.
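To make the wish list above a bit more concrete, here is a rough sketch of what a single Super Dictionary record might look like. Every class and field name here (`Entry`, `Sense`, `regional_pinyin`, and so on) is hypothetical, and the part-of-speech labels are my own guesses. It is just one possible way to hold the tags described above, not the format of MDBG, nciku, or any other existing dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Sense:
    english: str
    tags: List[str] = field(default_factory=list)  # category / formality tags

@dataclass
class Entry:
    simplified: str
    traditional: str
    pinyin: str
    part_of_speech: str
    senses: List[Sense] = field(default_factory=list)
    regional_pinyin: Dict[str, str] = field(default_factory=dict)
    literal: Optional[str] = None  # literal hanzi breakdown, e.g. "fly machine"

# Sample records built from the examples in the list above.
entries = [
    Entry("妻子", "妻子", "qīzi", "noun", [Sense("wife", ["formal"])]),
    Entry("老婆", "老婆", "lǎopó", "noun", [Sense("wife", ["informal"])]),
    Entry("蜗牛", "蝸牛", "wōniú", "noun", [Sense("snail")],
          regional_pinyin={"Taiwan": "guāniú"}),
    Entry("飞机", "飛機", "fēijī", "noun", [Sense("airplane")],
          literal="fly machine"),
]

def words_for(entries, english, tag=None):
    """Pinyin of every entry with a sense matching `english` (and `tag`, if given)."""
    return [
        e.pinyin
        for e in entries
        if any(s.english == english and (tag is None or tag in s.tags)
               for s in e.senses)
    ]

print(words_for(entries, "wife"))              # both words for "wife"
print(words_for(entries, "wife", "informal"))  # only the informal one
```

Once the tags live in structured fields like these, a question such as “which word for wife is informal?” becomes a simple filter instead of a guess.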
It’s too much to ask any one individual, or even any one company, to do all this work. That’s why I propose a system of uniting everyone together to collect and constantly improve on the data we all need.
How We Can Get It
Because of the complexity of the project I’m envisioning, I’ve created a little diagram to use as a guide for discussing the various points.
1. Users register for Super Dictionary
Currently, the main online dictionaries don’t require registration for use. I think that’s fine. But there should be an opt-in system for allowing users to help improve the dictionary. As long as users understand the reasoning behind the registration and have assurances that private data will be protected, I think people would be willing to pitch in to improve the world of Chinese learning.
NOTE: This diagram is only showing what Chinese learners would do. But getting native Chinese speakers (learners of English) involved is essential too. Maybe each user’s first language (and region) should be part of the registration process to round out the data. I haven’t thought that all through yet.
2. They are associated with a region
The regions on the map (image credit) are simply an example. Rather than choose pre-determined regions and ask users to “pick the kind of Chinese you want to learn,” I think it would be better for users to agree to say where they are learning Chinese and then let the computer extrapolate regions from the data later. For example, if users say which cities they’re in, the computer can then look for patterns and see which cities (and then larger regions) use which words.
Also, users who are not in a Chinese-speaking environment should indicate which region’s Chinese they are most exposed to. For example, if my wife were from Beijing, I’d list my region as “Beijing” even though I might be living in Denmark. If my Chinese teacher at California University is from Guangzhou, I’d put “Guangzhou” as my region.
It wouldn’t be completely useful all the time, but the important thing would be to start collecting the information and then let the computer start looking for patterns.
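As a very rough illustration of “letting the computer look for patterns,” the sketch below groups cities by the word their users report most often. The city names are placeholders and the reports are invented, so please don’t read any real regional claims into them. A real system would need far more data and a smarter clustering method than this simple majority count.

```python
from collections import Counter, defaultdict

# Invented reports: (city, word the user heard for "spoon").
reports = [
    ("City A", "勺子"), ("City A", "勺子"), ("City A", "调羹"),
    ("City B", "勺子"), ("City B", "勺子"),
    ("City C", "调羹"), ("City C", "调羹"), ("City C", "勺子"),
]

def top_word_by_city(reports):
    """Map each city to the word reported most often there."""
    counts = defaultdict(Counter)
    for city, word in reports:
        counts[city][word] += 1
    return {city: c.most_common(1)[0][0] for city, c in counts.items()}

def group_cities(reports):
    """Group cities that share the same most-reported word into one 'region'."""
    groups = defaultdict(list)
    for city, word in top_word_by_city(reports).items():
        groups[word].append(city)
    return {word: sorted(cities) for word, cities in groups.items()}

print(group_cities(reports))
```

The point is that nobody has to draw the region boundaries in advance. They fall out of the data once enough users have said where they are.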
3. Their usage data goes into the database
Raw Popularity Weighting
The system should watch search queries and the user won’t have to do anything different for the computer to start learning. For example, I’m in Guangzhou and I search the dictionary for “pāituō” 拍拖. The computer should remember that someone searched for “pāituō” 拍拖 and also that the person was in Guangzhou.
Then, the user can help even more. When I see the results for “pāituō” 拍拖, I realize that the context that I heard it in was “to court” or “to date”. So I can click a little link on that definition that tells the computer: “This is the meaning I heard for that word.” So the computer can start to figure out that “to court” is a popular definition for “pāituō” 拍拖.
This could also help show where new phrases start from. For example, I’ve heard that “pāituō” 拍拖 is being used more and more all over the country. A system of regional tracking could corroborate the hypothesis that it originated in Cantonese-speaking Guangdong province.
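Here is a minimal sketch of the bookkeeping described above: one counter for searches by city and one for the “this is the meaning I heard” clicks. The class and method names are hypothetical, and a real service would store this in a database rather than in memory.

```python
from collections import Counter

class UsageLog:
    """In-memory stand-in for the Super Dictionary's usage tables (hypothetical)."""

    def __init__(self):
        self.searches = Counter()     # (word, city) -> number of searches
        self.sense_votes = Counter()  # (word, sense) -> number of confirmations

    def record_search(self, word, city):
        # Happens automatically; the user does nothing different.
        self.searches[(word, city)] += 1

    def confirm_sense(self, word, sense):
        # The user clicked "this is the meaning I heard" on one definition.
        self.sense_votes[(word, sense)] += 1

    def ranked_senses(self, word, senses):
        """Order definitions by confirmations, most confirmed first."""
        return sorted(senses, key=lambda s: -self.sense_votes[(word, s)])

    def searches_by_city(self, word):
        return {city: n for (w, city), n in self.searches.items() if w == word}

log = UsageLog()
log.record_search("拍拖", "Guangzhou")
log.record_search("拍拖", "Guangzhou")
log.record_search("拍拖", "Beijing")
log.confirm_sense("拍拖", "to date")
log.confirm_sense("拍拖", "to date")
log.confirm_sense("拍拖", "to court")

print(log.ranked_senses("拍拖", ["to court", "to date"]))
print(log.searches_by_city("拍拖"))
```

Nothing gets deleted here. Unpopular definitions simply sink to the bottom of the list, which is the weighting-instead-of-removing idea from earlier in the article.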
Spoken vs. Written Popularity
It can go one step further by offering two icons, let’s say an ear and an eye. Now, if I heard “pāituō” 拍拖 in a spoken context (if it came up in conversation, for example), I click the ear icon above “to date”. So the computer learns that it should give that definition more of an oral weighting.
Conversely, when I look up “sell” in English, the computer makes note that “sell” has been searched for. Then I see the definitions and click the ear for “mài” 卖 and the eye for “shòu” 售. That means that “mài” 卖 is used more for oral Chinese and “shòu” 售 is more for written Chinese.
Of course, many times, especially when going from English to Chinese, users won’t know which word is used in more spoken or written contexts. That’s fine. You don’t have to click anything.
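The ear and eye clicks could be tallied along these lines. This is only a sketch: the function names are mine, and the thresholds (at least 3 votes, a 60/40 split) are arbitrary assumptions that would need tuning against real data.

```python
votes = {}  # word -> {"ear": count, "eye": count}

def vote(word, icon):
    """Record one click on the ear (heard it) or eye (read it) icon."""
    if icon not in ("ear", "eye"):
        raise ValueError("icon must be 'ear' or 'eye'")
    counts = votes.setdefault(word, {"ear": 0, "eye": 0})
    counts[icon] += 1

def register(word, min_votes=3):
    """Label a word from its votes; thresholds are illustrative guesses."""
    counts = votes.get(word, {"ear": 0, "eye": 0})
    total = counts["ear"] + counts["eye"]
    if total < min_votes:
        return "unknown"  # not enough clicks to say anything yet
    spoken_share = counts["ear"] / total
    if spoken_share >= 0.6:
        return "mostly spoken"
    if spoken_share <= 0.4:
        return "mostly written"
    return "both"

for _ in range(3):
    vote("卖", "ear")
    vote("售", "eye")
vote("售", "ear")  # one dissenting click does not flip the label

print(register("卖"), "/", register("售"), "/", register("走"))
```

Words that nobody has voted on just stay “unknown,” which matches the idea that users never have to click anything.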
If a user searches for something that’s not in the dictionary (for example “velcro” or “gěilì” 给力), the computer will show “no results found” but should send a note to the experts so they can determine whether it should be added or not. According to this, nciku already has some of this sort of system in place but not as sophisticated as the one I’m proposing.
4. Experts add data common users can’t
Finding the experts will be the biggest problem. I think academic institutions and volunteers could be enlisted, but that’s its own issue.
Assuming for a moment there is a team of experts, they will serve mainly to correct and enhance the data that are coming from all the common users. The experts could be associated with a region as well to monitor how geography affects their decisions.
They would be able to add (or approve) tags to entries regarding any of the following:
- Written variations (traditional / simplified, alternative characters, etc.)
- Spoken variations (zhè / zhèi for 这; wōniú / guāniú for 蜗牛)
- Formal / informal register tags (qīzi 妻子 [formal]; lǎopó 老婆 [informal])
- Part of speech tags (héshì 合适 [adjective] ; shìhé 适合 [verb])
- Sentence examples that are useful, common, and not too long.
BONUS: Corpus Data
What we really need is some objective data for written and spoken frequency. Longman put together their own corpus to get frequency data.
Frequency of single Chinese characters (zì 字) in written material (like newspapers) has been done for a long time (at least since 1993). But we at least need research on the top 1000, 2000, 3000 multi-character words (cí 词) both spoken and written.
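Once counts from a spoken and a written corpus exist, turning them into “top 1000 / 2000 / 3000” labels is the easy part. The sketch below shows one way to do it. The “S” and “W” prefixes, the function names, and the tiny counts are all assumptions of mine for illustration. To keep the toy example readable, the band size is set to 1 word instead of 1000.

```python
def frequency_band(rank, band_size=1000, max_bands=3):
    """Return 1 for the top band, 2 for the next, and so on; None past the last band."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    band = (rank - 1) // band_size + 1
    return band if band <= max_bands else None

def ranks(counts):
    """Rank words by descending count (ties broken by the word itself)."""
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}

def label(word, spoken_counts, written_counts, band_size=1000):
    """Build a label such as 'S1 W3' from spoken and written corpus counts."""
    parts = []
    for prefix, counts in (("S", spoken_counts), ("W", written_counts)):
        rank = ranks(counts).get(word)
        if rank is not None:
            band = frequency_band(rank, band_size)
            if band is not None:
                parts.append(f"{prefix}{band}")
    return " ".join(parts)

# Made-up counts, not real corpus data.
spoken = {"卖": 50, "喝": 40, "售": 2}
written = {"售": 30, "喝": 20, "卖": 10}

print(label("卖", spoken, written, band_size=1))
print(label("售", spoken, written, band_size=1))
```

The hard part, of course, is getting trustworthy corpus counts in the first place, especially for spoken Chinese.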
I’ve recently become aware of Jun Da’s corpus data that does take into account multi-character words. Perhaps someone like Professor Hongyin Tao at UCLA or Dr. Richard Xiao at the University of Central Lancashire could be persuaded to join the team. Xiao compiled The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC), a collection of natural and scripted Chinese conversations and transcripts. Professor Tao was also heading up work on the UCLA Chinese Corpus. Work on both projects seems to have stopped in 2008. However, in 2009 Xiao and a few other editors published a Chinese dictionary with frequency data! I haven’t seen it myself, but according to the reviews on Amazon, it’s not as useful as it could have been (for example, it’s missing pinyin for sentence examples). But it’s a step in the right direction.
If Xiao’s, Tao’s and Jun Da’s data were combined (just to name a few) we could have a huge, useful bank of info that could provide guidance to the users of the Super Dictionary. Would they allow that? What’s in it for them to team up? I’m not sure what their goals are so I can’t answer that.
Also, the challenge of getting a spoken corpus will be great. I think QQ chat transcripts might be useful for that, but who’s going to opt in for sharing those?
5. Online / Mobile Services Use the Database
The Skritter guys had to do a lot of work trimming down the dictionary because the MDBG database they imported wasn’t exactly what they wanted. Now, if I want to start my own website I have to “reinvent the wheel” and start from scratch. Why not let the work that the Skritter guys have done benefit everyone in the future as well? If they’re willing to make their data available, they could provide weighting information (rather than removing things that someone might want to add back in later) so that the Super Dictionary is smarter about which definitions are more useful or important to them.
But it’s not just for services who’ve created their own dictionary (like Skritter has) to contribute back to the Super Dictionary. Let’s take Pleco and their iPhone dictionary as an example. The newest version of Pleco allows you to point your iPhone’s camera at any hanzi text and it’ll translate it for you (see cool video demo). So how about letting the Pleco users opt in to send that info back to the Super Dictionary? Then we’d start to see data about how many users scanned which characters and we’d start to know that those characters are at least in common use.
The Super Dictionary database would also benefit from knowing that there are 2,000 Skritter users who have such-and-such character in their vocab list, but only 20 users who have this other character.
It would need to be determined exactly how to use the data, but it could only help.
I just had a conversation with Ben Whately, co-founder of Memrise, and he said he was getting ready to import a dictionary database because they have more important things to focus their energy on than making a dictionary (they’re compiling some very exciting data that I hope to discuss in a future post). The problem he’s facing is whether to use Adsotrans (which is maintained by David Lancashire of Popup Chinese) or MDBG/CC-CEDICT. I told him I didn’t really think either was good enough on its own.
What he really needs is the Super Dictionary that includes both plus all the work that the Skritter guys and other people have done. And wouldn’t it be great if Popup Chinese also shared the data about which words they used in their learning materials? They wouldn’t have to make the learning materials public, just the data.
But would companies see that as helping their competitors? I don’t know. My hope is that if the Super Dictionary were available for free, and everyone were contributing to making it better and better, it would make new companies focus on offering new services and new tools rather than making their own dictionaries. Instead of a lot of different wheels being invented, we’d start to see a lot of new vehicles (and maybe some toys!) using the same wheels.
6. Paper Books Can Be Printed
CreateSpace has the following characteristics that make self-publishing a paper dictionary extremely desirable:
- Very high quality printing and many paperback book size options available.
- All you have to do is upload a PDF and they make the book. You buy the first book (called a proof) for about $5-7 and then the book is available on Amazon for the world.
- It’s completely free for the author to set up a book. Various add-on options are available for a price, but CreateSpace makes its money each time the book sells rather than when it’s set up. Even the ISBN is provided free by CreateSpace. You only pay for the book itself when it’s printed.
- It takes about one week from the time you upload the PDF to the time the book is ready to print. That means if you want to update the book, you upload a new PDF and a week later, the new version of the book is ready. You can do that as many times as you want.
- The book goes directly on to Amazon so that others can benefit from it as well. You set the price of the book (above a certain minimum price based on the cost of production). CreateSpace prints and ships the book within 24 hours of an order made on Amazon.
- An infinite number of dictionaries with various options can be printed as long as there’s a PDF for each one.
That means that as long as there’s something built into the Super Dictionary that allows users to select various options and output a PDF, the information doesn’t have to stay locked into an online format. The options could include the following:
You would be able to choose how many headwords you want in the book. You would also be able to choose how many definitions you want for each word. For example, if you just want a little travel dictionary, you might go with only the most frequently used and popular words and definitions. In this case you’d have the definition for “jiǎ” 甲 be only “one; armor; nail (finger or toe)”. But if you’re going to be doing some sort of scholarly project, you might want the full list of definitions.
You could also produce a special dictionary for certain regions. In other words, if you’re going to Taiwan, you could have it just use the words that are commonly spoken in Taiwan. It would save on pages and then you’d be reasonably certain that you’ll be able to say what you want to say for your area. You could also choose to have traditional characters, simplified, or both.
If you want to save more pages, you could opt out of having sentence examples included. If you did want some, you could choose how many sentences examples to include and only use ones that have been marked as popular or useful.
If you’re a learner of Chinese, you’d want to have pinyin for everything (including sentence examples). But you could choose to have the Chinese side ordered by hanzi or pinyin (most dictionaries order by hanzi but that’s not necessarily the easiest thing for learners).
If you’re a learner of English, you might want to save pages by eliminating pinyin entirely. But you might appreciate English IPA included. That could be one of the options.
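Here is a small sketch of how a few of those options might drive the selection step before a PDF is generated. Everything here is hypothetical: the function name, the field names, and especially the weights, which I made up. The output is just text lines that some PDF generator would then lay out.

```python
def build_edition(entries, max_headwords, max_definitions, with_examples=False):
    """Pick the highest-weighted headwords and definitions for one print edition."""
    chosen = sorted(entries, key=lambda e: e["weight"], reverse=True)[:max_headwords]
    lines = []
    for entry in chosen:
        # Each definition is a (text, weight) pair; keep only the top few.
        best = sorted(entry["definitions"], key=lambda d: d[1], reverse=True)
        text = "; ".join(d for d, _ in best[:max_definitions])
        lines.append(f'{entry["pinyin"]} {entry["hanzi"]}: {text}')
        if with_examples:
            lines.extend("    " + ex for ex in entry.get("examples", []))
    return lines

# Made-up weights for illustration only.
entries = [
    {"hanzi": "甲", "pinyin": "jiǎ", "weight": 0.5,
     "definitions": [("first of the ten heavenly stems", 0.3),
                     ("one", 0.9), ("armor", 0.8),
                     ("nail (finger or toe)", 0.7),
                     ("civil administration unit (old)", 0.1)]},
    {"hanzi": "喝", "pinyin": "hē", "weight": 0.9,
     "definitions": [("to drink", 1.0)]},
]

# A little travel dictionary: few headwords, at most three definitions each.
for line in build_edition(entries, max_headwords=2, max_definitions=3):
    print(line)
```

A scholarly edition would simply call the same function with larger limits, and regional or pinyin options could be added as further filters in the same way.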
I’m not aware of anything like the proposed system for the Super Dictionary.
Just to summarize, I think the innovations with this system would be:
- Combining dictionary data from competing, or at least separate, companies into one master database.
- Adding regional data based on user location.
- Weighting headwords and definitions for popularity based on user searches and direct “thumbs up” style voting.
- Infinite, customizable print dictionaries created from the data.
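As a rough illustration of the second and third points, the sketch below (in Python) keeps a count of searches and “thumbs up” votes per word and per region, then combines them into a single score. The class name `PopularityTracker`, the weights, and the use of a logarithm to dampen very large counts are all my own assumptions, not a worked-out statistical method; someone who knows statistical theory would likely choose something better.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


class PopularityTracker:
    """Hypothetical per-region popularity scores from searches and votes."""

    def __init__(self, search_weight: float = 1.0, vote_weight: float = 2.0):
        # Assumption: a deliberate thumbs-up counts for more than a search.
        self.search_weight = search_weight
        self.vote_weight = vote_weight
        self.searches: Dict[Tuple[str, str], int] = defaultdict(int)
        self.votes: Dict[Tuple[str, str], int] = defaultdict(int)

    def record_search(self, word: str, region: str) -> None:
        self.searches[(word, region)] += 1

    def record_thumbs_up(self, word: str, region: str) -> None:
        self.votes[(word, region)] += 1

    def score(self, word: str, region: str) -> float:
        key = (word, region)
        # log1p dampens large counts so one heavily searched word
        # does not drown out everything else.
        return (self.search_weight * math.log1p(self.searches.get(key, 0))
                + self.vote_weight * math.log1p(self.votes.get(key, 0)))

    def rank(self, words: List[str], region: str) -> List[str]:
        """Order candidate words for one region, most popular first."""
        return sorted(words, key=lambda w: self.score(w, region), reverse=True)


# Made-up usage data for two words meaning "taxi".
tracker = PopularityTracker()
for _ in range(2):
    tracker.record_search("計程車", "Taiwan")
tracker.record_thumbs_up("計程車", "Taiwan")
tracker.record_search("出租車", "Taiwan")
print(tracker.rank(["出租車", "計程車"], "Taiwan"))
```

The same scores could feed the print options, so that a regional dictionary lists the locally preferred word first.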
Wikipedia is a good example of the power of collaboration, but there’s nothing built in to allow third parties to use the data and then contribute data back into the database. Wikipedia also has the support of its own foundation (the Wikimedia Foundation), which we don’t have. And Wikipedia offers printed books through a company (PediaPress) that handles all the printing details. We don’t have anyone like that helping us either. But we don’t really need it, because CreateSpace is so easy to use that anyone can make a book! However, there are a few issues to solve before this can all come to be.
Who would “own” the Super Dictionary? There are “copy-left” licenses that could be applied, but decisions will have to be made, and some thought needs to be given to the legal side of this project. Also, participating users and companies will need to know what their rights are.
This sort of project will require a team of very smart people who are good at not just programming, website design, and user interface, but also data management and statistical theory, and who have some savvy about China and the Chinese language. I can’t do it. I’m not sure any one person can. And even if we found one person who could do it all, would he or she work on it for free?
Even if we found a team of talented, motivated people who would volunteer their time for this project, the servers and bandwidth still cost money.
Nicholas Carr, former executive editor of The Harvard Business Review, said in an NPR interview about Wikipedia and user-created web content: “Pretty much the only business model in what’s called ‘web 2.0’ is to get as many people as possible to look at your site and then feed them advertisements.” Carr seems to be saying that the future of making money off information lies not in owning the information itself but in advertising revenue. I’m not sure whether that would be enough.
An academic institution or foundation that just has a bunch of money lying around would be a great solution to the problem. Anyone know someone like that?
I would hate to see collaboration with other companies discouraged by licensing disputes or by making the partner companies shoulder the financial burden of the project. If some companies were willing to chip in and it didn’t discourage their participation, then that would be fine. But I think some thought needs to be given to the bottom line.
The success of the project depends largely on how many partnering companies can be brought together to share their vocabulary usage data (point number 5 on the diagram). All businesses must ask the question “But what’s in it for me?” I’m not sure my answers are good enough for the bottom line: the quality of your product will improve as the database improves. Also, it might be good press for participating companies. I’d love to see some sort of logo that gets slapped on each participating website that shows they’re using and contributing to the Super Dictionary database. If users were educated about what it meant, it would mean the customers would have more confidence in joining a service that’s participating rather than one that’s independent.
But should all companies who want to participate be allowed? Who’s going to screen them? How will their data be used / weighted? These are all problems to solve.
If the Super Dictionary does lead to a print book (or many print books) as I hope it will, how will that come to be? How will the PDFs required to print the books be generated? Where will the revenue from the printed books go?
There’s no reason why only English should be used for the Super Dictionary. But making a Chinese-multilingual dictionary is a much bigger project than just Chinese/English. Still, with the right talent on board, it might make more sense to design it to accommodate other languages from the beginning, so that even if it starts out as only Chinese/English, it could be expanded to Chinese/you-name-it more easily.
So what to do now? Any suggestions? You’re welcome to leave comments here on this blog. Or someone could start a Google Group or something. I’ll help if I can.