Tamil English Multilingual Web Interface Design

By Natkeeran Ledchumikanthan

Introduction

Imagine yourself without knowledge of English or other modern languages. How would you use a computer? How could you e-mail a friend, browse the Internet, or listen to MP3s online? You could not. In some cases the limitation extends to entire societies that lack the information and communication technology (ICT) infrastructure needed to support computing in their own languages. Like reading, writing, and numeracy, computing is a sphere of modern life that no one can afford to ignore. Thus, there is an urgent need to develop and adapt technologies that support multilingual computing.

Today's information and communication technologies (ICTs) were primarily developed by and for English speakers. English is the de facto international language. It is the language of science, technology, commerce, diplomacy, and popular culture. The prevailing dominance of America, England, Canada, and Australia in today's world order will sustain and expand English. However, the world is diverse, with thousands of languages. Cultural and linguistic diversity must be preserved, as these languages possess "solutions to problems that have not yet been resolved in the metropolitan societies and languages" [G1]. Valuable knowledge and culture are invested in each language, and a language cannot be preserved and expanded without adequate information and communication technology infrastructure.

The United Nations Educational, Scientific and Cultural Organization (UNESCO) reports on and responds to the need to develop multilingual ICT. It reports that "Thousands of languages worldwide are absent from Internet content and there are no tools for creating or translating information into these excluded tongues. Huge sections of the world’s population are thus prevented from enjoying the benefits of technological advances and obtaining information essential to their wellbeing and development."[G2]

UNESCO tries to respond to this need with its "Recommendation concerning the Promotion and Use of Multilingualism and Universal Access to Cyberspace", addressed to its member governments. In addition, projects such as "Initiative B@bel" and the Community Multimedia Centers develop ICTs.

Even healthy languages with numerous speakers and vast resources, such as Hindi, Chinese, Arabic, and Russian, are slow to adapt to technological developments, while thousands of minor languages simply do not possess any capacity. This is particularly true in India, which is home to about 850 languages, including 18 official languages. For instance, the majority of Indian websites are in English. Moreover, "Brahmi-origin scripts users in South-East Asia and Indic scripts users occupy 22% of the world population have just 0.3% of Internet access." [G3]. Although the limited access is largely due to economics, language is certainly a significant barrier.

Within this wide social, economic, and cultural context lies the technological issue of developing multilingual (Tamil/English) multimedia web interfaces, which is the focus of this study project. Human-computer interaction is perhaps the core issue in multilingual computing, and the interface is the junction of that interaction. Thus, focusing on interface design leads to the exploration of a variety of technological issues related to multilingual computing in a guided and finite manner.

This study project consists of two main parts. Part I surveys Tamil Computing as a case study of multilingual computing (with emphasis on multimedia issues). The purpose of Part I is to become aware of the issues in multilingual computing. Part II builds upon that awareness and explores more specific language-related issues in building a multilingual web interface by implementing a Tamil-English web interface.

Part I: Overview of Tamil Computing as a case study of Multilingual Computing

In this part of the project, the term "Tamil Computing" is first defined. Second, the general technological issues are identified and categorized, along with the organizations and people working in those areas. Moreover, resources such as web sites, books, research papers, and software are catalogued.

Tamil and Tamil Computing

Tamil is an Indian classical language. More than 74 million [G4] people around the world use Tamil. It is an official language of the state of Tamil Nadu (India), Sri Lanka, Singapore, and Malaysia. Significant Tamil populations reside in Reunion, Mauritius, South Africa, the Middle East, Europe, the US, and Canada (Toronto). Thus, Tamil is one of the vibrant, internationally spoken languages of the world. Refer to Table 1 in the appendix for population details.

Tamil's origins are described in mythology. The foundational grammatical work "Tolkaappiyam", which mainly outlines the Tamil linguistic structure, dates back to 200 BC. Tamil has a rich and continuous literary tradition, and retains a large portion of Indian thought and culture.

Considering its history, the richness of its literature, its number of speakers, and its international and modern presence, Tamil is one of the major languages of the world. Tamil cannot be compared to marginal languages such as the indigenous languages of North America and Africa, or hundreds of other Indian languages. The problems that Tamil Computing faces are multiplied several times over in the context of marginal languages.

Tamil Computing refers to interacting with a computer in Tamil, or to developing human-computer interfaces that enable the user to interact with a computer in Tamil. Although that definition may seem limited or narrow, in essence that is the prime objective of Tamil Computing.

At the electrical signal level computers are language blind; they operate on ones and zeros. But from machine language onwards, computers become increasingly language conscious. Assemblers, compilers, and higher-level languages conventionally use ASCII for their source text. This facilitates sorting and lower-level English language processing.

A novice Tamil Computing enthusiast usually asks why it is not possible to just build everything from scratch: perhaps a new computer platform with a Tamil operating system, a Tamil intranet, and so forth. Such a scheme is not far-fetched for large, resource-rich languages such as Chinese, Japanese, Arabic, or Russian. But for Tamil the economic and technological viability is just not there. Adaptation to the existing platforms, infrastructure, and standards is the key strategy for Tamil Computing.

In an effort to understand the issues in Tamil Computing, a non-empirical survey of technologies identified the following twenty key technologies. They are interlocking technologies and are not necessarily listed in order of importance. The highlighted technologies are discussed in more detail below, and certain others are discussed in Part II of the project.

Twenty Key Technologies for Enabling Multilingual Computing

1. Font Processing: Encoding, Decoding, Display, Printing
2. Tamil Internet Technologies (Web Sites, Email, Chat, Browsers, etc.)
3. Searching, Storage, Indexing, and Retrieval
4. Tamil Speech Recognition and Processing
5. Free and Open Source Development
6. Natural Language Processing
7. Machine Translation
8. Human Computer Interaction and Tamil Interface Design
9. Tamil Desktop Environment
10. Tamil Character Recognition
11. Speech to Speech Translation
12. Tool Development (Software, Hardware) (Simputer, Suratha, tools.tamizmanam.com)
13. Standards
14. Localization, Internationalization, and Multilingual Computing
15. Blogging Technologies
16. Content Development and Management Technologies (Wiki, Digitization, Digital Library)
17. Mass Applications
18. Tamil Multimedia Content Development and Development Tools (movies, animations)
19. Expert Systems and Scientific and Mathematical Computing Software in Tamil
20. Cell Phone and Handheld Language Rendering Technologies

Font Processing

The first frontier in multilingual computing is inputting a language, doing some basic manipulation, rendering it on screen, and printing it. The strategies used to overcome this challenge point to the general methods of adapting the standard English-dominated computing environment to a multilingual environment. Computer users are familiar with QWERTY keyboards (statistically, Dvorak keyboards are better) for typing in English. Each key on the keyboard corresponds to a character in the English alphabet, and the average user seldom thinks about the process. Now consider languages such as Chinese with 4000 ideograms, Tamil with 247 characters, and Arabic, which makes heavy use of ligatures; how can one use QWERTY to input these languages? [F1] Modifications to the English encoding scheme are necessary to enable font processing in other languages.

There are 247 Tamil characters: 12 vowels, 18 consonants, 216 compound characters, and one special character. 192 compound characters can be expressed by combining 10 additional glyphs with the consonant forms. There are 36 compound characters which have unique forms derived from the consonants. Thus, according to current conventions, Tamil requires (12+18+10+1+36)=77 unique forms to encode [F2]. This is particularly challenging to adapt to the conventional English QWERTY keyboard, with its basic 26-letter English encoding.
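
The character counts above can be checked with a short sketch. It assumes the base letters of the Unicode Tamil block (not the 77-form glyph scheme of [F2]) and builds the traditional alphabet by combination: each consonant takes a dot mark (pulli) in its pure form, and each compound character is a consonant base plus an optional dependent vowel sign. The variable names are illustrative only.

```python
# Sketch: enumerate the 247 traditional Tamil characters from Unicode code points.
VOWELS = [chr(c) for c in (0x0B85, 0x0B86, 0x0B87, 0x0B88, 0x0B89, 0x0B8A,
                           0x0B8E, 0x0B8F, 0x0B90, 0x0B92, 0x0B93, 0x0B94)]
CONSONANT_BASES = [chr(c) for c in (0x0B95, 0x0B99, 0x0B9A, 0x0B9E, 0x0B9F, 0x0BA3,
                                    0x0BA4, 0x0BA8, 0x0BAA, 0x0BAE, 0x0BAF, 0x0BB0,
                                    0x0BB2, 0x0BB5, 0x0BB4, 0x0BB3, 0x0BB1, 0x0BA9)]
# The inherent vowel "a" needs no sign; the other 11 vowels each have a dependent sign.
VOWEL_SIGNS = [""] + [chr(c) for c in (0x0BBE, 0x0BBF, 0x0BC0, 0x0BC1, 0x0BC2,
                                       0x0BC6, 0x0BC7, 0x0BC8, 0x0BCA, 0x0BCB, 0x0BCC)]
PULLI = chr(0x0BCD)   # dot that marks a pure consonant
AYTHAM = chr(0x0B83)  # the one special character

consonants = [base + PULLI for base in CONSONANT_BASES]
compounds = [base + sign for base in CONSONANT_BASES for sign in VOWEL_SIGNS]
alphabet = VOWELS + consonants + compounds + [AYTHAM]

print(len(VOWELS), len(consonants), len(compounds), len(alphabet))  # prints: 12 18 216 247
```

Unicode needs far fewer than 247 code points because it stores the combining pieces rather than every finished character.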

Currently there are three popular encoding schemes: TAB, TAM, and TSCII. However, there is bickering over which one to adopt as the standard. As of now there exists no universally accepted Tamil encoding standard. This is one of the Achilles' heels of Tamil computing. The arrival of Unicode was meant to resolve this issue; however, serious technical shortcomings in Tamil Unicode (UTF-8) have generated a lot of grievances and resulted in the proposal of a more effective Tamil UTF-16 standard. In short, Tamil does not have a single encoding standard, and this is a serious social and technical shortcoming that is hampering the development and widespread use of Tamil on computers and the Internet.
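
One concrete storage difference between the Unicode encoding forms can be observed directly. The sketch below uses only Python's built-in codecs and a sample word chosen for illustration; whether this difference is the exact grievance raised by critics is not stated in the sources surveyed here.

```python
# Sketch: byte cost of the same Tamil word under two Unicode encoding forms.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the word "Tamil" written in Tamil script

utf8_bytes = word.encode("utf-8")       # Tamil code points take 3 bytes each
utf16_bytes = word.encode("utf-16-le")  # 2 bytes each, no byte-order mark

print(len(word), len(utf8_bytes), len(utf16_bytes))  # prints: 5 15 10
```

An 8-bit scheme such as TSCII would store one byte per glyph, which is part of why storage efficiency features in the encoding debate.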

A popular method of keyboarding Tamil on a conventional keyboard is a romanized transliteration scheme. Tamil is a phonetic language and thus lends itself easily to transliteration. The de facto standard Tamil transliteration scheme is the one employed by Suratha in his online Tamil Writer. The advantage of the scheme is that if you have learned English typing, you do not have to learn Tamil typing all over again. The disadvantage is that it seriously limits the speed at which Tamil can be typed, because it does not take into consideration the statistical frequencies of Tamil characters.
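
The idea behind romanized input can be sketched as a greedy longest-match conversion. The mapping table below is a small hypothetical subset invented for illustration; it is not Suratha's actual scheme, and the function names are assumptions.

```python
# Sketch: greedy longest-match romanized-to-Tamil transliteration (hypothetical mapping).
PULLI = "\u0bcd"
CONSONANTS = {"k": "\u0b95", "t": "\u0ba4", "n": "\u0ba8", "p": "\u0baa", "m": "\u0bae",
              "y": "\u0baf", "r": "\u0bb0", "l": "\u0bb2", "v": "\u0bb5", "zh": "\u0bb4"}
# Each vowel maps to (independent letter, dependent sign); "a" has no sign.
VOWELS = {"a": ("\u0b85", ""), "aa": ("\u0b86", "\u0bbe"), "i": ("\u0b87", "\u0bbf"),
          "ii": ("\u0b88", "\u0bc0"), "u": ("\u0b89", "\u0bc1"), "uu": ("\u0b8a", "\u0bc2"),
          "e": ("\u0b8e", "\u0bc6"), "ee": ("\u0b8f", "\u0bc7"), "ai": ("\u0b90", "\u0bc8"),
          "o": ("\u0b92", "\u0bca"), "oo": ("\u0b93", "\u0bcb"), "au": ("\u0b94", "\u0bcc")}


def _match(text, pos, table):
    """Return the longest key of `table` found at `pos`, or None."""
    for length in (2, 1):
        key = text[pos:pos + length]
        if key in table:
            return key
    return None


def transliterate(text):
    out, pos = [], 0
    while pos < len(text):
        cons = _match(text, pos, CONSONANTS)
        if cons:
            pos += len(cons)
            vowel = _match(text, pos, VOWELS)
            if vowel:  # consonant + vowel: attach the dependent vowel sign
                out.append(CONSONANTS[cons] + VOWELS[vowel][1])
                pos += len(vowel)
            else:      # bare consonant: mark it with the pulli
                out.append(CONSONANTS[cons] + PULLI)
            continue
        vowel = _match(text, pos, VOWELS)
        if vowel:      # vowel with no preceding consonant: independent letter
            out.append(VOWELS[vowel][0])
            pos += len(vowel)
        else:          # anything unmapped passes through unchanged
            out.append(text[pos])
            pos += 1
    return "".join(out)


print(transliterate("tamizh"), transliterate("ammaa"))
```

Note that a long vowel costs two keystrokes in this kind of scheme, which is one way the typing speed limitation mentioned above arises.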

Rendering Tamil on screen has also been challenging, as Tamil is a “grapheme”-based language. For ligature-oriented languages such as Arabic and Hindi, rendering poses additional challenges. Software is necessary to support rendering as well as encoding. In sum, Unicode seems to be the optimal solution for multilingual computing, provided it preserves the integrity and efficient processing of all languages involved.
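
What "grapheme-based" means for software can be illustrated by grouping code points into the units a reader perceives as characters. This is a simplified sketch that attaches every combining mark to the preceding base letter; it is an assumption-laden shortcut, not the full Unicode text segmentation algorithm.

```python
# Sketch: group code points into user-perceived characters (simplified rule).
import unicodedata


def clusters(text):
    """Attach each combining mark (Unicode category M*) to the preceding character."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # "Tamil" in Tamil script
print(len(word), len(clusters(word)))  # prints: 5 3
```

A renderer, cursor, or delete key that works per code point rather than per cluster will split Tamil characters in the middle.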

Tamil Internet Technologies

During the 1980s, Tamil fonts were developed, and various schemes were used to encode and display Tamil in word processing applications. During the early days of the Internet, before the web, there were Tamil people using the Internet, but not in Tamil. (The Usenet group tamil.soc.culture.net was notable. Some transliterated work was posted on the Internet, even a magazine called “aa”.)

The release of the World Wide Web in 1991 by Tim Berners-Lee ushered in the Internet revolution. English was the de facto language of the Internet. The adaptation of Tamil to the Internet can be divided into three developmental phases.

Phase 1: Save Tamil content as an image and display it as an image.
Phase 2: Use Tamil fonts, followed by dynamic fonts (which helped less tech-savvy users).
Phase 3: Unicode (UTF-8, followed by UTF-16).

In the first phase, Tamil content was saved as images and rendered as such. The second and more foundational phase was driven primarily by the pioneering work of Naa Govindasamy. His conference paper “Towards a Total Internet Solution for the Tamil Language through Singapore Research” summarizes this work. He used Tamil fonts to view Tamil web pages. Dynamic fonts followed, which made it possible for less tech-savvy users to view Tamil pages. The current phase is the use of Unicode. Today, if Tamil is enabled in Windows XP, or with additional software support, one can e-mail, browse, or chat in Tamil. Computers and the Internet are now aware of multiple languages. Issues related to the Internet and Tamil are discussed further in Part II.

Searching, Storage, Indexing, and Retrieval

One of the primary uses of computers is the inexpensive, accurate, and accessible storage and retrieval of information. The types of information stored (text, images, audio, media, multimedia), the ways of storing it (files, databases, spaces), and the ways of retrieving it (indexing, directories, search) are continuously evolving.

“Indexing deals with finding representation for information in a document, and organizing that representation to facilitate efficient search” [Guan]. The traditional indexing and retrieval methods are filing and bibliographic systems. The file system is the main paradigm of information organization in the computer. Many of the techniques involved in file management in computers are analogous to real-world file management processes. File management involves naming files, organizing files into folders, and tasks such as copying, appending, deleting, and rearranging files and folders. These processes are mainly manual. Databases evolved to meet specialized data management needs. Bibliographic systems can be seen as a special extension of file management or databases, as they seek to index and retrieve information about books, collections, audio, media, or multimedia.

Finding information based only on file names or bibliographic information became limiting. Text-based searching and retrieval techniques evolved to aid in locating relevant information stored in files and databases. “Searching deals with capturing and presenting an information needed, and assessing its relevance” [Guan]. Text-based searching allows queries or keywords to be matched against the content of files. The three main limitations of content-based searching and retrieval are semantic ambiguity, synonyms (leading to irrelevant retrieval), and “false drops”. In English, these problems have been somewhat overcome by more intelligent search algorithms using methods such as “knowledge based systems, fuzzy systems, neural network learning, and relevance feedback”. Incorporating metadata has also enhanced search. This leads to the next generation of searching algorithms, termed concept-based or semantic retrieval. Semantic search and retrieval techniques will further enhance the search of text and multimedia information.

Searching and retrieval algorithms are heavily language dependent, and their efficiency can be improved by exploiting language features. Thus, search remains a vital area for Tamil Computing development. Algorithmic, font, NLP, infrastructure, and interface issues need to be considered in Tamil search engine design. It is notable that the AU-KBC Research Centre recently demonstrated a Tamil search engine.
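
The basic indexing and text-matching steps described in this section can be sketched as a small inverted index. This is a minimal illustration with hypothetical sample documents, not the design of any existing Tamil search engine. It normalizes text to Unicode NFC so that equivalent code point sequences compare equal, and splits on whitespace because Python's regular expression class \w does not match Tamil combining vowel signs.

```python
# Sketch: a tiny inverted index over Unicode Tamil text.
import unicodedata
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Normalize to NFC, split on whitespace, and trim surrounding punctuation."""
    normalized = unicodedata.normalize("NFC", text)
    tokens = (word.strip(PUNCTUATION) for word in normalized.split())
    return [token for token in tokens if token]


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token of the query."""
    hits = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*hits) if hits else set()


documents = {  # hypothetical sample documents
    1: "தமிழ் கணினி",
    2: "தமிழ் இணையம்.",
    3: "English only",
}
index = build_index(documents)
print(search(index, "தமிழ்"), search(index, "இணையம்"))
```

A production engine would add language-specific steps on top of this, such as morphological analysis, which is where the NLP work discussed below comes in.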

Moreover, ancient Tamil manuscripts are being scanned and stored as images in databases by universities and other organizations. How do we recover information from the manuscripts stored in these databases? Presumably, the designers have worked out an efficient indexing and retrieval scheme. However, content-based image search mechanisms can enhance such a scheme. If we want to know about a particular subject, that subject can be queried, and the search algorithm can search the content of the images to recover the required information. Please refer to the references for resources related to Tamil search.

Tamil Speech Recognition

Speech transcription, interactive voice response systems, voice-enabled computer interfaces, speaker and language recognition systems, and speech-to-speech machine translation systems are becoming standard applications. Speech recognition is an enabling technology for these applications. Voice recognition for English is a standard software application. However, speech recognition in Indic languages, and particularly in Tamil, is still in development.

As noted by [S2], four main approaches to speech recognition exist: (1) template-based approaches, (2) knowledge-based approaches, (3) stochastic approaches, and (4) connectionist approaches. In the first approach, the input speech signal is compared with the signals stored in a lookup table, and the closest match is selected. In the second approach, linguistic databases (word corpora, lexicons) are incorporated in an effort to recognize speech. Stochastic approaches “exploit the inherent statistical properties of the occurrence and co-occurrence of individual speech sounds” [S2]. The fourth approach uses “networks of a large number of simple, interconnected nodes which are trained to recognize speech” [S2]; for example, neural networks can be trained over time to recognize speech.
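
The template-based approach can be sketched with dynamic time warping (DTW), a common way to compare an input against stored templates when the two are spoken at different speeds. DTW is a technique chosen here for illustration and is not named in [S2]; the one-dimensional feature sequences and the labels are invented toy data.

```python
# Sketch: template matching with dynamic time warping on toy feature sequences.
def dtw_distance(a, b):
    """Cost of the best monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(signal, templates):
    """Return the label of the stored template closest to the signal."""
    return min(templates, key=lambda label: dtw_distance(signal, templates[label]))


templates = {  # hypothetical lookup table: label -> feature sequence
    "a": [0, 1, 2, 3, 2, 1, 0],
    "i": [0, 3, 0, 3, 0, 3, 0],
}
slow_a = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]  # the "a" pattern, spoken more slowly
print(recognize(slow_a, templates))  # prints: a
```

Real systems compare multi-dimensional acoustic feature vectors rather than single numbers, but the lookup-and-match structure is the same.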

The first and the third methods extensively use signal processing techniques. The basic idea is the “extraction of information from the acoustical speech wave” [S2] to recognize the speech content. There are various techniques [S3] that can accomplish that task:

Waveform analysis (correlation and covariance analysis can be used alongside)
Pitch analysis (uses frequency of the signal)
Spectrogram and waterfall spectrogram analysis (uses amplitude, frequency, and time information in the analysis)
Formant analysis
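
As an illustration of the pitch and correlation techniques listed above, the sketch below estimates the fundamental frequency of a signal by autocorrelation. It is a minimal example on a synthetic tone; the function name, the search range of 50 to 400 Hz, and the sample rate are assumptions made for the example, not values taken from [S3].

```python
# Sketch: pitch estimation by autocorrelation on a synthetic tone.
import math


def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=400.0):
    """Return the frequency whose period maximizes the signal's autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        score = sum(samples[n] * samples[n + lag] for n in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


rate = 8000  # samples per second
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]  # 200 Hz, 0.1 s
print(estimate_pitch(tone, rate))  # prints: 200.0
```

Recorded speech is noisier than a pure tone, so practical pitch trackers add windowing and smoothing on top of this core step.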

Moreover, it is important to take the features of the language into account when deciding upon a technique for speech recognition. A number of features of the Tamil language make it theoretically easier for voice recognition. In general, Tamil is a "spell as it sounds" language. That is, Tamil is a phonetic language: "words are written as they are pronounced, and pronounced as they are written." This means there is only one way to pronounce A or AA, and each corresponds to a different letter in the Tamil alphabet. Many English voice recognition schemes do not use the phone as the base unit of recognition, because the same English characters can represent different phonemes. However, each Tamil character has (in general) a unique vocal sound and thus can be used as a base unit for speech recognition. If efficient Tamil speech recognition is developed, it can enhance Tamil interface development.

Tamil Natural Language Processing

The foundation of Tamil Computing depends on the development of natural language processing capabilities. Natural language processing (NLP) refers to developing computational techniques to model, analyze, and generate natural languages. The AU-KBC Research Centre is the principal, and perhaps the only, research institution working on Tamil in this area. Critical applications such as human-computer interaction, searching, machine translation, and various others depend heavily on the development of NLP. The centre has developed, among other tools, the following: Parse Representation of Tamil Syntax, Tamil Morphological Analyser, Development of Lexical Resources, and Electronic Ontological Representation of Tamil Vocabulary (Tamil WordNet).

Machine Translation

Text-to-text and speech-to-speech machine translation of natural language has been achieved between several languages. For instance, Google provides translation between European languages. Also, Japanese websites can be machine translated into English. However, machine translation of Indic languages is not yet available, and the work towards it has not resulted in any mainstream applications. To achieve machine translation, the natural language processing capabilities of a language must mature. For the Tamil community, machine translation between the following pairs seems to be of prime importance: Tamil-English, English-Tamil, Tamil-Hindi, Hindi-Tamil, Sinhala-Tamil, and Tamil-Sinhala. (Only a very limited treatment of NLP and machine translation is provided, as they do not directly relate to web interface design; however, this area is of prime importance for Tamil Computing.)

Tamil Free and Open Source Software

Richard Stallman founded the Free Software Foundation in 1985 to develop "free software". "Free software gives everyone the permission to run the program, copy the program, modify the program, and distribute modified versions-but not permission to add restrictions of their own"[pg59]. The Free Software Foundation’s project to build a free operating system was the GNU, or “GNU’s Not Unix”, project. Some people from the Free Software Foundation wanted to dissociate themselves from the perceived association of “free software” with anti-business attitudes, opposition to intellectual property rights, and communism, and thus founded the Open Source Initiative to appeal to business.

Free and open source software is fundamental to developing multilingual or Tamil Computing software: not just the software itself and the method of software creation, but also the overall philosophy. Rather than waiting for large corporations to respond to the needs of various linguistic communities, which they may be structurally incapable of meeting, free and open source software allows communities to respond to their own needs.

The Tamil Nadu Chapter of the Free Software Foundation was inaugurated by Richard Stallman at the Swatantra conference. The free/open concept has been chosen as the vehicle for Tamil Computing development. In Tamil, the Tamizaha Group led by Mugunth, the Tamil Linux Group led by V. Venkataramanan, and the Zh Group have all been localizing and developing Tamil Computing software in the spirit of free software, but without strict definitions around licensing and methods. Now there is an effort to formally define or adopt the Free Software General Public License, and to commit more vigorously to such software projects.

The GNU project has led to the development of various software, software development tools, and applications, including compilers, debuggers, editors, GNU/Linux, and GNOME. Various efforts have been made to translate and localize software into Tamil. Among the pioneering efforts was the translation of KDE (K Desktop Environment) into Tamil. Work was also undertaken to enable various Linux distributions to support Tamil. Although initial work was undertaken to localize GNOME, there seems to be no activity in that area at the moment. The main Tamil free and open source software resources are summarized under References.

Part I Summary

In Part I of the project, twenty key Tamil Computing technologies were identified. Font processing, Internet technologies, indexing and retrieval, Tamil speech recognition, Tamil natural language processing, and machine translation were discussed in more detail. Further, the importance of free and open source software and development methods to Tamil Computing was noted. Web sites related to these fields are listed in the references.

Part II: Language Issues in Multilingual Web Interface Design

The World Wide Web first reached the public Internet in 1993. It became the catalyst for the commercialization and exponential growth of the Internet. During the early 1990s, web design depended largely on HTML; however, the technology and art of web design have quickly matured to the point where degrees are granted to people specializing in web design. Part II of this report explores specific technical details associated with building a multilingual web interface. The discussion is based on the experience of implementing a Tamil web interface with an experimental setup for testing new web technologies.

Requirements

Some of the basic and new web interface technologies are HTML, Unicode, JavaScript, CSS, RSS content feeds, blogging, wikis, XML, databases (PHP-MySQL), forums, search, security, multimedia content, and shopping carts. With limited time and expertise, the target was not to test all technologies, but to test a minimum of five features, and others as time and resources permitted. Moreover, the code was to be validated against W3C or other standards.

Generic Web Site Development Process

There exists a generic web site development process, which involves planning, design, implementation, testing, and maintenance. Professional web developers also have optimization and promotion phases. During planning, the objectives of the web site, a vision of the end product, and the content should be developed. Web site elements such as the homepage, navigation, interactivity, graphics, and network and server setup are worked out during the design phase. Implementation involves tool selection and coding. Then, testing is conducted to ensure, among other things, correctness, efficiency, and visual presentation. Proper testing can help one avoid broken links, typos, browser incompatibilities, and slow downloading. Most web sites are dynamic; the content and the technologies need to be updated to remain current and relevant. When designing multilingual web sites, language issues need to be considered throughout the development process. Before discussing the language issues involved in web interface design, a short discussion of user interfaces is provided.

Web User Interface

“In computer science and human computer interaction the user interface (of a computer program) refers primarily to the graphical and textual information the program presents to the user, and the control sequences (such as keystrokes with the computer keyboard and movements of the computer mouse) the user employs to control the program” [wiki]. During the early days of the Internet, the interface was mainly command-line based. The introduction of the WWW and browsers led to graphical user interfaces. The conventional windows-menus-mouse desktop interface was adapted to the web interface, with added functionality made possible by web technologies such as the hyperlink.

User interface technologies are advancing rapidly. For instance, Archy by Jef Raskin, the concept of “Lifestreams” [U1] by David Gelernter, and multimodal interfaces are examples of cutting-edge interface design. According to its creators, Archy extends the zooming feature to allow for “direct manipulation” of the content instead of the icons used in traditional user interfaces [U2]. David Gelernter’s “lifestream is a sequence of all kinds of documents” arranged chronologically, which can be browsed and retrieved based on content. The documents need not be named or arranged hierarchically. Further, he believes that many web sites will be designed as “lifestreams”, and that blogs are a partial implementation of that concept. Multimodal web interfaces combine mixed input schemes such as speech, text, and gestures (using mouse, keyboard, touch screen, camera, etc.) with mixed or multimedia output schemes (virtual worlds, animated agents). In sum, web interfaces are becoming more aware of the user, more responsive, more human-like, and more intelligent.

The aim of the project is not to implement a novel web interface, but to implement a conventional web interface in Tamil. The basic user interface philosophy for the project can be summed up as: functionality, simplicity, elegance, minimalism, and user help.

Language Issues in Implementing a Multilingual Web Interface

Although implementing a Tamil web site seems technically trivial, it is not trivial in practice. There are only about 20 notable Tamil web sites with a high degree of functionality and, according to the dmoz.org Tamil directory editor, about 250 quality Tamil web sites. A Google search for the keyword “Tamil” returns mostly Tamil-related English web sites. As stated before, the limited presence of Tamil and other Indic languages is partly due to economic conditions (low penetration of the Internet among the Tamil population), but also due to language-related technical issues.

Language is an issue throughout the development process. It is a factor in tool selection (editor, browser), font processing, getting user input, server and network design, database design, and overall aesthetics.

Domain Name

First of all, the domain name cannot be registered in Tamil. (A purely Tamil name is not necessary for a multilingual page, but it may be preferable for a Tamil-only web site.) Currently the Internet Corporation for Assigned Names and Numbers (ICANN), which oversees domain names, has created the Multilingual Internet Names Consortium (MINC) to provide support for multilingual domain names. An INFITT working group is working towards enabling .tamil registration. For this project the domain name www.tamilscitech.org was registered with 1&1 Internet Inc. for 5.99 for one year.
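
Where multilingual domain names are supported, the non-ASCII label is carried over the ASCII-only domain name system in an encoded form. The sketch below assumes Python's built-in idna codec (which follows the older IDNA 2003 rules) and a hypothetical Tamil label; it only shows the encoding step and does not imply that such a name could be registered at the time of the project.

```python
# Sketch: ASCII-compatible encoding of a hypothetical Tamil domain label.
label = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # "Tamil" in Tamil script, used as a sample label

ascii_label = label.encode("idna")     # bytes starting with the "xn--" prefix
restored = ascii_label.decode("idna")  # back to the Tamil form for display

print(ascii_label, restored == label)
```

The browser shows the Tamil form to the user while the encoded form is what is actually looked up.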

Font Processing: Input and Rendering

A major issue in Tamil web interface design is deciding upon a font encoding scheme. The issues related to font encoding were discussed above. Using Unicode enables standard rendering or output, but receiving input remains a challenge. The issue is how to type Tamil into a form, see the Tamil as you type, and transfer that information for storage or processing. Enabling Unicode for viewing the web site does not automatically support Unicode input; additional font processing software is required on the user's side. E-Kalappai, Murasu, and Kural Tamil Software are keyboard managers that enable Tamil input. The challenge, however, is to enable Tamil input without requiring additional software. To this end, Suratha’s transliteration scheme and Eelam Writer are useful. The shortcoming of Suratha’s scheme is that it does not function properly with Firefox. Moreover, Sun Microsystems is developing a Unicode table-based input scheme as part of its “Internet/Intranet Input Method Framework” (IIIMF). This, however, was not tested during the project. In the implementation, Suratha’s open source code was modified to enable input; however, some troubleshooting remains before the input data can be fully processed.
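
The "transfer" step mentioned above can be illustrated with the way form data travels in a URL query string: the Tamil text is encoded as UTF-8 bytes and each byte is percent-escaped. This is a minimal sketch using Python's standard library; the field name q is a hypothetical example, and the project's own form handling is not documented here.

```python
# Sketch: round trip of Tamil form input through a percent-encoded query string.
from urllib.parse import parse_qs, urlencode

typed = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # what the user typed into the form

query = urlencode({"q": typed})    # what is sent: ASCII only, UTF-8 percent-escaped
received = parse_qs(query)["q"][0]  # what the server recovers

print(query)
print(received == typed)  # prints: True
```

The round trip only works if both ends agree on UTF-8, which is one reason a page-wide Unicode declaration matters.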

UTF-8 encoded HTML Web Pages

First, set the character set inside the <head></head> element, for example with <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.

It is also recommended to set the language tag as follows: <html lang="ta-IN">. Here ta stands for Tamil, and IN stands for India. With these two settings, one can create Unicode-enabled multilingual web pages.
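
A minimal sketch of a script that produces such a page is given below. The page template, the function name, and the sample greeting are assumptions for illustration, not the project's actual pages; the key points are the charset declaration in the head, the lang attributes, and saving the bytes as UTF-8 to match the declaration.

```python
# Sketch: generate a small UTF-8 encoded Tamil/English HTML page.
import html

PAGE = """<!DOCTYPE html>
<html lang="ta-IN">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>{title}</title>
</head>
<body>
<p lang="ta">{tamil}</p>
<p lang="en">{english}</p>
</body>
</html>
"""


def render_page(title, tamil, english):
    """Fill the template (escaping HTML special characters) and encode as UTF-8."""
    text = PAGE.format(title=html.escape(title),
                       tamil=html.escape(tamil),
                       english=html.escape(english))
    return text.encode("utf-8")


greeting = "\u0bb5\u0ba3\u0b95\u0bcd\u0b95\u0bae\u0bcd"  # a Tamil greeting
page_bytes = render_page("Tamil / English", greeting, "Welcome")
print(len(page_bytes))
```

The declared charset and the actual byte encoding must agree; a page declared as UTF-8 but saved in a legacy font encoding will render as garbage.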

Server and Network Decision

Web sites are stored on the servers of an Internet service provider (commonly referred to as web hosting). Thus, the type and level of server-side support is a critical element of web site development. Support for databases, content management technologies, programming languages, and multiple languages, as well as storage capacity and price, needs to be considered. Tamilzmana.com is a popular Tamil blog aggregation site with a high degree of functionality. It is hosted by Ardent Hosting. A closer look at the Ardent Hosting web site showed that it supported the various features that the project set out to explore; thus it was selected to host the web site.

Tool Selection

In designing a multilingual web interface, tool selection becomes a vital issue. Not all tools popular in the English environment support multilingual development. For instance, the popular HTML editor HTML Kit does not provide Unicode support for Tamil. Microsoft supports Unicode, and MS FrontPage was selected as the main HTML or web page editor. The disadvantage of MS FrontPage is that it does not provide validation support.

Internationalization and Localization

Internationalization is the process of writing software such that the interface does not depend on a specific language or cultural conventions. The language-dependent interface strings are separated out to enable localization. In this project, DokuWiki is a good example of software that has been internationalized: the language-dependent source files are placed in a separate folder. Localization refers to creating language-specific files so that the software can be used in a particular language. In the case of DokuWiki, Tamil language files are not provided with the standard download; however, a partly translated language file is available. This project sought to provide a complete translation; however, this was not completed by the time of the demo.
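
The separation of interface strings from program logic can be sketched as follows. The string keys, the table layout, and the fallback rule are hypothetical and do not reproduce DokuWiki's actual language file format; the point is that a partly translated language still yields a usable interface.

```python
# Sketch: per-language string tables with fallback for missing translations.
STRINGS = {  # hypothetical stand-in for separate per-language files
    "en": {"btn_save": "Save", "btn_search": "Search"},
    "ta": {"btn_search": "\u0ba4\u0bc7\u0b9f\u0bc1"},  # partial Tamil translation
}


def translate(key, lang, fallback="en"):
    """Look up a string in `lang`; fall back to `fallback`, then to the key itself."""
    text = STRINGS.get(lang, {}).get(key)
    if text is None:
        text = STRINGS[fallback].get(key, key)
    return text


print(translate("btn_search", "ta"))  # Tamil string from the Tamil table
print(translate("btn_save", "ta"))    # prints: Save (not yet translated)
```

Completing a localization then means filling in the missing keys of one table, with no change to the program code.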

Tamil Interface Conventions or Philosophy

English user interfaces have established conventions for terms and features. Tamil interfaces are still evolving and therefore lack such conventions. Direct translation of terms or interface conventions from English to Tamil does not always yield satisfactory results, so best practices from existing Tamil interfaces need to be identified and promoted.

Content Aggregation

Most of the current Tamil content on the web is made available by bloggers, and usually it is free to reuse if proper credit is given. For this project most of the content was aggregated using RSS syndication. RSS syndication relies on Unicode rendering, so only content available in Unicode can be syndicated this way. This is another reason why Unicode is preferable to local Tamil font encoding schemes.
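A minimal sketch of reading syndicated items is given below, using Python's standard xml.etree.ElementTree module. The feed content, the example.org links, and the helper read_items are made up for illustration; the project's actual aggregation tooling is not documented here, and a real aggregator would download the feed from a blog.

```python
# Sketch: extract (title, link) pairs from a UTF-8 encoded RSS 2.0 feed.
# SAMPLE_FEED and read_items are hypothetical examples, not the project's code.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>தமிழ் வலைப்பதிவு</title>
    <item>
      <title>வணக்கம்</title>
      <link>http://example.org/post/1</link>
    </item>
    <item>
      <title>Unicode and Tamil</title>
      <link>http://example.org/post/2</link>
    </item>
  </channel>
</rss>
""".encode("utf-8")


def read_items(feed_bytes: bytes) -> list:
    """Parse the feed bytes and return a list of (title, link) tuples."""
    root = ET.fromstring(feed_bytes)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in read_items(SAMPLE_FEED):
        print(title, link)
```

Because the parser decodes the feed according to its declared encoding, Tamil titles arrive as ordinary Unicode strings and can be placed into a page without any font-specific conversion.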

Databases

Although building a database-driven directory was part of the initial project features, it was not realized. Building a database-driven Tamil web site requires expertise in database development and in database web interface development. With proper expertise, building a database with a Tamil interface is a standard process.
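Since the directory was not built in this project, the following is only a hypothetical sketch of what storing and retrieving Tamil text in a database might look like, using Python's built-in sqlite3 module. The table name, columns, sample entries, and helper functions are all assumptions for illustration.

```python
# Hypothetical sketch of a small Tamil site directory; not part of the project.
import sqlite3


def build_directory(entries):
    """Create an in-memory database holding (name, url) directory entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sites (name TEXT NOT NULL, url TEXT NOT NULL)")
    # Parameter binding passes Unicode strings through unchanged.
    conn.executemany("INSERT INTO sites (name, url) VALUES (?, ?)", entries)
    conn.commit()
    return conn


def find_url(conn, name):
    """Return the URL stored for an exact site name, or None if absent."""
    row = conn.execute("SELECT url FROM sites WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    sample = [("தமிழ் நூலகம்", "http://example.org/library"),
              ("Tamil Linux", "http://example.org/linux")]
    db = build_directory(sample)
    print(find_url(db, sample[0][0]))
```

The point of the sketch is that, once text is handled as Unicode end to end, Tamil names are stored and looked up like any other string.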

Implementation Screenshots

Figure 1: Home Page

Figure 2: RSS Content

Figure 3: DokuWiki

Figure 4: Possible Input Scheme

Conclusion and Future Work

The project was undertaken because of the lack of documentation or clarity on how to design simple Tamil web sites or interfaces. It soon became clear that the lack of adequate multilingual support on the Internet was not just a Tamil Computing issue, but a concern for all languages except English. Tamil Computing, which in essence deals with developing technologies to enable interaction between computers and people in Tamil, became a good case study for multilingual computing.

In Part I of the project various technological issues in Tamil Computing were surveyed and summarized. The primary issues are font processing, interface design, storage and retrieval, natural language processing, and machine translation. Unicode (UTF-8) seems to be the standard international solution for font processing; however, its Tamil implementation has serious shortcomings that result in inefficient storage and processing.
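One way to see the storage cost is to count code points and UTF-8 bytes for short strings, as in the Python sketch below. The helper storage_cost and the sample strings are chosen for illustration; the sketch covers only the storage side of the concern and is not a full account of the shortcomings surveyed in Part I.

```python
# Sketch: compare the number of code points and UTF-8 bytes for short strings.
# storage_cost is a hypothetical helper written for this illustration.

def storage_cost(text: str) -> tuple:
    """Return (number of Unicode code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


# The single written syllable "ko": KA (U+0B95) followed by vowel sign O (U+0BCA).
KO = "\u0b95\u0bca"
# The word "Tamil" in Tamil script: TA, MA, vowel sign I, LLLA, virama.
TAMIL = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"

if __name__ == "__main__":
    for label, sample in (("ko", KO), ("Tamil (Tamil script)", TAMIL), ("Tamil (Latin)", "Tamil")):
        points, size = storage_cost(sample)
        print(f"{label}: {points} code points, {size} bytes in UTF-8")
```

Every code point in the Tamil block occupies three bytes in UTF-8, and one written syllable may need more than one code point, so the Tamil spelling of a word takes several times the space of its Latin transliteration.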

Interface design, storage, and retrieval issues can be adequately addressed if the software is designed with internationalization and localization in mind. Natural language processing and machine translation require fundamental research, so more resources need to be allocated to these areas to advance Tamil Computing. Part I of the project provided the awareness of issues in Tamil Computing that made the task of building a Tamil web interface straightforward.

In Part II of the project a simple Tamil web interface was implemented with due consideration to language issues. The generic web site development process consists of planning, design, implementation, testing, and maintenance, and language should be factored in at each phase. Web site development is a skill and an art. Depending on the requirements, building a web site can vary in complexity. From the implementation it can be noted that Unicode and internationalized software make building highly functional Tamil or multilingual web sites a standard process. However, mechanisms for receiving user input without additional software requirements from the user remained a challenge. The implementation provided content aggregated from RSS, sought to localize DokuWiki (not completed), and explored ways to receive input in Tamil using JavaScript/Perl based schemes.

The project provided an opportunity to analyze and document some of the activity in Tamil Computing. The information and experience gathered is expected to be relevant to Tamil Computing enthusiasts. In the future, further research and documentation can be undertaken in the areas identified in Part I. In addition, further functionality or features can be added to the web site.

References and Resources

General References

[G1] http://faculty.ed.umuc.edu/~jmatthew/unesco.html

[G2] http://portal.unesco.org/ci/en/ev.php-URL_ID=18147&URL_DO=DO_TOPIC&URL_SECTION=201.html#obs

[G3] Om Vikas. “Human Computer Interface”, Apr 22, 2005. <http://www.elitexindia.com/paper.asp#3>

[G4] http://www.ethnologue.org/show_language.asp?code=TCV

Font Processing

[F1] http://indic-computing.sourceforge.net/handbook/tutorial.html

[F2] http://kasi.thamizmanam.com/wiki/doku.php?id=ukara_reform

Searching, Storage, Indexing, and Retrieval Links

http://www.au-kbc.org/research_areas/nlp/demo/tse/

http://www.jaffnalibrary.com/tools/google.htm

http://www.google.com/intl/ta/

http://www.au-kbc.org/research_areas/nlp/projects/sengine.html

Tamil Speech Recognition References and Resources:

[S1] Uros Rapajic, "An Introduction to Multi-Lingual Speech Recognition", Oct 10, 2004, <http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol1/ur1/article1.html#History>

[S2] “Speech Recognition”, Oct 12, 2004, <http://www.hltcentral.org/htmlengine.shtml?id=827>

[S3] “Speech Analysis Tutorial”, Oct 09, 2004, <http://www.ling.lu.se/research/speechtutorial/tutorial.html>

Speech Analysis and Synthesis Overview http://www.icsi.berkeley.edu/eecs225d/spr01/lectures/lect_1_26.pdf

Speech Signal Processing http://www.hltcentral.org/htmlengine.shtml?id=825

Berkeley EECS225d Spring 2001 lecture slides http://www.icsi.berkeley.edu/eecs225d/spr01/lects.html

Spoken Language Systems http://www.limsi.fr/Recherche/TLP/theme4.html

Papers related to voicemail transcription http://www.research.ibm.com/voicemail/vmailpapers.html

Formant Analysis and Vowel Detection http://cnx.rice.edu/content/m11731/latest/

Main Free and Open Source Tamil Communities

Chris DiBona (Eds). (1999). Open Sources: Voices from the Open Source Revolution. London: O’Reilly.

Tamil Linux/Unix Enthusiasts and Developers http://groups.yahoo.com/group/tamilinix/

Thamizh Developers http://groups.yahoo.com/group/ThamiZhaDeveloper/ http://thamizha.com/modules/news/

Venkat http://www.thamizhlinux.org/main/

Tamil Free Software http://barathee.beigetower.org/html/index.php http://swatantra.info/

zhakaNini or Tamil PC http://www.zhakanini.org/index.php

Linux User Groups in Tamil Nadu http://www.linux.org/groups/india/tamil_nadu.html http://groups.yahoo.com/group/ilug-cbe/ http://www.asiaosc.org/enwiki/page/Tamil.html http://www.chennailug.org/index.php

Localization and Internationalization Groups http://www.tenet.res.in/Donlab/Indlinux/ http://ta.openoffice.org/index.html http://translation.sourceforge.net/cgi-bin/registry.cgi?team=ta http://pootle.wordforge.org/ta/

Open Source Tamil and Indian Companies http://www.chennaikavigal.com/aboutus.htm http://mm.gnu.org.in/pipermail/fsf-india/2002-March/002267.html

K Desktop Environment (100% Tamil Support) V. Venkatarmanan & Vaseeharan Group http://www.tldp.org/HOWTO/Tamil-Linux-HOWTO/x36.html http://www.thamizhlinux.org/main/

GNOME Desktop Environment (Ramanan Selvaratnam, Dinesh Nadarajah) http://savannah.gnu.org/projects/gnome-tamil/ http://tamilgnome.sourceforge.net/index.php?page=howto-translate

Tamil Status and Linux Versions

Linux Version: GNU/Linux; Tamil Support: http://www.thamizhlinux.org/main/; Developers: V. Venkatarmanan; Resources: http://barathee.beigetower.org/html/index.php

Linux Version: Mandrakelinux; Tamil Support: 48% (as of Feb 26, 2005); Developers: Badri Seshadri; Resources: http://www.mandrakelinux.com/l10n/ta.php3

Linux Version: RedHat; Tamil Support: 98%; Developers: zh Group; Resources: http://www.zhakanini.org/redhat.php

Linux Version: Debian; Tamil Support: N/A; Developers: Ganesan Rajagopal?; Resources: http://packages.debian.org/testing/gnome/tamil-gtk2im

Web User Interfaces

[U1] http://www.edge.org/documents/archive/edge70.html

[U2] http://rchi.raskincenter.org/aboutarchy/demos.html