Digital Library of India

Open Access News points us to the Digital Library of India, whose goal is to digitize significant literary, artistic, and scientific works for free distribution and appreciation. Unlike the Carnegie Mellon Universal Library (which helps to coordinate the DLI), the DLI focuses primarily on works in Indian languages. Books are scanned and readable as page images, and are made available as text via optical character recognition, or OCR. This presents some interesting challenges:

  • There are 1500 spoken Indian languages and 17 scripts.

  • Unlike English, where fewer than 100 characters have to be recognized, Indian scripts have several hundred characters to be recognized.

  • Non-uniformity in the spacing of the characters within a word because of the presence of Consonant Conjuncts (vowel + consonant) makes OCR more difficult. Also, the presence of Consonant Conjuncts results in improper line segmentation. Programs will have to do further processing to segment the lines.

  • Consonants take modified shapes when attached to vowels. Vowel modifiers can appear to the right, on the top or at the bottom of the base consonant. Such consonant-vowel combinations are called modified characters. In addition, two, three or four characters can combine to generate new complex shapes called compound characters. These characters are very difficult for a machine to recognize.

  • In scripts like Bangla and Devanagari, all the characters in a word are connected by a single line called the shirorekha (also called the head line). In these scripts, character segmentation is especially difficult.

  • In south Indian scripts, vowels occur only at the beginning of a word, in contrast to Oriya, where they can occur anywhere within a word. So, the language morphology for some groups of scripts is different from the others.

  • There is no universally acceptable standard encoding scheme for Indian scripts. This necessitates a scheme where the output labels from the OCR system can be mapped to the labels used by the typesetter through a mapping table.
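The mapping-table idea in that last point can be sketched in a few lines of Python. This is only an illustration, not the DLI's actual scheme: the OCR label names (`ka`, `conj_ksha`, and so on) are made up, and Unicode Devanagari code points stand in for whatever labels a given typesetter expects.

```python
# Hypothetical OCR class labels mapped to a target encoding.
# Assumption: the target is Unicode Devanagari; a real typesetter
# could use a different label set, which would mean a different table.
OCR_TO_TARGET = {
    "ka": "\u0915",        # DEVANAGARI LETTER KA
    "ssa": "\u0937",       # DEVANAGARI LETTER SSA
    "virama": "\u094D",    # DEVANAGARI SIGN VIRAMA
    "sign_aa": "\u093E",   # DEVANAGARI VOWEL SIGN AA
    # A compound character recognized as one shape can expand
    # to several target labels.
    "conj_ksha": "\u0915\u094D\u0937",
}


def map_labels(labels, table=OCR_TO_TARGET, unknown="\uFFFD"):
    """Translate a sequence of OCR output labels into target-encoding text.

    Labels missing from the table become the replacement character,
    so gaps in the table are visible instead of silently dropped.
    """
    return "".join(table.get(label, unknown) for label in labels)


if __name__ == "__main__":
    print(map_labels(["ka", "sign_aa"]))
    print(map_labels(["conj_ksha"]) == map_labels(["ka", "virama", "ssa"]))
```

Keeping the table separate from the recognizer means the same OCR output can be retargeted to another encoding by swapping the dictionary.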

At this point, they've scanned about 100,000 books -- 10% of their eventual goal, a million books available to anyone, anywhere, with a web connection.

About

This page contains a single entry from the blog posted on June 23, 2004 5:39 PM.
