Chapter 9
Mining Verbal Data from Early Bengali Newspapers and Magazines
Contemplating the Possibilities
Purbasha Auddy
OCR: A Salient Tool for DH
Optical character recognition (OCR) is an important tool for mining verbal data from a digital image. Extracting verbal data enables further analysis, thereby laying the ground for digital humanities (DH) research. While OCR technologies have developed satisfactorily for Latin scripts, progress with other scripts, especially those of the Global South, leaves much scope for improvement. Bengali is among the five most common languages in the Global South. It is also the seventh most widely spoken language in the world. However, its representation on the internet and in DH is not proportional to its number of speakers. One of the factors holding back its greater digital use is the absence of easy OCR facilities. While historical verbal data in Bengali, in the form of images of textual matter, is making its way into the digital realm, it remains locked against further inquiry and use owing to the lack of convenient OCR.
However, there have been a few important DH projects which used Bengali texts created by OCR for further computing. One prominent endeavor was the Bichitra project (http://bichitra.jdvu.ac.in), an online variorum of the works of Rabindranath Tagore, a project with which I was involved. The project (developed by the School of Cultural Texts and Records, SCTR, Jadavpur University, Kolkata) used a set of OCR-generated plain text (.txt) files of Tagore’s works (Chaudhuri and Ghosh, “Bengali Writing System,” 19–20) that could be searched through a customized search engine. Furthermore, the project developed collation software, Prabhed, which used these plain text files to map the changes Tagore made across editions. The project developed “glance-able” information sets for Tagore’s works and used advanced text technology in Bengali. Instead of marking up texts using standard markup technology such as XML-TEI, the Tagore project devised a simpler set of symbols to encode the UTF-8 Bengali texts so as to present the structure of a printed text or a manuscript. The foundation of the entire project rested upon OCR-generated Bengali texts.
The SCTR is currently developing a digital Bengali historical dictionary. The project, titled Shabdakalpa, relies on the accumulation of OCR-generated Bengali texts across time to generate a corpus of texts to be mined for words and examples. Thus, extracting texts from static digital images of verbal matter such as manuscripts, periodicals, and books paves the way for further digital manipulation and analysis, thereby enabling DH research. This chapter analyzes the current state of Bengali OCR programs to address the need for better-performing Bengali OCR software. It refers to sample data sets from early Bengali newspapers and magazines to illustrate the potential features of such software. OCR-generated texts of early Bengali periodicals hold significant potential for data mining to bring out several bibliographical features. For instance, data mining the subscription rates of a periodical, say the first Bengali newspaper Samācār darpaṇ (1818), from an OCR-generated text can help to quickly plot the change in those rates over time. OCR greatly enhances the usability of text-based images by extracting verbal data and making it available for further digital processing such as sorting, searching, natural language processing, and so on.
Attempts at developing OCR software for Bengali started in the 1980s (Pal and Chaudhuri, “Indian Script Character Recognition”). However, to date there is no publicly available freestanding software package for Bengali OCR. Publicly accessible OCR tools for Bengali are available over the internet only as parts of bigger resources and are not suitable for large-scale humanities research in Bengali. There are multiple tools for text input in Bengali, but a dearth of resources for text extraction using methods such as OCR. The lack of a significant commercial market for Bengali OCR has also dissuaded commercial OCR software creators from including Bengali in their lists of supported languages. However, given that South Asia presents the next big tranche of the global population yet to be admitted to the digital realm, global market forces are increasingly turning their attention to South Asia and its languages in order to create the next market for their digital products.
Performing OCR on Bengali Texts
In the Bengali script, as in some other Indian scripts, letters or glyphs are written spanning three strata or zones—upper, middle, and lower. Bengali is an “abugida” script, where vowels take on a modified shape if tagged to a consonant: for instance, প (p, a consonant) + ঊ (u, a vowel) = পূ (pu), the first component of my name in Bengali, পূর্বাশা (Purbasha). The basic consonants প, ব, and শ occupy the middle zone. The vowel tag ূ, representing ঊ (u), is placed below প, extending it to the lower zone. Again, what in full is the letter র (r) becomes a slanting line above the next consonant (here ব, b) if there is no vowel in between; this conjunct consonant র্ব occupies both the middle and the upper zones. Thus, an OCR tool for Bengali has to recognize many more glyphs than one for the Latin or Roman alphabet.
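By way of illustration, the glyphs discussed above can be inspected at the level of Unicode code points, which is the form in which an OCR engine ultimately has to deliver its output. The following minimal Python sketch (the print format is purely illustrative) shows that a single printed Bengali glyph often corresponds to a sequence of code points, including vowel signs and the virama (hasanta) that forms conjuncts such as র্ব.

```python
# A minimal sketch: how the Bengali glyphs discussed above decompose into
# Unicode code points. The visual forms (vowel signs, conjuncts) that an OCR
# engine must recognize are composed from these underlying characters.
import unicodedata

name = "পূর্বাশা"  # Purbasha

for ch in name:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Sample output (abridged):
# U+09AA  BENGALI LETTER PA
# U+09C2  BENGALI VOWEL SIGN UU      <- lower-zone vowel tag attached to প
# U+09B0  BENGALI LETTER RA
# U+09CD  BENGALI SIGN VIRAMA        <- joins র to the following ব as র্ব
# U+09AC  BENGALI LETTER BA
# ...
```

The upshot is that a Bengali OCR engine must map a large repertoire of visual shapes, distributed across the three zones, onto sequences of a much smaller set of underlying characters.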
There are several Bengali OCR projects being conducted concurrently today, each following a different logic. There are also some doctoral theses which undertake experiments on Bengali OCR based on neural networks (Habib, “Bangla Optical Character Recognition,” 25). Others have experimented with character grouping, such as grouping together the characters প and শ, from the example above, on the basis of similar shape (Kibria and Al-Imtiaz, “Bengali Optical Character Recognition”). Paolo Monella has speculated that computing for non-Latin scripts has been based on tweaking the mechanisms devised for Latin scripts and working around their base principles (Monella, “Scritture dimenticate, scritture colonizzate”). Thus, for OCR in non-Latin scripts such as Bengali, the algorithms do not proceed along intuitive lines.
Some of the well-developed OCR software packages available now do not support Bengali. New OCR (https://www.newocr.com), a cloud-based OCR tool based on the command-line OCR program Tesseract (https://github.com/tesseract-ocr/tesseract), which has features such as page-layout analysis, selection of a page area for running the OCR, and support for poorly scanned or photographed pages and low-resolution images, does not work for Bengali. The few internet-based OCR tools that do work with Bengali usually support only modern fonts. Such problems affecting Bengali OCR are somewhat reminiscent of the ARTFL Project (https://artfl-project.uchicago.edu) undertaken more than a decade ago. As that project dealt with both English and French, it was thought that keying in from scratch was better than finding an OCR solution (Olsen and McLean, “Optical Character Scanning,” 125–26), as the output would have to be corrected by hand anyway.
Superimposing the OCR-generated text on top of the image (as opposed to creating a separate text file in addition to the image) can facilitate search results of a different sort. Searching across the OCR-generated text will directly open the images where the searched-for keywords are located. Running OCR on textual images does not reduce the importance of the image: the visual elements of a page, especially its layout, are of equal importance. For example, another Bengali periodical, Saṁbād bhāskar (1839), started with three columns per page, but became a four-column newspaper in 1849 to accommodate more material. Mining data quickly across large volumes of images can open up new ways of thinking about their bibliographical features.
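As a sketch of such superimposition, the pytesseract wrapper around the Tesseract engine mentioned above can output a searchable PDF in which an invisible text layer sits on top of the page image. This assumes a local Tesseract installation with a Bengali (ben) traineddata file; the file names are hypothetical, and accuracy on older fonts and layouts remains subject to the limitations discussed throughout this chapter.

```python
# Sketch: producing a "searchable" PDF in which an invisible OCR text layer is
# superimposed on the page image, so that keyword searches open the image itself.
# Assumes Tesseract is installed with the Bengali (ben) traineddata; the file
# names below are hypothetical.
import pytesseract
from PIL import Image

page = Image.open("sambad_bhaskar_page.png")   # hypothetical scanned page

# PDF with the recognized text layered invisibly over the original image
pdf_bytes = pytesseract.image_to_pdf_or_hocr(page, lang="ben", extension="pdf")
with open("sambad_bhaskar_page_searchable.pdf", "wb") as f:
    f.write(pdf_bytes)

# Plain-text output kept as a separate file alongside the image
text = pytesseract.image_to_string(page, lang="ben")
with open("sambad_bhaskar_page.txt", "w", encoding="utf-8") as f:
    f.write(text)
```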
A great deal of human intervention is needed to create machine-readable Bengali texts, as existing OCR mechanisms cannot ensure acceptable levels of accuracy. Publicly available OCR tools do not recognize the formatting of older textual material. Software such as ABBYY FineReader can even OCR multiple-column text and superimpose the text on the image; but no software recognizes multicolumn text accurately in Bengali. Despite such drawbacks, freely available tools that work with Bengali, such as Google Drive OCR and i2OCR (http://www.i2ocr.com), work reasonably well for modern fonts without complicated page layouts. Older fonts yield a considerably higher proportion of junk results.
An experiment with i2OCR (which states that it supports Bengali and recognizes multicolumn document images) reveals that it cannot autodetect the language of an image: the user has to tell the software which language to recognize. Attempting to run OCR on an image from the above-mentioned periodical Saṁbād bhāskar, with four columns and dense text, produced only junk results. Putting an image of the Bengali periodical Gaspel māgājīn (1819), a bilingual journal with columns in English and Bengali, through i2OCR also produced junk results.
Google Drive OCR fared slightly better with the same two images. It autorecognized the language as Bengali. For Saṁbād bhāskar, it also recognized the characters, but handled the four-column layout poorly; the result was too erratic to serve any practical purpose. With Gaspel māgājīn, the inability to detect multiple columns produced equally impractical results. Google Drive OCR could, however, produce satisfactory results for images from the periodical Brāhmaṇ sebadhi (1821), as these were single-column and monolingual. As expected, Kāśībārttāprakāśikā (1851), a lithographed periodical, completely stumped Google Drive OCR: not a single character could be extracted, as the text was handwritten. Transkribus (https://transkribus.eu/Transkribus), an open-source OCR platform currently being developed to learn as many words as possible, is trained by being fed texts already created with Google Drive OCR.
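The Google Drive OCR experiments above were run through the web interface. For larger batches, the same conversion can, to my understanding, be scripted through the Drive API by uploading an image for conversion to a Google Doc and then exporting the recognized text. The sketch below assumes the google-api-python-client library, OAuth credentials stored in a hypothetical token.json file, and that the ocrLanguage hint is honored for Bengali; none of this is confirmed by the sources cited in this chapter.

```python
# Sketch of scripting the Google Drive OCR route used informally above.
# Assumptions: google-api-python-client and google-auth are installed,
# token.json holds valid OAuth credentials, and the ocrLanguage hint ("bn")
# is honored during conversion to a Google Doc.
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

creds = Credentials.from_authorized_user_file("token.json")  # hypothetical token
service = build("drive", "v3", credentials=creds)

media = MediaFileUpload("brahman_sebadhi_page.png", mimetype="image/png")
meta = {"name": "brahman_sebadhi_page",
        "mimeType": "application/vnd.google-apps.document"}  # request conversion, which triggers OCR

doc = service.files().create(body=meta, media_body=media,
                             ocrLanguage="bn",               # assumed language hint
                             fields="id").execute()

# Export the recognized text of the resulting Google Doc as plain UTF-8 text
text = service.files().export(fileId=doc["id"],
                              mimeType="text/plain").execute()
print(text.decode("utf-8"))
```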
Performing OCR on Latin Fonts in Old Texts
Applying OCR to early printed material raises similar problems across all languages. A case study that applied neural-network-based OCR to scanned images of books printed between 1487 and 1870, training the OCR engine OCRopus on the RIDGES herbal text corpus, reports:
The problem of applying OCR methods to historical printings is thus twofold. First one needs to train an individual model for a specific book with its specific typography. This can be achieved by transcribing some portion of the printed text, which usually requires linguistic knowledge. Second, even if this model works well for the book it has been trained on, it does not normally produce good OCR results for other books, even if their fonts look similar to the human eye. We need to overcome this typography barrier in order to use OCR methods effectively in the building of a historical electronic text corpus. (Springmann and Lüdeling, “OCR of Historical Printings,” introduction)
When Louisiana State University digitized five volumes of the Historical Collections of Louisiana (published between 1846 and 1853) in 1991, it used Xerox’s Kurzweil K5200 integrated system, in which the software asked the user to teach it whenever it was unable to recognize a character. Moreover, the OCR software confused words such as “bad” and “had.” As the software did not have any doubt in such cases, it did not ask for human help. Human intervention was thus needed all the more at the final correction stage (Pirker and Wurzinger, “Optical Character Recognition of Old Fonts”). This project did not consider the page layout and formatting of the text in any special manner.
Experiments with OCR by the British Library’s online newspaper archive focused not only on character and word accuracy but also on the layout. Instead of the whole page, only some portions were selected, since periodical pages have complex layouts and have to be divided into sections before the texts can be extracted (Tanner, Muñoz, and Ros, “Measuring Mass Text Digitization Quality”). Trove (https://trove.nla.gov.au/about), a digital repository hosted by the National Library of Australia, has an interactive display for digitized newspapers: the user can choose an article from any column of a newspaper page. Beside the image, there is an uncorrected OCR-generated version of that particular section. The website also has a provision for crowdsourcing, or user correction, of that OCR-generated text.
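A naive version of the sectioning described above, dividing a page into columns before extraction, can be sketched as a vertical projection profile: on a binarized page image, the gutters between columns appear as runs of nearly empty pixel columns, and each stretch between gutters can be cropped and sent to OCR separately. The sketch below assumes OpenCV and NumPy; the file name and thresholds are illustrative and would need tuning for real scans of, say, Saṁbād bhāskar.

```python
# Sketch: naive column segmentation by vertical projection, so that each column
# of a multicolumn page can be passed to OCR separately.
# Assumes OpenCV and NumPy; the file name and thresholds are illustrative.
import cv2
import numpy as np

img = cv2.imread("sambad_bhaskar_page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

ink_per_column = binary.sum(axis=0) / 255          # ink pixels in each x-column
gutter = ink_per_column < 0.01 * binary.shape[0]   # nearly empty -> gutter

# Collect the x-ranges between gutters as candidate text columns
columns, start = [], None
for x, is_gutter in enumerate(gutter):
    if not is_gutter and start is None:
        start = x
    elif is_gutter and start is not None:
        if x - start > 50:                         # ignore slivers of noise
            columns.append(img[:, start:x])
        start = None
if start is not None:
    columns.append(img[:, start:])

for i, col in enumerate(columns):
    cv2.imwrite(f"column_{i}.png", col)            # OCR each column separately
```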
OCR in the Digitization Workflow
Digitization projects have different aims. Rarely is OCR a step in the workflow of such projects. For projects involving Bengali material, creating searchable texts from the digitized images is hardly ever one of the aims. Introducing OCR as one of the final steps of digitization, even of material with modern fonts, substantially increases both the budget and the completion time of the project. Thus, verbal data in Bengali digitization projects often remains as static images.
Larisa K. Miller has stated that for digital collections, even uncorrected OCR-generated texts can provide a substitute when it is not possible to create detailed item-wise metadata: the OCR-generated searchable archive can speak for itself. Miller points out that where it is difficult to create detailed metadata, digital collections can be broadly categorized as, say, “early nineteenth century,” or as the collection of a specific donor (Miller, “All Text Considered,” 536–38). She also suggests that the uncorrected OCR-generated corpus can be corrected through crowdsourcing (as Trove has been doing). For English texts in modern fonts, workable OCR packages come bundled with scanners. Such a plan is not very feasible for older material, or for material with complex formatting such as periodicals.
There may be a variety of causes for material being difficult for OCR software to read. Old material can get discolored or skewed owing to the binding; the ink on one side might show through on the other; there may be smudges or user marks on the page; the print might be faint, or the types broken (Anderson, Muhlberger, and Antonacopoulos, “Optical Character Recognition”). Such flaws are one reason why CAPTCHA logic is sometimes used to distinguish human users from machines. OCR software often fails to make it past such material, or even material where the letters are printed too close together, which confounds the software’s working definition of a letter or character. However, CAPTCHA is now being slowly phased off the internet and replaced by nonverbal visual matching or labeling patterns, because advanced OCR software is now able to decipher such CAPTCHAs automatically, thereby defeating the purpose.
The next step in digitization often involves processing the image, such as autocropping, color discarding, and conversion to black and white. Such postprocessing sometimes helps and sometimes hinders the job of OCR software. While color discarding and conversion to black and white often increase the contrast between the letters and their background, thereby helping the software to distinguish a letter, autoconversion software can also turn faintly printed letters into white or blank space. Some publicly available digital archives of early Bengali periodicals suffer from this defect.
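The trade-off can be illustrated by comparing a single global threshold for the whole page with an adaptive threshold computed over small neighborhoods, which often preserves faintly printed letters that a global cutoff would erase. The following sketch assumes OpenCV; the file name is hypothetical.

```python
# Sketch: two binarization routes for a scanned page. A single global threshold
# (Otsu) can turn faintly printed letters into blank space, while an adaptive
# threshold computed over small neighborhoods often preserves them.
# OpenCV is assumed; the file name is illustrative.
import cv2

gray = cv2.imread("bamabodhini_page.png", cv2.IMREAD_GRAYSCALE)

# Global threshold: one cutoff for the whole page
_, global_bw = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive threshold: the cutoff varies with local background brightness
adaptive_bw = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 31, 10)

cv2.imwrite("global_bw.png", global_bw)      # compare these two before running OCR
cv2.imwrite("adaptive_bw.png", adaptive_bw)
```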
Scope for OCR-generated Texts in Bengali Online Archives
“Periodicals and Newspapers from Bengal,” part of the CrossAsia-Repository (CAR), is a prominent digital archive of such material. The web page that works as a gateway to more than 200 early Bengali periodicals and magazines is a hyperlinked list of titles transliterated in English. By clicking on a title, the user reaches the metadata page of the periodical, which gives a brief description including dates, subject, and names of editors, with a link to the volumes available for download. The search will not yield results unless one transliterates the search term using the same spelling as in the list.
The West Bengal Public Library Network (WBPL) has uploaded a huge number of old Bengali books and periodicals on DSpace. Here each volume of a periodical is subdivided by year and month, with each monthly or quarterly issue forming a single PDF file.
The South Asia Archive (SAA) has nearly 200 Bengali periodicals. This digital platform breaks down the PDF files even further and each item that appeared in a particular issue is a separate digital file.
Several volumes of Bāmābodhinī (1863), a Bengali monthly magazine for women, are available in all three of these databases. On CAR, one can reach a specific volume of the magazine by following the alphabetical index of item titles and then clicking through to the required volume. WBPL goes one step further: the user can choose among the monthly issues of the same volume. On SAA, the researcher has to look at the list of contents within each number of the journal and click on the required item. But all these databases confine their processing to the metadata level. None of them offers entry to the texts qua texts, only as images.
“Keyword” is one of the elements of the metadata structure of WBPL and SAA. In WBPL, the women’s magazine Bāmābodhinī is tagged with the keywords “serial publications” and “periodicals”; SAA adds the keyword “gender,” though not all the items in Bāmābodhinī are gender-related. Bāmābodhinī appeared every month for sixty years and had several editors; it is difficult to do justice to it with a single subject keyword in the metadata. In such situations, if the full material of all the volumes were available as text files, it would increase the chance of serendipitous discovery of new information. For example, had there been an OCR-generated verbal corpus of Bāmābodhinī, variations in the occurrence of words such as śiśu (child), paribār (family), deś (nation), bhāratbarṣa (India), and śikṣhā (education) under the different editorships could be tracked through distant reading, helping to determine changing trends in women’s role and power in society and to identify texts for closer reading. Similarly, data sets could be created regarding functional aspects of the periodical such as subscription, distribution, and editorial policies.
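A minimal sketch of the distant reading imagined here might simply count those keywords across a hypothetical set of year-wise, OCR-generated UTF-8 text files of Bāmābodhinī. The file layout is an assumption, and exact-match counts would miss inflected forms unless some stemming were added.

```python
# Sketch of the distant reading imagined above: counting a few keywords across
# a hypothetical corpus of OCR-generated, UTF-8 text files of Bāmābodhinī,
# named by year (e.g. bamabodhini_1863.txt). The file layout is an assumption.
import glob
import re
from collections import Counter

# śiśu, paribār, deś, bhāratbarṣa, śikṣā
keywords = ["শিশু", "পরিবার", "দেশ", "ভারতবর্ষ", "শিক্ষা"]

counts_by_year = {}
for path in sorted(glob.glob("bamabodhini_*.txt")):
    year = re.search(r"(\d{4})", path).group(1)
    with open(path, encoding="utf-8") as f:
        tally = Counter(f.read().split())
    counts_by_year[year] = {kw: tally[kw] for kw in keywords}

for year, row in counts_by_year.items():
    print(year, row)   # plot these counts to trace changes across editorships
```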
Another repository, the HathiTrust Digital Library (started in 2008), does not have any Bengali material, but its interface has scope for data extraction. Like Google Books, it provides superimposed OCR-generated text on top of an image; in addition, it provides access to the text separately, without the image. Moreover, it allows the user to view each volume page by page. One can thus extract and copy the OCR-generated text of just a single page, as it is shown separately and not as part of a large text file.
The Internet Archive (1996) employs crowdsourcing to populate its database. Images of roughly 30,000 Bengali volumes are available on the site. However, whereas images of volumes primarily in English uploaded on the site have a corresponding OCR-generated text file, available both separately and superimposed on the images, the Bengali volumes do not have this facility. This is because the site uses ABBYY FineReader 11.0 for performing the OCR, and this program does not support Bengali. Users can upload out-of-copyright books on the site following its guidelines and entering appropriate metadata. The server processes the display according to the metadata provided and generates certain kinds of files for each imaged volume. However, as noted above, no OCR-generated text file is produced for Bengali volumes. Google Books, HathiTrust, and the Internet Archive all provide uncorrected OCR features—that is, OCR without any human intervention.
Project Gutenberg (1971) is one of the oldest digital libraries. It houses edited plain-text versions of books along with some other automatically generated file formats such as HTML, EPUB, and Kindle. Though some books are keyed in character by character, nowadays, more often than not, the books are scanned and a text file generated using OCR. Then comes the most time-consuming task, manually correcting the errors. This crucial function is performed by the Distributed Proofreaders community (https://www.pgdp.net/c/).
The Bengali Wikisource community has been active since 2007. This is different from the Wikipedia project, though they use common tools and are sponsored by the same foundation. By the end of 2020, the Bengali Wikisource site had uploaded images of 3,787 volumes; forty-six volunteers were working on 9,340 texts. They are now more focused on proofreading the OCR-generated texts and validating the pages with reference to the original book. This massive process of correction vividly brings out the shortcomings of Bengali OCR in its current state.
When a book is uploaded onto Wikisource, an interface is created for each image in the volume. This interface has an embedded Google OCR tool: a volunteer clicks the OCR button to generate the text for the image, which can then be edited with HTML tags. After editing, a text is available in ebook, PDF, and MOBI formats. Wikisource follows the double-checking method of editing texts; hence it is a long-drawn-out process.
However, for Bengali newspapers and magazines, Wikisource OCR cannot provide acceptable results. In 2018, I experimented by uploading two volumes of the Tattvabodhinī patrikā (1843) to the site, but the inbuilt Google OCR tool could not distinguish the columns: the resulting text was unusable for reading, though it can still be of some use in generating search results. For example, if someone searches for ধর্ম (dharma) in Bengali, and the word appears in the OCR-processed volume, the user will be taken to a page/image of the Tattvabodhinī patrikā.
Finally, the Society for Natural Language Technology Research has made available the corpus of seven prominent Bengali writers: Bankim Chandra Chattopadhyay, Rabindranath Tagore, Saratchandra Chattopadhyay, Swami Vivekananda, Kazi Nazrul Islam, Sukanta Bhattacharya, and Satyajit Ray. It provides only text-based web pages of the corpus. The text was created using OCR, but the images are not available on the site.
Employing Nonstandards to Reach the Ground Truth
Further development of Bengali OCR requires more large-scale collaboration. Given the lack of financial support—the main impediment in Global South DH—crowdsourcing, or more organized volunteer work monitored by specialists, seems a way of overcoming the challenges (see the arguments in Fiormonte et al., “Politics of Code”). Corrected OCR-generated texts created by Wikisource volunteers can be used to “teach” prospective OCR software various fonts. Different visual renditions of the same letter must be made available in image form from the various volumes from which the OCR-generated texts have been extracted. Rather than teaching the OCR software what a letter should look like, various forms of a letter can be entered into the software, which can be left to figure out, from the averages of the various forms, what the letter basically looks like. This is the principle of machine learning and artificial intelligence, which show high levels of accuracy in visual pattern recognition.
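The averaging idea can be sketched, in its simplest form, as a nearest-centroid classifier: compute the mean image of each letter from its many cropped renditions, then assign a new glyph to the closest mean. Production OCR engines use far more sophisticated neural models, but the sketch below, which assumes a hypothetical directory of labeled glyph crops, captures the principle.

```python
# Sketch of the "average of many renditions" idea: a nearest-centroid glyph
# classifier. Assumes a hypothetical directory per letter (glyphs/<letter>/*.png)
# of cropped, grayscale glyph images; NumPy and Pillow are assumed.
import glob
import os
import numpy as np
from PIL import Image

def load(path, size=(32, 32)):
    # Normalize every glyph crop to the same size and a 0-1 pixel range
    return np.asarray(Image.open(path).convert("L").resize(size), float) / 255.0

# "Teach" the classifier: one mean image per letter class
centroids = {}
for letter_dir in glob.glob("glyphs/*"):
    letter = os.path.basename(letter_dir)
    images = [load(p) for p in glob.glob(os.path.join(letter_dir, "*.png"))]
    centroids[letter] = np.mean(images, axis=0)

def classify(path):
    glyph = load(path)
    # Pick the letter whose mean image is closest in pixel space
    return min(centroids, key=lambda k: np.sum((centroids[k] - glyph) ** 2))

print(classify("unknown_glyph.png"))   # hypothetical cropped glyph image
```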
Another problem with OCR software is that when it encounters a word it cannot match in its lexicon, it tries to replace what it sees with the closest entry it can find there. Usually, the most recent authoritative dictionary in a language is used to develop OCR software. For historical texts, if dictionaries from the relevant period can be used, the accuracy level of the software will be higher. Thus, OCR software packages might incorporate multiple dictionaries spanning various time periods, leaving it to the operator to choose the appropriate one for OCR of a specific document. Or, when an OCR program comes across an unrecognized word that it cannot find in a dictionary, it can ask for human intervention and use that knowledge in the future. The ideal OCR software should never stop learning and internalizing new words and new glyphs.
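A period-aware correction pass of this kind might look like the following sketch: the operator chooses a wordlist matching the document’s period, a token is replaced only when a sufficiently close dictionary match exists, and the rest are queued for human review rather than substituted silently. The wordlist files are hypothetical, and the standard-library difflib matcher stands in for the visual-similarity logic a real OCR engine would use.

```python
# Sketch of period-aware post-correction: choose a wordlist matching the
# document's period, replace tokens only when a close dictionary match exists,
# and queue the rest for human review instead of substituting silently.
# The wordlist file names are hypothetical.
import difflib

wordlists = {"1800-1850": "bengali_lexicon_early19c.txt",
             "1851-1900": "bengali_lexicon_late19c.txt"}

def load_lexicon(period):
    with open(wordlists[period], encoding="utf-8") as f:
        return set(f.read().split())

def correct(tokens, period, cutoff=0.85):
    lexicon = load_lexicon(period)
    corrected, review = [], []
    for tok in tokens:
        if tok in lexicon:
            corrected.append(tok)
            continue
        close = difflib.get_close_matches(tok, lexicon, n=1, cutoff=cutoff)
        if close:
            corrected.append(close[0])       # confident near-match from the period lexicon
        else:
            corrected.append(tok)
            review.append(tok)               # flag for human intervention later
    return corrected, review
```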
Bibliography
Anderson, Niall, Gunter Muhlberger, and Apostolos Antonacopoulos. “Optical Character Recognition: IMPACT Best Practice Guide.” https://www.digitisation.eu/download/website-files/BPG/OpticalCharacterRecognition-IBPG_01.pdf.
Chaudhuri, Sukanta, and Dibyajyoti Ghosh. “The Bengali Writing System: Fonts and OCR.” In Bichitra: The Making of an Online Tagore Variorum, edited by Sukanta Chaudhuri, 19–20. Heidelberg: Springer, 2015.
Fiormonte, Domenico, Desmond Schmidt, Paolo Monella, and Paolo Sordi. “The Politics of Code: How Digital Representations and Languages Shape Culture.” SciForum, June 19, 2015. https://sciforum.net/paper/view/conference/2779.
Habib, S. M. Murtoza. “Bangla Optical Character Recognition.” PhD diss., BRAC University, Dhaka, 2014.
Kibria, Muhammad Golam, and Al-Imtiaz. “Bengali Optical Character Recognition Using Self Organizing Map.” In 2012 International Conference on Informatics, Electronics & Vision (ICIEV), 764–69. IEEE, 2012. doi:10.1109/ICIEV.2012.6317479.
Miller, Larisa K. “All Text Considered: A Perspective on Mass Digitizing and Archival Processing.” American Archivist 76, no. 2 (2013): 521–41.
Monella, Paolo. “Scritture dimenticate, scritture colonizzate: sistemi grafici e codifiche digitali nelle culture araba e indiana” [Forgotten scripts, colonized scripts: graphic systems and digital encodings in Arab and Indian cultures]. Paper presented at conference “Ricerca scientifica, monopoli della conoscenza e Digital Humanities: Prospettive critiche dall’Europa del Sud,” Università Roma Tre, Rome, October 25, 2018. http://www1.unipa.it/paolo.monella/scritture2018/abstract/extended_abstract_eng_v1.0.html.
Olsen, Mark, and Alice Music McLean. “Optical Character Scanning: A Discussion of Efficiency and Politics.” Computers and the Humanities 27, no. 2 (1993): 121–27.
Pal, U., and B. B. Chaudhuri. “Indian Script Character Recognition: A Survey.” Pattern Recognition 37 (2004): 1887–99. http://library.isical.ac.in:8080/jspui/bitstream/10263/2819/1/Binder1.pdf.
Pirker, Johanna, and Gerhard Wurzinger. “Optical Character Recognition of Old Fonts: A Case Study.” IPSI BgD Transactions on Advanced Research 12, no. 1 (2016): 10–14. http://ipsitransactions.org/journals/papers/tar/2016jan/p3.pdf.
Springmann, Uwe, and Anke Lüdeling. “OCR of Historical Printings with an Application to Building Diachronic Corpora: A Case Study Using the RIDGES Herbal Corpus.” Digital Humanities Quarterly 11, no. 2 (2017). http://www.digitalhumanities.org/dhq/vol/11/2/000288/000288.html.
Tanner, Simon, Trevor Muñoz, and Pich Hemy Ros. “Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library’s 19th Century Online Newspaper Archive.” D-Lib Magazine 15, no. 7–8 (2009). http://www.dlib.org/dlib/july09/munoz/07munoz.html.