OCR (Optical Character Recognition), a boon to modern publishing

Kundalini and the art of writing are related

Readers may wonder what self-publishing and website creation have to do with a Kundalini website. In fact, a Kundalini seeker should also have practical experience of self-publishing and website building. This is because, after Kundalini activation or Kundalini awakening, the mind becomes flooded with thoughts. In that state, a person can create an excellent book and an excellent website. It also helps avoid the negativity that comes with having no work to do. That is what happened with Premyogi vajra.

How I came across OCR technology

I came across OCR about seven years ago, when I was trying to make an e-book from a paper book written by my father. The name of that book was ‘Solan ki sarvhit saadhna/सोलन की सर्वहित साधना’. Fortunately, the publisher managed to find the soft copy of the book, which saved me from scanning it. Besides, an e-book made from a soft copy probably contains fewer errors. The soft copy was in PDF format. At first, I turned to online PDF converters. I tried a variety of them, including the Google Drive converter. However, the Word file produced by every one of them was completely garbled. The text looked like Chinese, not the original Hindi. Then I used PDF Element. Its free plan allowed only a few pages. The pages were extracted from the PDF file into a Word file, but they came with the original decorations such as decorative strips, borders, and flowers. Some of these could be removed in Word, but not all, and it was labor-intensive work. Even the quality of the letters was no better. I thought the paid plan might do the job. However, when I saw its price, I gave up on it completely, because the minimum annual price was about INR 3000-4000.

Free online file converters helped me a lot

For several months, my plan lay in cold storage. Then, when I got some free time, I searched on Google again. I had read about OCR before, but had never fully understood it. From a web post I learned that the book has to be scanned so that every page becomes a separate picture. Just as I was preparing to scan the book, I found out that if the book is available as a PDF file, it can be converted directly into picture files. I tried many online converters, which I found by searching Google for ‘PDF Image Extraction’. Of these, the one on Smallpdf.com worked best for me. The extractor on ilovepdf.com is also good. I uploaded the whole book file in one go, and after conversion the whole converted book was downloaded. It arrived in the computer's download folder as a zipped (compressed) folder containing one picture (jpg image) per page, in serial order. After unzipping that folder (with software such as WinZip), all the images were available in an ordinary folder.
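The unzip step can also be scripted. The following is a minimal sketch in Python, not what was actually used for the book; the function names and the file names in the usage note are made up for illustration. It unpacks the downloaded archive and lists the page images in page order, so that page-10 comes after page-2 rather than before it.

```python
import re
import zipfile
from pathlib import Path


def natural_key(name):
    """Split a file name into text and integer chunks so 'page-10' sorts after 'page-2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def extract_page_images(zip_path, out_dir):
    """Unzip the archive into out_dir and return the .jpg/.jpeg paths in page order."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    images = [p for p in out_dir.rglob("*")
              if p.suffix.lower() in {".jpg", ".jpeg"}]
    return sorted(images, key=lambda p: natural_key(p.name))
```

For example, `extract_page_images("book.zip", "pages")` would return the extracted image paths ready to be fed to an OCR tool one after another.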

Very few OCR tools are available for Hindi

Next, I searched Google for an online app that could convert those pictures into Word documents (an OCR app). Many OCR tools did not support Hindi. Finally, I found the best online OCR for Hindi on the website http://www.i2ocr.com. It was free. I first tried to upload the whole folder of book pictures, but that did not work. Then I tried selecting all the pictures and uploading them together, but that did not work either. From a web post I learned that batch OCR tools are commercial and are not available for free. Therefore, I had to convert the pictures one by one. Like the book pictures, the converted doc files were saved in serial order in a folder.

Formatting the Word file created from the extracted images

Then I copied all the doc files into a single doc file in the right sequence. However, the lines in that file were of unequal length, and even Justify alignment did not correct them, evidently because every line ended with its own paragraph mark. Then I read in a web post that in the ‘Find and Replace’ dialog of MS Word one should type ^p in the ‘Find’ box (the ^ symbol is typed by pressing the Shift key and the 6 key together) and a single blank space in the ‘Replace’ box. With ‘Replace All’, everything was set right. That is what happened, and in this way the e-book was prepared.
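The same clean-up can be sketched in Python for OCR output saved as plain text. This is only an illustration, and it differs from the Word command in one respect: Word's ‘Replace All’ on ^p replaces every paragraph mark, including the genuine ones, whereas this sketch assumes that real paragraphs are separated by a blank line and keeps those breaks. The function name is made up for the example.

```python
import re


def join_wrapped_lines(text):
    """Replace single line breaks with a space; keep blank lines as paragraph breaks."""
    text = text.replace("\r\n", "\n")          # normalize Windows line endings
    paragraphs = re.split(r"\n\s*\n", text)    # a blank line marks a real paragraph break
    joined = [" ".join(line.strip() for line in p.split("\n") if line.strip())
              for p in paragraphs]
    return "\n\n".join(p for p in joined if p)
```

If the OCR output has no blank lines at all, the whole text comes back as one paragraph, which matches what the Word command does.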

If there are many small Word files to be combined, the ‘Insert’ tab of MS Word can help. Click ‘Insert’, find the ‘Object’ button, and click the small triangle at its corner. Then click ‘Text from File’ in the dropdown menu. A browse window will pop up. Select the Word files to be combined. Keep in mind that the files are combined in the order of selection: the first file in the selected group comes first in the combined Word file, and so on. I advise combining at most 10 files at a time, because I feel that selecting a very large number of files together can produce errors.

In fact, many of my files were not downloaded in Word format after conversion. I was doing OCR in Hindi, and those files were downloaded in text format instead. The text files opened only in Notepad, not in WordPad. The drawback of text files is that they cannot be combined with commands such as Insert > Object, as Word files can. Every file has to be copied and pasted separately.
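Copy-pasting many text files by hand can be avoided with a few lines of Python. The following is a minimal sketch, assuming the text files are saved as UTF-8 (an assumption; the actual encoding of the downloads is not known) and that the caller passes the paths in page order. The function name and the file names in the usage note are illustrative only.

```python
from pathlib import Path


def merge_text_files(paths, output_path, encoding="utf-8"):
    """Concatenate the given text files, in the order given, into one file."""
    parts = [Path(p).read_text(encoding=encoding).strip() for p in paths]
    merged = "\n\n".join(parts) + "\n"   # one blank line between consecutive files
    Path(output_path).write_text(merged, encoding=encoding)
    return merged
```

For example, `merge_text_files(["page-1.txt", "page-2.txt"], "book.txt")` writes both pages into `book.txt`, which can then be opened and formatted in a word processor.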

Final File Correction

At many places in the book, two words had been joined together without a space. For example, the words ‘fruit is’ of the original book became ‘fruitis’. A little manual correction set this right. Page breaks, line breaks, heading styles and so on were added at the appropriate places with the paper book kept in front of me, so that the e-book looks just like the original book. Some pictorial pages from the cover and the initial part of the book were inserted directly into the e-book. To edit these cover photos, I used the online photo editor ‘photojet’. However, it requires sharing on Facebook before the edited image can be downloaded. The online editor at pixlr.com is also good. Instead of copy-pasting images, I used ‘Insert > Picture’ in MS Word, because an image pasted directly into the Word file does not always appear in the e-book.
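The joined words were corrected by hand, but part of that work could in principle be automated when a word list is available. The following Python sketch is an illustration only: it assumes a vocabulary set supplied by the user (for Hindi, a Hindi word list would be needed), it ignores punctuation attached to words, and it collapses runs of whitespace into single spaces. Both function names are made up for the example.

```python
def split_fused_word(token, vocabulary):
    """Return 'left right' if token is two known words run together, else token unchanged."""
    if token in vocabulary:
        return token
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in vocabulary and right in vocabulary:
            return f"{left} {right}"
    return token


def fix_fused_words(text, vocabulary):
    """Apply split_fused_word to every whitespace-separated token of text."""
    return " ".join(split_fused_word(t, vocabulary) for t in text.split())
```

A word that is not in the vocabulary and cannot be split into two known words is left untouched, so the result still needs a manual check.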

Some points to note when doing OCR

Before scanning a book, check how old it is. OCR generally does not work well on very old books. The binding of the book has to be opened so that each page can be scanned separately. If a bound book is scanned by folding it open, the scans are poor near the page margins and do not give good OCR. The book can be re-bound later. A scan of a two-page spread will not work for OCR either. Each page has to be placed on the scanner the way a single sheet is usually placed, with the length of the page along the length of the scanner. The page should be oriented like a normal book page, that is, with the lines of text running across the width of the page. The straighter the page lies on the scanner, the better the OCR. Therefore, the page should be pushed flush against the plastic edge that runs along the length of the scanner glass at the back. This automatically makes the page straight. Along its length, the page should sit in the middle of the scanner.

Try the easier option before doing OCR

Often there is no need for OCR, because converting the font is enough. Unicode is the universal encoding for text. I once converted a PDF article typed in a Krutidev font into a Word article, but its characters were unreadable. Then I put the file into an online font converter and converted the Krutidev text to Unicode. After that the letters were readable. Only one or two kinds of letters were wrong, and only at a few places. I corrected the article with a little effort, much less than OCR would have needed. Even so, OCR is far easier than re-typing.

Future technology: ‘handwritten text recognition’

The next step is recognizing handwritten text by OCR. This is called ‘handwritten text recognition’. It has not yet matured fully, and research is ongoing. For now, the technique works when letters are written one by one in predefined boxes on a paper form. That is why application forms for recruitment and the like are filled in by hand in a boxed format.



Published by

demystifyingkundalini by Premyogi vajra- प्रेमयोगी वज्र-कृत कुण्डलिनी-रहस्योद्घाटन

I am as natural as air and water. I take up whatever work is at hand, work hard at it, and make merry. I am fond of Yoga, Tantra, Music and Cinema.
