What is OCR?
Optical Character Recognition, also called Optical Character Reader (OCR), is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text, whether from a scanned document, a photo of a document, or a scene photo. To better understand how OCR works, look at the process diagram in the following picture. From the end user’s side, the OCR process is very simple: the image is processed and editable text comes back.
How to use Tesseract.js in Mendix
In this blog, I’ll show you how to use Tesseract.js to build an OCR feature in a Mendix application.
2) Select Load Language and Initialize Language
3) Result - Fetch text from an image as string type
1) Import tesseract.js and buffer
2) Initialize And Run Tesseract
A Worker handles the OCR-related tasks for you. It takes a few steps to set up a Worker before it is fully functional. The full flow is as follows:
• FS functions // optional
• setParameters // optional
• recognize or detect
Each function is async, so using async/await or Promise is required. When it is resolved, you get a TesseractJob object.
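To make the flow more concrete, here is a minimal sketch of the steps chained with async/await. It assumes a worker that exposes loadLanguage, initialize, recognize, and terminate, as in the Tesseract.js v2 to v4 style API, and that recognize resolves to an object with the text under data.text. The helper name runOcr is my own and not part of Tesseract.js.

```javascript
// Sketch of the Worker flow. Assumption: the worker exposes loadLanguage,
// initialize, recognize and terminate (Tesseract.js v2-v4 style API).
// `runOcr` is a hypothetical helper name, not a Tesseract.js function.
async function runOcr(worker, image, langs = "eng") {
  try {
    // Load the traineddata (from cache or remote) into the WASM file system.
    await worker.loadLanguage(langs);
    // Prepare the Tesseract API for the loaded language(s).
    await worker.initialize(langs);
    // Run OCR; the resolved object is assumed to carry the text in data.text.
    const { data } = await worker.recognize(image);
    return data.text;
  } finally {
    // Always release the worker, even if one of the steps above failed.
    await worker.terminate();
  }
}
```

In a real application the worker would come from createWorker() in tesseract.js. Depending on the library version, createWorker may itself need to be awaited, or the worker may need an extra load() call before loadLanguage, so check the documentation of the version you install.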
Worker.loadLanguage() loads traineddata from the cache, or downloads it from a remote source, and puts it into the WebAssembly file system.
langs: a string indicating which languages' traineddata to download. Multiple languages are concatenated with +, e.g. eng+chi_tra
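If the language codes live in a list (for example, chosen by the user), a small helper can build the langs string. This is only a sketch: buildLangs is a hypothetical name of my own, and the trimming and de-duplication are my additions, not something Tesseract.js requires.

```javascript
// `buildLangs` is a hypothetical helper, not part of Tesseract.js.
// It joins language codes with "+" as expected by the langs argument.
function buildLangs(codes) {
  // Trim whitespace, drop empty entries, and remove duplicates (order kept).
  const unique = [...new Set(codes.map((c) => c.trim()).filter(Boolean))];
  if (unique.length === 0) {
    throw new Error("At least one language code is required");
  }
  return unique.join("+");
}

console.log(buildLangs(["eng", "chi_tra"])); // prints "eng+chi_tra"
```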
Worker.initialize() initializes the Tesseract API and makes sure it is ready for OCR tasks.
langs: a string indicating the languages to be loaded by the Tesseract API. It can be a subset of the language traineddata you loaded with Worker.loadLanguage.
Worker.recognize() provides the core function of Tesseract.js: it executes the OCR.
Figures out what words are in the image, where the words are in the image, etc.
Note: The image should have a sufficiently high resolution. Often, the same image will give much better results if you upscale it before calling recognize.
image: see Image Format for more details.
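Besides the plain text, the result of recognize also describes where each word sits in the image. The sketch below turns that into a simple list of boxes. It assumes the result has a data.words array whose entries look like { text, confidence, bbox: { x0, y0, x1, y1 } }, which is the shape I know from the v2 to v4 style API; newer versions may structure the output differently. The name listWordBoxes and the minConfidence filter are my own.

```javascript
// Assumption: data.words is an array of
// { text, confidence, bbox: { x0, y0, x1, y1 } } (Tesseract.js v2-v4 style).
// `listWordBoxes` is a hypothetical helper, not part of Tesseract.js.
function listWordBoxes(data, minConfidence = 0) {
  return (data.words || [])
    // Keep only words the engine is reasonably sure about.
    .filter((w) => w.confidence >= minConfidence)
    // Convert corner coordinates into position plus size.
    .map((w) => ({
      text: w.text,
      x: w.bbox.x0,
      y: w.bbox.y0,
      width: w.bbox.x1 - w.bbox.x0,
      height: w.bbox.y1 - w.bbox.y0,
    }));
}
```

A list like this is handy if you want to draw rectangles over the original image or pick out text from a specific region.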
These are the image types that the Tesseract engine can read:
• It supports multiple languages; see the complete list of supported languages.
• Accuracy is high with normal fonts and a clear background.
• Accuracy will be low with noisy backgrounds and custom-scripted fonts.
• Tesseract doesn’t support all file formats by itself.
• The image quality must reach a certain dots-per-inch (DPI) threshold for it to work.
After having fun working with Tesseract OCR, I can say that the engine is amazing! It brings the power of OCR to the browser and opens a door of opportunities for developers. Here is a list of Tesseract's strong points, in my opinion:
1. Open Source.
2. Easy to use.
3. Good extraction results.
4. Multi-language support.
If you are facing some issues and think OCR is your solution, Tesseract would be nice to try! I hope this article is useful for you. Thank you!!