Extracting text from images and PDF files using gImageReader

Introduction

gImageReader is a simple graphical front end (GUI) for Tesseract, an open-source OCR (optical character recognition) engine that is otherwise used from the command line. It makes Tesseract easier for beginners to use when extracting text from images and PDF files. Tesseract is a powerful OCR engine that supports Unicode (UTF-8) and can recognize around 100 languages. It was originally developed at HP, was released as open source in 2005, and its development was subsequently sponsored by Google.

Main features of gImageReader:

gImageReader has many features, including:

  • Importing PDF files and image files, and taking screenshots, to extract text

  • Pre-processing the PDF page or image for OCR

  • Recognizing and extracting text from the imported PDF or image files

  • Spell-checking and other post-processing features

  • Exporting the resulting text to PDF


How to install gImageReader on Ubuntu and its derivatives (I'm using Linux Lite for demonstration; for other operating systems, check the documentation on gImageReader's GitHub page):

  • Note: You need to install the Tesseract language packages before gImageReader can process images and PDFs and extract text. They can be found in your software manager.
  • I installed the English Tesseract package on Linux Lite using the package manager.
Tesseract OCR language packages

  • Then install gImageReader from the terminal:
  • Open a terminal (CTRL + ALT + T)
  • Add the repository (sudo add-apt-repository ppa:sandromani/gimagereader)
  • Update the package list (sudo apt-get update)
  • Install the package (sudo apt install gimagereader)
  • This is how the gImageReader GUI looked on my Linux Lite system.
gImageReader - GUI in Linux Lite
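
The installation steps above can be collected into one small script. This is only a sketch, assuming an Ubuntu-based system with apt and sudo. The function names has_lang and install_gimagereader are made up for illustration, and the language package name follows Ubuntu's usual pattern of tesseract-ocr- plus a three-letter language code (for example tesseract-ocr-eng).

```shell
#!/bin/sh
# Sketch only: assumes an Ubuntu-based system with apt and sudo.

# has_lang CODE: read the output of `tesseract --list-langs` on stdin
# and succeed if the language CODE (for example "eng") is listed.
has_lang() {
    grep -qx "$1"
}

# install_gimagereader [CODE]: install Tesseract, one language package
# (English by default) and gImageReader from the PPA used in this article,
# then check that the language is visible to Tesseract.
install_gimagereader() {
    lang="${1:-eng}"
    sudo apt-get install -y tesseract-ocr "tesseract-ocr-$lang"
    sudo add-apt-repository -y ppa:sandromani/gimagereader
    sudo apt-get update
    sudo apt-get install -y gimagereader
    # Older Tesseract versions print the language list to stderr.
    if tesseract --list-langs 2>&1 | has_lang "$lang"; then
        echo "Language '$lang' is available"
    else
        echo "Language '$lang' is missing" >&2
        return 1
    fi
}

# To run the installation, call:
#   install_gimagereader eng
```

Nothing is installed until install_gimagereader is called, so the script can be read through first. Passing another code, such as deu for German, would install that language package instead of English.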

 

How to use gImageReader to process images and PDFs and extract text

  • To import PDF files or images into gImageReader, click the add button in the top-left corner, then click the "add images" button, as shown here:
gImageReader - Import file


  • You can add PDF files or images.
  • You can choose whether the output is plain text or hOCR.

  • You can choose to process only the current page of a PDF file or multiple pages, as shown here:
gImageReader

  • If you choose to process multiple pages of a PDF file, gImageReader will ask you to enter a range of pages and to choose a layout for OCR processing.
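
Because gImageReader is a front end to Tesseract, the same choices (a page range, and plain text or hOCR output) can be approximated on the command line. The sketch below is not how gImageReader works internally. It assumes that pdftoppm from the poppler-utils package is installed to render PDF pages as images, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes tesseract and pdftoppm (poppler-utils) are installed.

# First and last page of a range such as "2-5".
# A single page such as "3" gives 3 for both.
range_first() { echo "${1%%-*}"; }
range_last()  { echo "${1##*-}"; }

# ocr_pdf_pages PDF RANGE [LANG] [FORMAT]
# FORMAT is "txt" for plain text (default) or "hocr".
# Prints the directory that holds the results.
ocr_pdf_pages() {
    pdf="$1"
    first=$(range_first "$2")
    last=$(range_last "$2")
    lang="${3:-eng}"
    format="${4:-txt}"
    workdir=$(mktemp -d)

    # Render the selected pages as 300 dpi PNG images.
    pdftoppm -f "$first" -l "$last" -r 300 -png "$pdf" "$workdir/page"

    for image in "$workdir"/page-*.png; do
        if [ "$format" = "hocr" ]; then
            # The "hocr" config makes Tesseract write <name>.hocr.
            tesseract "$image" "${image%.png}" -l "$lang" hocr
        else
            # Without a config, Tesseract writes <name>.txt.
            tesseract "$image" "${image%.png}" -l "$lang"
        fi
    done
    echo "$workdir"
}

# Example call (hypothetical file name):
#   ocr_pdf_pages document.pdf 2-5 eng hocr
```

For a single image no rendering step is needed: tesseract image.png output -l eng writes the recognized text to output.txt.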



