IMPROVED OCR QUALITY FOR SMART SCANNED DOCUMENT MANAGEMENT SYSTEM

  • Phan Viet Anh, Le Quy Don Technical University
  • Nguyen Duy Tung Khanh, Le Quy Don Technical University
  • Tran Manh Dat, Le Quy Don Technical University
  • Pham Van Dan, Le Quy Don Technical University

Abstract

The quality of document images is a crucial factor in the performance of an Optical Character Recognition (OCR) model. Various issues in the input data, such as heterogeneous layouts, skew, and proportional fonts, hinder recognition. This paper investigates several data pre-processing algorithms, including image deskewing and table and document layout analysis, to improve the accuracy of the OCR model, and then builds an end-to-end scanned document management system. We verified the algorithms using the well-known OCR software Tesseract. Experiments on a real dataset show that our methods can accurately process document images with arbitrary rotation angles and different layouts. As a result, the word-level accuracy of Tesseract improves by 23% for documents with complex structures. The quality of the output text makes it possible to build a system that stores and searches documents efficiently.

Keywords: Optical Character Recognition (OCR)
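
The abstract reports accuracy "by words" without defining the metric. One common way to score OCR output at the word level is sketched below: one minus the word-level edit distance divided by the number of reference words. This is an assumption for illustration, not necessarily the measure used in the paper, and the function names `word_edit_distance` and `word_accuracy` are hypothetical.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance computed over words instead of characters."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from OCR output
                curr[j - 1] + 1,     # extra word in OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Assumed metric: 1 - (word edit distance / reference word count)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        # Nothing to recognize: perfect only if the OCR output is also empty.
        return 1.0 if not hyp_words else 0.0
    distance = word_edit_distance(ref_words, hyp_words)
    # Clamp at zero, since many spurious words can push the ratio above 1.
    return max(0.0, 1.0 - distance / len(ref_words))


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr_output = "the quick brwn fox"
    print(word_accuracy(truth, ocr_output))
```

Under this definition, a pre-processing step such as deskewing can be assessed by comparing the score of the OCR output on the raw image with the score on the pre-processed image against the same ground-truth text.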

Published
2021-05-17
Section
Articles