Guide to OCR for Indic Scripts Document Recognition and Retrieval

Optical Character Recognition (OCR) is a key enabling technology critical to creating indexed, digital library content, and it is especially valuable for Indic scripts, for which there has been very little digital access. Indic scripts, the ancient Brahmi



For further volumes: http://www.springer.com/series/4205

Venu Govindaraju · Srirangaraj Setlur Editors



Editors
Prof. Venu Govindaraju, Center for Unified Biometrics and Sensors, 520 Lee Entrance, Suite 202, Amherst, NY 14228, USA

Srirangaraj (Ranga) Setlur, Center for Unified Biometrics and Sensors, 520 Lee Entrance, Suite 202, Amherst, NY 14228, USA

Series Editor
Professor Sameer Singh, PhD, Research School of Informatics, Loughborough University, Loughborough, UK

ISBN 978-1-84800-329-3
e-ISBN 978-1-84800-330-9
DOI 10.1007/978-1-84800-330-9

Springer London Dordrecht Heidelberg New York

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2009934526

© Springer-Verlag London Limited 2009

Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to the publishers.

The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant laws and regulations and therefore free for general use.

The publisher makes no representation, express or implied, with regard to the accuracy of the information contained in this book and cannot accept any legal responsibility or liability for any errors or omissions that may be made.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Foreword

The original motivations for developing optical character recognition technologies were modest: to convert printed text on flat physical media to digital form, producing machine-readable digital content. By doing this, words that had been inert and bound to physical material would be brought into the digital realm and thus gain new and powerful functionalities and analytical possibilities. First-generation digital OCR researchers in the 1970s quickly realized that by limiting their ambitions primarily to contemporary documents printed in standard fonts in the modern Roman alphabet (and of these, mostly English-language materials), they were considerably constraining the possibilities for future research and technologies. Domain researchers also saw that the trajectory of OCR technologies, if left unchanged, would exclude a large portion of the human record. Digital conversion of documents and manuscripts in other alphabets, scripts, and cursive styles was of critical importance. Embedded in non-Roman alphabet