Bureau Ingénieur Tomasi S.A.R.L.
Adaptive OCR
Solution
______________________________________________________________________________________ B.I.T. Bureau Ingénieur Tomasi Sarl – 2649 route de Pujaudran - 31530 Mérenvielle -Tel : 05 62 13 95 32 - e-mail : [email protected] N° SIRET 503 902 983 00017 R. C. TOULOUSE
Summary
I. General presentation
II. Binarisation
III. Segmentation
IV. OCR Recognition
V. Sequencer
VI. Post-OCR correction with Spellchecking
VII. Pictures Treatment/Export
VIII. Export of content:
IX. Contact
I. General presentation
B.I.T. has developed an adaptive OCR solution called BIT-Alpha.
This semi-automatic adaptive OCR can adapt itself to all types of text,
independently of their language, typeface or age.
Developed specifically for processing historical and heritage documents,
BIT-Alpha supports scientific research and makes the content accessible.
BIT-Alpha is a single tool covering the whole workflow:
Binarisation
Segmentation
OCR recognition
Post-OCR correction with spellchecking
Picture processing/Export
Export of content
II. Binarisation
BIT-Alpha offers three binarisation modes:
Binarisation by thresholding, ideal for newspapers
BIT-Alpha analyses the document region by region, so the binarisation is not
necessarily the same at the top, at the bottom or in the corners of the page.
Because it analyses regions rather than the whole document globally, the
binarisation adapts to the different contrast levels within the document.
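The region-by-region idea can be illustrated with a minimal sketch. It assumes a greyscale page stored as a list of rows (values 0 to 255) and uses the mean of each tile as that tile's threshold; the function name, the tile size and the choice of the mean are illustrative assumptions, not BIT-Alpha's actual method.

```python
from statistics import mean

def binarise_by_tiles(image, tile=2):
    """Binarise a greyscale image (list of rows, values 0-255) tile by tile.

    Each tile gets its own threshold (here simply the tile's mean, an
    assumption made for this sketch), so a dark region and a bright region
    do not have to share one global cut-off.
    Returns rows of 0 (ink) and 1 (background).
    """
    height, width = len(image), len(image[0])
    out = [[1] * width for _ in range(height)]
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            threshold = mean(image[y][x] for y in rows for x in cols)
            for y in rows:
                for x in cols:
                    # Pixels darker than the local threshold count as ink.
                    out[y][x] = 0 if image[y][x] < threshold else 1
    return out
```

With a single global threshold, a page whose left half is dark and whose right half is bright would come out as all ink on the left and all background on the right; with per-tile thresholds, the text in both halves survives.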
Binarisation with the “Niblack” algorithm
BIT-Alpha analyses the contrast variance around each letter. This lets it
distinguish a letter from a colour spot close to it, and therefore remove
background noise without removing parts of the letter.
BIT-Alpha performs the variance analysis over pixel neighbourhoods and thereby
determines whether a pixel belongs to a text area, a non-text area, an
interline space or a picture.
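For readers unfamiliar with Niblack's method, the textbook form computes a threshold for every pixel from the mean m and the standard deviation s of its neighbourhood: T = m + k·s, with k usually a small negative number. The sketch below shows that textbook form only; the window radius, the value of k and the function name are assumptions, and how BIT-Alpha tunes or extends the method is not described here.

```python
from statistics import mean, pstdev

def niblack(image, radius=1, k=-0.2):
    """Textbook Niblack binarisation of a greyscale image (rows of 0-255).

    For each pixel the threshold is T = m + k * s, where m and s are the
    mean and standard deviation of the surrounding window.
    Returns rows of 0 (ink) and 1 (background).
    """
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Window clipped at the image borders.
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            threshold = mean(window) + k * pstdev(window)
            row.append(0 if image[y][x] < threshold else 1)
        out.append(row)
    return out
```

Because the threshold follows the local statistics, a dark stroke on a bright background is kept as ink while its bright neighbours stay background.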
Binarisation based on an algorithm developed by B.I.T.
Thanks to this advanced spectral-decomposition algorithm, BIT-Alpha can
redraw and reconstruct damaged letters, as if it were choosing the
optimal paint brush (fine or broad). It also preserves very fine character
strokes that other algorithms may delete.
These binarisation modes prepare the document as well as possible in order to
obtain the best possible OCR results for historical and heritage documents.
III. Segmentation
BIT-Alpha segments titles, subtitles, pictures, picture captions, chapters
and articles, for example in newspapers:
Fraktur dated 1805 to 1944: segmentation of titles, subtitles and chapters
During segmentation, BIT-Alpha detects each line, then each word within a line,
and then each character within a word individually.
Note that BIT-Alpha can output the position of each character (for example into
an ALTO file).
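As an illustration of what per-character positions can look like in ALTO, the sketch below builds a simplified fragment with Python's standard XML library. The element and attribute names (String, Glyph, CONTENT, HPOS, VPOS, WIDTH, HEIGHT) follow the ALTO schema as far as we know it (Glyph exists only in recent schema versions); the namespace and the enclosing page, block and line elements are omitted, and the helper name and coordinates are made up for the example. It is not BIT-Alpha's actual export code.

```python
import xml.etree.ElementTree as ET

def word_to_alto(word, boxes):
    """Build a simplified ALTO-style <String> element with one <Glyph> per character.

    boxes holds one (hpos, vpos, width, height) tuple per character of word.
    The word's own box is the smallest rectangle enclosing all character boxes.
    """
    left = min(b[0] for b in boxes)
    top = min(b[1] for b in boxes)
    right = max(b[0] + b[2] for b in boxes)
    bottom = max(b[1] + b[3] for b in boxes)
    string = ET.Element(
        "String", CONTENT=word, HPOS=str(left), VPOS=str(top),
        WIDTH=str(right - left), HEIGHT=str(bottom - top),
    )
    for char, (hpos, vpos, width, height) in zip(word, boxes):
        ET.SubElement(
            string, "Glyph", CONTENT=char, HPOS=str(hpos), VPOS=str(vpos),
            WIDTH=str(width), HEIGHT=str(height),
        )
    return string

element = word_to_alto("qu", [(10, 5, 8, 12), (19, 7, 9, 10)])
print(ET.tostring(element, encoding="unicode"))
```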
IV. OCR Recognition
Developed for processing historical and heritage documents, BIT-Alpha is an
adaptive OCR capable of adapting itself to all types of text, independently of
their language, typeface or age.
Character learning can be done manually or automatically:
Manual
Training with human action:
The digital signatures of characters are stored in memory.
Because the “image” of a character is much heavier than its digital signature,
BIT-Alpha can build larger databases than tools that save “images” of characters.
Automatic
Training without human action:
BIT-Alpha can learn characters automatically from the text being
processed. During a batch process, BIT-Alpha reads and recognises the
characters it already knows; those recognised with high
reliability are then used to train the OCR engine. BIT-Alpha’s
reliability rate therefore increases with each processed page.
A spellchecking database adapted to the type of documents
to be processed (for example a Latin database) can be loaded into
BIT-Alpha. If BIT-Alpha recognises a word from the database, it
automatically learns all the characters that make up this word. BIT-Alpha
can handle databases of more than 500,000 words.
BIT-Alpha can identify the typefaces that make up a text even
when they are mixed: Gothic (before 1845), Fraktur (after 1845),
Antiqua, Cursive, Greek, Hebrew...
BIT-Alpha can recognise and read embellished letters, miniatures and
abbreviations, and can deal with unusual characters.
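The two automatic training routes described in this section (characters recognised with high reliability, and characters belonging to a word found in the spellchecking database) can be sketched as a simple selection step. Everything in this sketch is hypothetical: the function name, the tuple layout, the 0.95 cut-off and the case-insensitive lookup are assumptions chosen for illustration, not BIT-Alpha's interface.

```python
def harvest_training_samples(words, lexicon, min_confidence=0.95):
    """Select characters that are safe to feed back into training.

    words: list of words, each a list of (signature, label, confidence)
           tuples, one per recognised character.
    lexicon: set of known words (the spellchecking database).
    A character is kept if it was recognised with high confidence, or if
    the whole word it belongs to is found in the lexicon.
    """
    samples = []
    for chars in words:
        text = "".join(label for _, label, _ in chars)
        word_known = text.lower() in lexicon
        for signature, label, confidence in chars:
            if word_known or confidence >= min_confidence:
                samples.append((signature, label))
    return samples
```

In this sketch a word found in the lexicon vouches for all of its characters, including those the engine was unsure about, which is how a dictionary can speed up learning on an unfamiliar typeface.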
V. Sequencer
The Sequencer makes it possible to:
Reconstruct fragmented characters: Sometimes a letter can be fragmented
into two or more parts. BIT-Alpha recognises the fragments of a letter and
reconstitutes it.
Recognition of the right hand side of a lower-case “n” (RKN)
Recognition of the left hand side of a lower-case “n” (LKN)
Assembling of the two fragments by the sequencer and reconstruction of
the “n”
Expand abbreviations
In Roman writing a “q” followed by “;” means “que”.
Correct wrong sequences of letters
Where another OCR reads “nnn”, the Sequencer corrects it to “mm”. BIT-Alpha
takes into account the typical letter sequences of the language of the
document being processed and can therefore correct incorrect sequences.
For example, in Latin the wrong sequence “dcn” is changed into the typical
one, “den”. Another example is the incorrect sequence “qn”, which is
changed into the typical Latin one, “qu”.
The Sequencer comprises more than 900 sequences preprogrammed in BIT-Alpha’s
database. With each use, the Sequencer’s database can be extended, and
conversely any preprogrammed sequences that prove disruptive can be removed.
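The correction and abbreviation rules quoted above can be pictured as an editable table of sequences. The sketch below is a deliberately naive, context-free version: the four rules come from the examples in this section, while the table name, the function name and the plain string replacement are assumptions and say nothing about how the Sequencer is actually implemented.

```python
# Rules taken from the examples in the text; a real table would be per language.
SEQUENCES = {
    "nnn": "mm",   # three minims misread as "nnn"
    "dcn": "den",  # Latin: "dcn" is not a typical sequence
    "qn": "qu",    # Latin: "q" is followed by "u"
    "q;": "que",   # abbreviation: "q" followed by ";"
}

def apply_sequences(text, sequences=SEQUENCES):
    """Replace every known wrong sequence by its correction.

    Longer patterns are applied first so that a long rule is not pre-empted
    by a shorter one. Rules are applied one after another, so the output of
    one rule can in principle be matched by a later rule.
    """
    for wrong in sorted(sequences, key=len, reverse=True):
        text = text.replace(wrong, sequences[wrong])
    return text

print(apply_sequences("sunnna"))  # summa
print(apply_sequences("atq;"))    # atque
```

Adding or removing a rule is then a plain dictionary operation, for example `SEQUENCES["rn"] = "m"` or `del SEQUENCES["qn"]`, which mirrors the idea of extending the database or removing a disruptive sequence.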
VI. Post-OCR correction with Spellchecking
BIT-Alpha’s post-OCR correction is based on the “Levenshtein” distance
algorithm. BIT-Alpha analyses the edit distance between two words: the word
in the text and the reference word from the database. Different editing
operations correspond to different OCR mistakes and may have different weights.
Thanks to this technology, BIT-Alpha can reconstitute words or separate them
with blanks if needed. For example, in German, compound words (which are very
common) can be checked by checking their components individually against
known words from the database, whereas in Latin texts (where compound words
rarely occur) BIT-Alpha separates words that are stuck together by inserting
blanks.
BIT-Alpha makes it possible to switch off the post-OCR correction and to adjust
how aggressively it corrects the raw OCR results.
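A weighted edit distance of the kind described above can be sketched with the classic dynamic-programming table. In this sketch only substitutions carry custom weights, and the weight table, the function names and the `max_distance` parameter (standing in for the "aggressiveness" setting) are all illustrative assumptions rather than BIT-Alpha's actual parameters.

```python
def weighted_levenshtein(source, target, sub_cost=None, default=1.0):
    """Levenshtein distance where selected substitutions are cheaper.

    sub_cost maps (read_char, intended_char) to a cost, so a confusion that
    is typical for OCR (for example "c" read instead of "e") can be
    penalised less than an arbitrary substitution.
    """
    sub_cost = sub_cost or {}
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * default
    for j in range(1, cols):
        dist[0][j] = j * default
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            substitution = 0.0 if a == b else sub_cost.get((a, b), default)
            dist[i][j] = min(
                dist[i - 1][j] + default,           # deletion
                dist[i][j - 1] + default,           # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]

def best_match(word, lexicon, sub_cost=None, max_distance=1.0):
    """Return the closest lexicon entry, or the word itself if none is close enough."""
    if not lexicon:
        return word
    distance, entry = min(
        (weighted_levenshtein(word, entry, sub_cost), entry) for entry in lexicon
    )
    return entry if distance <= max_distance else word
```

Lowering `max_distance` makes the correction more conservative, and setting it to 0 leaves every word that is not already in the lexicon untouched, which is one simple way to model switching the correction off.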
VII. Pictures Treatment/Export
BIT-Alpha has advanced technology for processing pictures (for
example in newspapers).
BIT-Alpha can detect pictures, interpolate dithered images to remove the
dithering, and deliver a high-quality true-colour digital image.
[Figure: dithered image (binary) next to the interpolated image without dithering (greyscale)]
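One common way to turn a dithered binary picture back into grey levels is to average each pixel's neighbourhood, since the local density of white dots encodes the original brightness. The sketch below shows only this generic low-pass approach; the function name and window radius are assumptions, and BIT-Alpha's own interpolation method is not documented here.

```python
def undither(binary, radius=1):
    """Estimate grey levels from a dithered binary image by local averaging.

    binary: rows of 0 (black dot) and 1 (white dot).
    Returns rows of grey levels from 0 (black) to 255 (white); each output
    pixel is the share of white dots in the surrounding window.
    """
    height, width = len(binary), len(binary[0])
    grey = []
    for y in range(height):
        row = []
        for x in range(width):
            # Window clipped at the image borders.
            window = [
                binary[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(round(255 * sum(window) / len(window)))
        grey.append(row)
    return grey
```

A larger radius gives smoother tones at the cost of blurring edges, which is the usual trade-off with this kind of filter.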
VIII. Export of content:
The results can be rendered in different formats, for example:
TXT
PDF with highlighting (text as a transparent overlay over the original image,
allowing users to search, select and copy)
BIT-Alpha creates a lightweight PDF by reducing the resolution (dpi) of the
document in order to facilitate exchange of the document or online
publication.
ALTO (coordinates in pixels or tenths of a millimetre)
TEI
HTML
The HTML export from BIT-Alpha keeps mathematical formulae, pictures,
etc. and places them in the same position as in the original
document.
IX. Contact
Head of sales department
Anne Tomasi,
+33 786 844 845