CODA – CATCHPlus Open Document Annotation | Open Annotation Phase II Experiment

The CODA project is an Open Annotation Phase II experiment that centers on two main use cases concerning the transcription of scanned handwritten historical documents. It builds on several software components implemented for the CATCH and CATCHPlus projects: an annotation repository, a document annotation tool, workspace services, entity detection software, and image processing software for detecting bounding boxes for written lines.

The first use case deals with a collection of 40,000 high-resolution scanned pages from the index books of the Queen’s Cabinet, the archive of the Dutch state and therefore of high historic value. It is maintained by the Dutch National Archive. Part of the scans has been manually transcribed at the level of lines and words; the rest of the collection is made searchable on the basis of ink shapes, using these annotations as input for the advanced machine learning technology of one of the CODA participants. CODA converted the existing transcriptions to OAC annotations and made them searchable. It addressed the issue of annotating the body text of annotations by applying entity recognition to the transcriptions. This was done by wrapping existing NER software for historic Dutch texts in an OAC-compliant web service. Finally, it explores composite annotation targets.

The second use case supports human annotators of the Sailing Letters by automatically detecting bounding boxes for lines as a basis for further manual transcription. Layering of annotations and OAC-compliant web services also play a role in this use case. The Sailing Letters are a collection of 38,000 historical document pages that were stolen by pirates in the 17th and 18th centuries. The collection is manually annotated by volunteers through crowdsourcing.
