Workflow for creating structured TEI documents from Transkribus layout and text recognition
This project is based on page2tei by Dario Kampkaspar and extended with several waves of string replacements and XSLT transformations.
- exports/: Exports from Transkribus in the PAGE format. To process individual parts (as groups of pages), the mets.xml file can be duplicated and reduced to the relevant pages.
- guidelines/: Documentation for the TEI format of certain phenomena
- language_data/: Plain wordlists in a number of languages. To enable more languages, add lists here and adapt replace_hyphens(xml_data, language="any") in replacements.py
- out/: Default output folder
- page2tei/: Project code from dariok/page2tei by Dario Kampkaspar
- xsd/: Validation for the documents in REMS
- xslt/: Stylesheets
  - bibliography.xsl: Turn list entries into bibliography entries and mark monograph titles
  - checkpara.xsl: Testing: Check for paragraph types in unexpected contexts
  - collect-blocks.xsl: Apply standardized blocks for REMS documents
  - disconnect-style-and-type.xsl: Merge tags that differ only by layout but keep the layout information (REMS documents)
  - expand-hi.xsl: Use hi and types instead of different elements for markup
  - id-to-div.xsl: Add ID to REMS documents
  - indent.xsl: Pretty print
  - join-paragraphs.xsl: Join paragraphs across page breaks
  - move-footnotes.xsl: Move footnotes from page end to the footnote mark
  - page-numbers.xsl: Set page number as attribute in page breaks
  - postprocess-page2tei.xsl: Processing of additional styles in the same way as page2tei
  - remove-lb.xsl: Remove line breaks in insignificant positions
  - remove-pb.xsl: Remove page breaks without numbers
  - remove-position-data.xsl: Remove the facsimile element from the XML
  - simplify-hi.xsl: Turn hi into elements without attributes to join them across line breaks
  - string-pack.xsl: See page2tei
  - woelfflin-elements.xsl: Replacement of TEI elements to match the flavour of the project
- bibliography.py: Cascade of operations for the bibliography in REMS
- documents.py: Cascade of operations for the main parts in REMS
- gedanken.py: Cascade of operations for Heinrich Wölfflin's «Gedanken zur Kunstgeschichte» (1941)
- introduction.py: Cascade of operations for the introduction of REMS, and any text without special markup
- replacements.py: Replacements using regular expressions as a catalog of functions
- simplify.py: Cascade of operations for a simple TEI document
- transform.py: XSLT transformations as a catalog of functions
- workflow.md: Step by step description of the workflows
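
The idea of a cascade, a fixed sequence of stylesheets applied one after another, can be sketched as follows. This is only an illustration and not the code in transform.py: the function names saxon_command and run_cascade are hypothetical, and the sketch assumes that the Saxon jar named in the setup steps is called through java -jar with Saxon's standard -s:, -xsl: and -o: options, writing intermediate files to temp/.

```python
import subprocess
from pathlib import Path

# Assumption: the jar sits in the repository root, as in the setup steps.
SAXON_JAR = Path("saxon-he-10.5.jar")


def saxon_command(source, stylesheet, target, jar=SAXON_JAR):
    """Build the Saxon command line for one XSLT transformation."""
    return [
        "java", "-jar", str(jar),
        f"-s:{source}",        # input document
        f"-xsl:{stylesheet}",  # stylesheet to apply
        f"-o:{target}",        # output document
    ]


def run_cascade(source, stylesheets, temp_dir="temp", runner=subprocess.run):
    """Apply the stylesheets in order; each step reads the previous output.

    Hypothetical helper: the intermediate file naming is an assumption.
    Returns the path of the last file written.
    """
    current = Path(source)
    for step, name in enumerate(stylesheets, start=1):
        target = Path(temp_dir) / f"step{step:02d}-{Path(name).stem}.xml"
        runner(saxon_command(current, Path("xslt") / name, target), check=True)
        current = target
    return current
```

A call such as run_cascade("temp/page2tei.xml", ["join-paragraphs.xsl", "move-footnotes.xsl", "indent.xsl"]) would then chain three of the stylesheets listed above; the input file name and the order shown here are illustrative only.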
- Copy saxon-he-10.5.jar into the root of this repository
- Make sure the following folders exist in the project root: exports, out, temp
- Extract your export from Transkribus to exports
- Choose among the suggested cascades or add your own. introduction.py is the most generic one
- Run it like:
  python3 introduction.py -i exports/your_export/mets.xml -o out/your_output.xml
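
The first two setup steps can also be checked from Python before running a cascade. The following is a minimal sketch under the assumption that the folder and jar names are exactly those given above; prepare_project is a hypothetical helper and not part of this repository.

```python
from pathlib import Path

# Names taken from the setup steps above.
REQUIRED_FOLDERS = ("exports", "out", "temp")
SAXON_JAR = "saxon-he-10.5.jar"


def prepare_project(root="."):
    """Create missing working folders and report whether the Saxon jar is present.

    Hypothetical helper: returns True if the jar exists in the project root,
    False otherwise. It never downloads or moves the jar itself.
    """
    root = Path(root)
    for name in REQUIRED_FOLDERS:
        # exist_ok=True leaves folders that already exist untouched
        (root / name).mkdir(exist_ok=True)
    return (root / SAXON_JAR).is_file()
```

Calling prepare_project() from the repository root creates any missing folder and returns False as long as the jar still has to be copied in.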