Semantic Writer Identification for Historical Documents of Swiss Foreign Politics

Contact: Marco Peer

Overview

This thesis explores semantic writer identification for the historical documents available at DODIS. The corpus consists of approximately 50,000 documents related to Swiss foreign policy. A key research question is whether the documents contain sufficient semantic variety to reliably capture individual writing styles.

Existing datasets in related work have limitations: some are small (e.g., IMDB62), while others are topically diverse (e.g., social media posts), which makes it difficult to isolate the semantic features that characterize individual writers. In contrast, our dataset offers approximately 25,000 documents with known authors (8,000 writers with three or more documents), all narrowly focused on Swiss foreign policy.
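
The "three or more documents" criterion can be illustrated with a short sketch. The record layout (pairs of author and document ID), the function name, and the toy labels are hypothetical; the actual DODIS metadata format is not specified here.

```python
from collections import Counter


def writers_with_min_docs(records, min_docs=3):
    """Return the set of authors that have at least `min_docs` documents.

    `records` is an iterable of (author, doc_id) pairs. This layout is an
    assumption for illustration, not the DODIS metadata schema.
    """
    counts = Counter(author for author, _ in records)
    return {author for author, n in counts.items() if n >= min_docs}


# Toy example with made-up author labels.
toy_records = [
    ("A", 1), ("A", 2), ("A", 3),
    ("B", 4), ("B", 5),
    ("C", 6), ("C", 7), ("C", 8), ("C", 9),
]
eligible = writers_with_min_docs(toy_records)
```

A minimum document count of this kind ensures that each retained writer has at least one query document and some remaining documents to match against.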

Objectives

  • Review state-of-the-art methods for writer identification and style extraction
  • Select an appropriate baseline method from related work (e.g., transformer-based architectures, LLMs)
  • Implement and evaluate semantic writer identification techniques on the DODIS corpus

Methodology

  1. Literature Review – Examine current approaches in writer identification, style extraction, and semantic feature modeling
  2. Implementation – Implement semantic analysis and writing style extraction for writer identification
  3. Evaluation – Benchmark the pipeline on the DODIS dataset, with optional comparison to other datasets in related work
  4. Reporting – Produce the final thesis and presentation, with an optional publication
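
As a rough illustration of what the evaluation step could look like, the sketch below computes leave-one-out top-1 accuracy for writer retrieval. It deliberately swaps in a very simple stand-in, character trigram counts compared by cosine similarity, rather than the semantic or transformer-based methods the thesis targets. All function names, the (author, text) layout, and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lower-cased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def leave_one_out_top1(docs):
    """Top-1 accuracy: each document queries all others, and a hit means
    the most similar document was written by the same author.

    `docs` is a list of (author, text) pairs (hypothetical layout).
    """
    feats = [char_ngrams(text) for _, text in docs]
    hits = 0
    for i, (author, _) in enumerate(docs):
        # Index of the most similar document, excluding the query itself.
        best = max(
            (j for j in range(len(docs)) if j != i),
            key=lambda j: cosine(feats[i], feats[j]),
        )
        hits += docs[best][0] == author
    return hits / len(docs)


# Toy data with made-up authors and artificially distinct texts.
toy_docs = [
    ("A", "aaaa aaaa aaaa"),
    ("A", "aaaa aaaa"),
    ("B", "zzzz zzzz zzzz"),
    ("B", "zzzz zzzz"),
]
accuracy = leave_one_out_top1(toy_docs)
```

In an actual pipeline, `char_ngrams` would be replaced by the learned style or semantic embedding under study, while the retrieval loop and the metric could stay the same, which keeps baselines and proposed methods comparable.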

Skills Required

  • Proficiency in Python programming
  • Experience with deep learning frameworks (preferably PyTorch)
  • Interest in historical document analysis and natural language processing

Created: 01.09.2025