Corpus Linguistics for Digital Humanities. Introduction to Methods and Tools

The course can be attended without specific prerequisites.

Corpora, i.e. collections of linguistic data (texts or conversations), are a fundamental asset of digital humanities research. A ubiquitous task for humanists is to explore questions related to language:

  • Which forms of the verb to be occur in a given text?
  • What are common ways to refer to certain entities, such as kings or VIPs?
  • Is it common for number words to be preceded by articles? Do elaborate number words (two thousand three hundred and four) or numerals (12449223) occur at all?
  • What types of phrases occur as direct objects in philosophical texts?
  • What kinds of ellipsis occur in section headings?
  • What kinds of speech acts occur in modern conversations?
  • Are texts by women longer (or shorter) than texts by men?

To answer such questions, it is necessary to select and prepare data. We will discuss different approaches to compilation and annotation of corpora. The methodology stems from computational and corpus linguistics, but is used more widely in processing linguistic data in the digital humanities.

Questions such as those sketched above can only be approached if an adequate selection of texts is available; for instance, one will not find much evidence about conversational practices in parliamentary speeches or mathematical papers. Hence we will first be concerned with criteria and methods for compiling corpora: selecting texts based on extra- and intralinguistic criteria, including property rights.

Furthermore, linguistic data must be described by metadata, so that one can find, e.g., utterances by female native speakers of Southern German in the second half of the 20th century about political developments in an informal setting. Approaches to metadata will be explored.
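To make the role of metadata concrete, here is a minimal Python sketch of such a selection. The field names (speaker_gender, native_variety, and so on), the select helper, and the three sample records are all invented for illustration; real corpora use their own metadata schemes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One utterance plus its metadata (hypothetical field names)."""
    text: str
    speaker_gender: str
    native_variety: str
    year: int
    topic: str
    setting: str


def select(utterances, **criteria):
    """Return the utterances whose metadata satisfy every criterion.

    A criterion is either a plain value (tested for equality) or a
    one-argument function that returns True for acceptable values.
    """
    def matches(utterance):
        for field, wanted in criteria.items():
            value = getattr(utterance, field)
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True

    return [u for u in utterances if matches(u)]


# Invented sample records, not taken from any real corpus.
CORPUS = [
    Utterance("utterance A", "female", "Southern German", 1968, "politics", "informal"),
    Utterance("utterance B", "male", "Southern German", 1972, "politics", "informal"),
    Utterance("utterance C", "female", "Southern German", 1931, "politics", "formal"),
]

hits = select(
    CORPUS,
    speaker_gender="female",
    native_variety="Southern German",
    year=lambda y: 1950 <= y <= 1999,  # second half of the 20th century
    topic="politics",
    setting="informal",
)
print([u.text for u in hits])
```

The point of the sketch is only that a query of this kind is possible if, and only if, the relevant properties were recorded as metadata when the corpus was compiled.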

It is also often useful to annotate linguistic data with respect to: pragmatic structure such as speech acts or rhetorical relations; semantic elements such as named entities, e.g. all the kings and queens in Europe or their republican counterparts; linguistic information, e.g. dependencies, parts of speech, or lemmas (reducing went to go). Moreover, annotating information on text structure or layout may be useful, e.g., text in headings, footers, footnotes, italics or boldface.
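As a small illustration of what token-level annotation buys us, the following Python sketch reads a hand-annotated sample in a one-token-per-line, tab-separated layout (word form, part of speech, lemma) and retrieves all forms of a given lemma. The sample sentences, the tag names, and the helpers parse_vertical and forms_of are assumptions made for this example; actual corpora and tools define their own formats and tag sets.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (illustrative tag names)
    lemma: str  # dictionary form, e.g. "go" for "went"


# Hand-annotated toy sample: one token per line, fields separated by tabs.
ANNOTATED = """\
She\tPRON\tshe
went\tVERB\tgo
home\tADV\thome
.\tPUNCT\t.
They\tPRON\tthey
are\tAUX\tbe
going\tVERB\tgo
.\tPUNCT\t.
"""


def parse_vertical(text):
    """Turn the tab-separated lines into a list of Token records."""
    return [Token(*line.split("\t")) for line in text.splitlines() if line.strip()]


def forms_of(tokens, lemma):
    """Return the distinct (lowercased) word forms annotated with `lemma`."""
    return sorted({t.form.lower() for t in tokens if t.lemma == lemma})


tokens = parse_vertical(ANNOTATED)
print(forms_of(tokens, "go"))  # ['going', 'went']
print(forms_of(tokens, "be"))  # ['are']
```

Without the lemma layer, finding went and going together would require listing every form by hand; with it, one query on the lemma suffices.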

Some of these annotations can be carried out (partly) automatically. We will discuss what tools exist and are available.

Depending on the type of processing and annotation, questions such as those given above are more or less difficult to answer, since finding and counting the relevant data can range from very easy to quite difficult. This course will present fundamental techniques for searching in corpora, viz.

  • searching for single word forms,
  • searching with wildcards or distance operators,
  • searching with regular expressions for similar word forms,
  • searching in hierarchical annotation to find syntactic or semantic configurations.
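The first three techniques can be sketched in a few lines of Python on a plain, unannotated text. This is only a rough approximation under stated assumptions: the tokenizer is deliberately naive, the sample sentence is invented, and the function names (find_word, find_wildcard, find_regex, find_within) are hypothetical and do not correspond to any particular corpus query language.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_word(tokens, word):
    """Positions of a single word form."""
    return [i for i, tok in enumerate(tokens) if tok == word]


def find_wildcard(tokens, pattern):
    """Positions of tokens matching a shell-style wildcard such as 'king*'."""
    return [i for i, tok in enumerate(tokens) if fnmatchcase(tok, pattern)]


def find_regex(tokens, pattern):
    """Positions of tokens that match a regular expression as a whole."""
    rx = re.compile(pattern)
    return [i for i, tok in enumerate(tokens) if rx.fullmatch(tok)]


def find_within(tokens, first, second, max_distance):
    """Pairs (i, j) where `second` follows `first` by at most max_distance tokens."""
    hits = []
    for i in find_word(tokens, first):
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            if tokens[j] == second:
                hits.append((i, j))
    return hits


# Invented sample text.
SAMPLE = "The king was old. The kings were wise, and the queen is the wisest of all."
TOKENS = tokenize(SAMPLE)

print(find_word(TOKENS, "king"))                                  # [1]
print(find_wildcard(TOKENS, "king*"))                             # [1, 5]
print(find_regex(TOKENS, r"am|is|are|was|were|be|been|being"))    # [2, 6, 11]
print(find_within(TOKENS, "the", "wisest", 2))                    # [(12, 13)]
```

Searching in hierarchical annotation is not covered by this sketch, since it requires a corpus that already carries syntactic or semantic structure.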

You will learn about different query languages used for searching in corpora and, time permitting, also consider simple statistical evaluations of texts.
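By way of illustration, two of the simplest statistical evaluations, a frequency list and a comparison of mean text length across groups, might look as follows in Python. The tokenizer is naive, the group labels and mini-texts are invented, and the helper names frequency_list and mean_length_by_group are hypothetical.

```python
import re
from collections import Counter
from statistics import mean


def tokens(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def frequency_list(texts, n=3):
    """The n most frequent word forms across all texts, with their counts."""
    counts = Counter(tok for text in texts for tok in tokens(text))
    return counts.most_common(n)


def mean_length_by_group(docs):
    """Mean text length in tokens per group; docs is a list of (group, text)."""
    lengths = {}
    for group, text in docs:
        lengths.setdefault(group, []).append(len(tokens(text)))
    return {group: mean(values) for group, values in lengths.items()}


# Invented mini-corpus: (group label, text).
DOCS = [
    ("a", "one two three"),
    ("a", "one"),
    ("b", "one two"),
]

print(frequency_list([text for _, text in DOCS], n=2))  # [('one', 3), ('two', 2)]
print(mean_length_by_group(DOCS))                       # {'a': 2, 'b': 2}
```

A difference in means on its own says little; whether it is meaningful depends on the size and composition of the corpus, which is why corpus compilation comes first.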

The course will be mainly concerned with textual corpora, but as searching on speech or multimodal corpora is generally carried out on the transcription and annotation layers, it will also be useful to researchers dealing with such data.

To sum up, we will cover:

  • corpus construction and annotation in the first week and
  • corpus search and evaluation in the second week.