A Syntax-Based Scheme for the Annotation and Segmentation of German Spoken Language Interactions (2018)

Swantje Westpfahl, Jan Gorisch
Conference Paper In: Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), 109-120. Workshop at COLING 2018, Santa Fe, New Mexico, 25-26 August 2018.

Abstract

Unlike corpora of written language where segmentation can mainly be derived from orthographic punctuation marks, the basis for segmenting spoken language corpora is not predetermined by the primary data, but rather has to be established by the corpus compilers. This impedes consistent querying and visualization of such data. Several ways of segmenting have been proposed, some of which are based on syntax. In this study, we developed and evaluated annotation and segmentation guidelines in reference to the topological field model for German. We can show that these guidelines are used consistently across annotators. We also investigated the influence of various interactional settings with a rather simple measure, the word-count per segment and unit-type. We observed that the word count and the distribution of each unit type differ in varying interactional settings and that our developed segmentation and annotation guidelines are used consistently across annotators. In conclusion, our syntax-based segmentations reflect interactional properties that are intrinsic to the social interactions that participants are involved in. This can be used for further analysis of social interaction and opens the possibility for automatic segmentation of transcripts.

A Study on Gaps and Syntactic Boundaries in Spoken Interaction. (2018)

Thomas Schmidt, Swantje Westpfahl
Conference Paper In: Proceedings of KONVENS 2018, 40-49. Vienna, Austria, 19-21 September 2018.

Abstract

We present a study on gaps in spoken language interaction as a potential candidate for syntactic boundaries. On the basis of an online annotation experiment, we can show that there is an effect of gap duration and gap type on its likelihood of being a syntactic boundary. We discuss the potential of these findings for an automation of the segmentation process.

Diskursmarker aus korpuslinguistischer Sicht – POS-Annotation von Diskursmarkern in FOLK (2017)

Swantje Westpfahl
Journal Paper In: Blühdorn, Hardarik / Deppermann, Arnulf / Helmer, Henrike / Spranz-Fogasy, Thomas (eds.): Diskursmarker im Deutschen. Reflexionen und Analysen. Göttingen: Verlag für Gesprächsforschung, 285-309.

Abstract

How can discourse markers be identified for research in a corpus of transcripts of spoken language? What is part-of-speech tagging and how does it work? In this paper we show how part-of-speech tagging was developed for the Research and Teaching Corpus of Spoken German (FOLK) with respect to phenomena typical of spoken language, using discourse markers as an example. We present discourse markers from the perspective of machine learning, i.e. how the POS category of discourse marker can be defined in such a way that it can be annotated automatically by a POS tagger. Finally, this paper illustrates how discourse markers other than those annotated by the POS category can be found in the database.

STTS 2.0. Guidelines für die Annotation von POS-Tags für Transkripte gesprochener Sprache in Anlehnung an das Stuttgart Tübingen Tagset (STTS) (2017)

Swantje Westpfahl, Thomas Schmidt, Jasmin Jonietz, Anton Borlinghaus
Report Internet Document

Abstract

The guidelines extend the STTS (Schiller et al. 1999) for the annotation of transcripts of spoken language. The tagset is based on the annotation of the FOLK corpus of the IDS Mannheim (Schmidt 2014) and extends the STTS to cover phenomena typical of spoken language and peculiarities of their transcription. It was developed as part of the dissertation project "POS für(s) FOLK – Entwicklung eines automatisierten Part-of-Speech-Tagging von spontansprachlichen Daten" (Westpfahl 2017 (in preparation)).

FOLK-Gold ― A gold standard for part-of-speech-tagging of spoken German (2016)

Swantje Westpfahl, Thomas Schmidt
Conference Paper In: Proceedings of the Tenth Conference on International Language Resources and Evaluation (LREC’16), Portorož, Slovenia. Paris: European Language Resources Association (ELRA), pp. 280-287.

Abstract

In this paper, we present a GOLD standard of part-of-speech tagged transcripts of spoken German. The GOLD standard data consists of four annotation layers – transcription (modified orthography), normalization (standard orthography), lemmatization and POS tags – all of which have undergone careful manual quality control. It comes with guidelines for the manual POS annotation of transcripts of German spoken data and an extended version of the STTS (Stuttgart Tübingen Tagset) which accounts for phenomena typically found in spontaneous spoken German. The GOLD standard was developed on the basis of the Research and Teaching Corpus of Spoken German, FOLK, and is, to our knowledge, the first such dataset based on a wide variety of spontaneous and authentic interaction types. It can be used as a basis for further development of language technology and corpus linguistic applications for German spoken language.

User, who art thou? User Profiling for Oral Corpus Platforms. (2016)

Christian Fandrych, Elena Frick, Hanna Hedeland, Anna Iliash, Daniel Jettka, Cordula Meißner, Thomas Schmidt, Franziska Wallner, Kathrin Weigert, Swantje Westpfahl
Conference Paper In: Proceedings of the Tenth Conference on International Language Resources and Evaluation (LREC’16), Portorož, Slovenia. Paris: European Language Resources Association (ELRA), pp. 280-287.

Abstract

This contribution presents the background, design and results of a study of users of three oral corpus platforms in Germany. Roughly 5,000 registered users of the Database for Spoken German (DGD), the GeWiss corpus and the corpora of the Hamburg Centre for Language Corpora (HZSK) were asked to participate in a user survey. This quantitative approach was complemented by qualitative interviews with selected users. We briefly introduce the corpus resources involved in the study in section 2. Section 3 describes the methods employed in the user studies. Section 4 summarizes results of the studies focusing on selected key topics. Section 5 attempts a generalization of these results to larger contexts.

Tagset und Richtlinie für das Part-of-Speech-Tagging von Sprachdaten aus Genres internetbasierter Kommunikation. Guideline document from the Empirikom shared task on automatic linguistic annotation of internet-based communication (EmpiriST 2015)

Michael Beißwenger, Thomas Bartz, Angelika Storrer, Swantje Westpfahl
Report Internet Document

STTS 2.0? Improving the Tagset for the Part-of-Speech-Tagging of German Spoken Data. (2014)

Swantje Westpfahl
Conference Paper In: Lori Levin and Manfred Stede (eds.): Proceedings of LAW VIII - The 8th Linguistic Annotation Workshop. Dublin, Ireland: Association for Computational Linguistics and Dublin City University, pp. 1–10.

Abstract

Part-of-speech tagging (POS-tagging) of spoken data requires different means of annotation than POS-tagging of written and edited texts. In order to capture the features of German spoken language, a distinct tagset is needed to respond to the kinds of elements which only occur in speech. In order to create such a coherent tagset, the most prominent phenomena of spoken language need to be analyzed, especially with respect to how they differ from written language. First evaluations have shown that the most prominent cause (over 50%) of errors in the existing automated POS-tagging of transcripts of spoken German with the Stuttgart Tübingen Tagset (STTS) and the TreeTagger was the inaccurate interpretation of speech particles. One reason for this is that this class of words is virtually absent from the current STTS. This paper proposes a recategorization of the STTS in the field of speech particles based on distributional factors rather than semantics. The ultimate aim is to create a comprehensive reference corpus of spoken German data for the global research community. It is imperative that all phenomena are reliably recorded in future part-of-speech tag labels.

POS für(s) FOLK – Part of Speech Tagging des Forschungs- und Lehrkorpus Gesprochenes Deutsch. (2013)

Swantje Westpfahl, Thomas Schmidt
Journal Paper In: Journal for Language Technology and Computational Linguistics, no. 1, pp. 139-156.

Abstract

The FOLK project (Forschungs- und Lehrkorpus Gesprochenes Deutsch, the Research and Teaching Corpus of Spoken German) is building a large corpus of spoken interaction at the Institut für Deutsche Sprache (IDS) that is open to the academic community. Within this project, automated part-of-speech tagging (POS tagging) of spontaneous speech is to be made possible with the help of the TreeTagger (Schmid 1995) and the Stuttgart-Tübingen Tagset (STTS) (Schiller et al. 1999). Initially applied only to FOLK, the tagger is later to be used for further corpora of spontaneous speech in the Database for Spoken German (DGD) (Institut für Deutsche Sprache). Because the corpus is continuously being expanded, POS tagging must become fully automatic in the medium term for reasons of efficiency. The target error rate is below 5 percent. Because both the tagset and the tagger were designed for, or trained on, written language, and the error rate in automated tagging of the transcripts was almost 20 percent, both the tagging procedure and the tagset need to be adapted to spontaneous speech. For this reason, the errors that occurred in a first attempt at automatically tagging three transcripts of the corpus with the TreeTagger and the STTS were analyzed for their causes. On this basis, proposals were made for improving POS tagging by adapting both the tagset and the tagging procedure.

Gesprächsforschung (2013)

Lucia Weiger, Swantje Westpfahl
Report In: Gesprächsforschung – Online-Zeitschrift zur verbalen Interaktion, vol. 15, pp. 73-86. Mannheim: Verlag für Gesprächsforschung, 2014. (Gesprächsforschung 2014). ISSN: 1617-1837.