I’m a linguist working in computational linguistics. In my research, I focus on how in-depth linguistic analysis can improve machine learning techniques, both rule-based and statistical models. I like testing linguistic theories against real data (e.g. through corpus analysis) to see whether and how they can be useful in applications. Conversely, based on linguistic analysis of the data, I want to improve existing theories or build new ones where needed. In short, my aim is to apply linguistics to real-world problems and phenomena.
It is the aim of this project to develop methods for the segmentation of spoken language. Those methods are to be based on linguistic knowledge and at the same time adequate for the analysis of spoken language on various linguistic levels as well as for the development of tools in computational linguistics. The publication of a guideline for a systematic segmentation of various types of German and French verbal interaction is a milestone of this project.
In the second stage, the possibilities of automated segmentation of spoken language corpora based on the segmentation guidelines will be tested and documented. In this way, the project not only improves the usability of the three databases involved but also deepens our knowledge of the structures of spoken language.
SegCor is a project funded by the German Research Foundation (DFG) and the French National Research Agency (ANR). It is a cooperation between the Pragmatics department of the Institute for the German Language (IDS Mannheim), the University of Lyon and the University of Orleans.
The aim of the project is the automated annotation of the “Research and Teaching Corpus of Spoken German” (FOLK) with part-of-speech tags. State-of-the-art taggers such as the TreeTagger with the STTS (Stuttgart-Tübingen Tagset) achieve an accuracy of only 60% to 80% on our transcripts of spoken German.
Part-of-speech tagging (POS-tagging) of spoken data requires different means of annotation than POS-tagging of written and edited texts. In order to capture the features of German spoken language, a distinct tagset is needed to respond to the kinds of elements which only occur in speech. In order to create such a coherent tagset the most prominent phenomena of spoken language need to be analyzed, especially with respect to how they differ from written language.
The adaptation of the tagset was done in cooperation with the working group “speech particles” of the STTS workshop “The STTS-Tagset for part-of-speech annotation: state of affairs and perspectives” and Prof. Dr. Hardarik Blühdorn of the Grammar department of the IDS Mannheim.
FOLK is the “Forschungs- und Lehrkorpus Gesprochenes Deutsch” (English: Research and Teaching Corpus of Spoken German). The project has set itself the aim of building a corpus of German conversations which a) covers a broad range of interaction types in private, institutional and public settings, b) is sufficiently large and diverse and of sufficient quality to support different qualitative and quantitative research approaches, c) is transcribed, annotated and made accessible according to current technological standards, and d) is available to the scientific community on a sound legal basis and without unnecessary restrictions of usage.
(text by Thomas Schmidt 2014)
“LiSe-DaZ – Linguistische Sprachstandserhebung – Deutsch als Zweitsprache” (linguistic language assessment for German as a second language) was developed by Prof. Dr. Rosemarie Tracy, University of Mannheim, and Prof. Dr. Petra Schulz, University of Frankfurt/Main, on behalf of the Baden-Württemberg Foundation. This procedure makes it possible to assess the linguistic development of children between three and seven years old, both native speakers and speakers of German as a second language. For the first time, the development of the children's spoken language abilities can be assessed according to various linguistic stages, and on this basis remedial teaching can be tailored to the needs of each child. LiSe-DaZ was standardized with 912 children all over Germany. The tool was published by Hogrefe Verlag in summer 2011.