
Rūta Petrauskaitė
Vytautas Magnus University
Rūta Petrauskaitė is a professor at the Department of Lithuanian Studies. She is currently the director of the Institute of the Digital Resources and Interdisciplinary Research (SITTI) at Vytautas Magnus University.
Over the last decade she has served as a vice-president of the Research Council of Lithuania and as the Chair of the Committee of Social Sciences and Humanities. Internationally, she has been involved in the activities of the Common Language Resources and Technology Infrastructure (CLARIN), the Science Europe Research Data working group and the European Open Science Cloud (EOSC).
Her research interests cover a range of topics from linguistics to discourse analysis. She initiated and supervised the compilation of the first large corpora of the Lithuanian language, as well as corpus-based research in several fields of linguistics.
Rūta Petrauskaitė is a proponent of data-driven research, Open Science and data sharing initiatives.
https://www.vdu.lt/cris/entities/person/ruta-petrauskaite
Abstract
Corpora and Data. From John Sinclair to Artificial Intelligence
Last year we celebrated the thirtieth anniversary of corpus linguistics in Lithuania. The advent of the new trend was gradual but nevertheless groundbreaking. Our participation in the EU projects TELRI I and TELRI II sped up the compilation of the first corpora of the Lithuanian language, which was followed by corpus-based research. To deal with corpora we badly needed new methodological approaches; happily, by that time they were already available in the publications of John Sinclair and in his work related to COBUILD. TELRI was beneficial because of the co-operation with linguists from other countries, but most of all because of the revolutionary ideas and personality of John Sinclair.
John Sinclair was ahead of his time in his attempts to describe how meaning is created in human language. His holistic approach is based on a few key concepts: lexical items, or extended units of meaning, as opposed to orthographic words, comprising elements of lexis (collocation), grammar (colligation), semantics (semantic preference) and pragmatics (attitudinal meaning). His effort, more than thirty years ago, to do away with the historical split between lexis and grammar and to show the close relation between the two types of pattern was truly astonishing.
The main cornerstones of his theory of language included: a) the reunification of grammar as structure and lexis as vocabulary in the creation of meaning in text, i.e., form and meaning in language cannot be separated; b) the importance of co-text and context for generating and understanding meaning; c) reliance on corpora as large amounts of language data for pattern detection rather than hypothesis testing, i.e., a corpus-driven rather than a corpus-based approach; d) reluctance to trust man-made annotation based on consensus grammar.
John Sinclair passed away in 2007, before the neural revolution, so he did not witness its main developments, which went along the same lines as he had suggested for corpus linguistics. The major steps in the direction of AI were as follows. The year 1990 marked the shift from rule-based to statistics-based methods and machine learning. The year 2014 brought neural language technologies, which caused a major paradigm shift in natural language processing, specifically the shift from rule-based to data-driven approaches. The focus has increasingly moved toward high-quality corpus modelling rather than reliance on explicit grammar rules or predefined linguistic annotations. Large language models like GPT, released three years ago, represent this evolution: they are fundamentally data-driven but increasingly incorporate techniques to inject linguistic knowledge and structure where it is beneficial. High-quality corpora enabled models to learn language patterns effectively, and this is how AI learned languages: by encompassing broad co-text and capturing the richness and complexity of natural language.