Scandiasyn - IceDiaSyn

Project description

1. The Scandinavian "parent project" Scandinavian Dialect Syntax (ScanDiaSyn)

The general objectives of ScanDiaSyn have been described as follows:

to conduct a systematic and coordinated investigation of syntactic variation across Scandinavian languages and dialects
to create databases of transcribed and tagged material generally available and easily accessible for research through a user friendly interface on the internet
to initiate theoretically driven research on dialectal syntactic variation in the Scandinavian domain
to cooperate with other existing dialect syntax projects in Europe and elsewhere so as to enhance the understanding of linguistic diversity and microvariation at a general level

There are 9 groups of researchers involved in the Scandinavian project (Tromsø (NO), Aarhus (DK), Reykjavík (IS), Tórshavn (FO), Trondheim (NO), Helsinki (FI), Copenhagen (DK), Lund (SE) and Oslo (NO)). The central idea is that they will all be planning and conducting research on syntactic variation in their respective languages and dialects, and that this research will be "systematic and coordinated" in the sense that the methods of data collection, elicitation and storage (in databases) will be comparable and compatible to the extent possible and feasible. While all kinds of "syntactic variation" are in principle of interest for the project, the following overarching topics have been outlined at preparatory meetings for ScanDiaSyn as a whole (they are described here using common but fairly technical terms that are largely taken from generative syntax but to some extent go back to the Danish linguist Diderichsen in the 1940s). It should be noted that this grouping of topics is made simply for convenience – it has no particular theoretical status:

The "pre-field" in the sentence or clause (“forfeltet”, CP, Left Periphery), for example: subject/verb inversion (V2), question formation, complementizers, topicalization etc.
The "mid-field" (“midtfeltet”, IP), for example: V-to-I movement, Object Shift, placement of adverbs/particles, subject/verb agreement, finiteness, case marking of subjects, expletive constructions etc.
The "end-field" (“sluttfeltet”, VP), for example: the syntax of verb particles, VP-syntax, complementation, subcategorization, case marking of objects etc.
Extraposition, for example: heavy NP/CP shift, Left Dislocation, Hanging Topics, tags and pragmatic particles etc.
Nominal expressions, for example: article syntax, possessor constructions, pronominal systems, binding etc.

Although the topics listed here are very general (and partially overlapping), this broad thematic organization is nevertheless quite likely to be useful when the findings of ScanDiaSyn are applied to a wider European context, i.e. when the syntax of the Scandinavian dialects is compared to the syntax of the dialects in other European languages. There is for instance a partial overlap between the five major ScanDiaSyn topics and the major topics of the Dutch dialect syntax project SAND (Syntactische Atlas van de Nederlandse Dialecten).

Now it is of course impossible for all the groups to investigate all the topics listed above in detail. In addition, some of the topics will not be equally interesting in all the languages or dialects – and there may also be additional topics that are of particular interest in the relevant language or of particular interest to individual researchers in the group in question. Hence a complete coordination is neither possible nor desirable. Furthermore, the "state of the art" will necessarily be somewhat different from one language area to the other, as will the availability of competent researchers. Broadly speaking, however, each of the research groups will choose from the list a few phenomena that they will be particularly occupied with, possibly in addition to topics that will be of comparative interest for members of other groups or topics that are of special "internal interest", as it were. In this way each group will both be carrying out comparative research, partly with special interests of the other groups in mind, and collecting material for its own interest.

Considerable parts of the material collected or prepared in the project will be pooled in a joint database. The research group in Oslo will have the responsibility of building up the database and developing the technical tools and applications needed. This work will obviously to some extent be based on previous work done in comparable projects elsewhere, such as the methods used in the presentation of SweDia 2000 and work on the linguistic tagging and preparation of databases and corpora in the different countries, both of spoken and written language, including the work done in Iceland in that area (see "State of the Art" in 14 below).

Although the topics described above are largely couched in technical terms from generative syntax (partly to ease comparison with other projects, partly because the original initiative for the Scandinavian project came from generative syntacticians), there are linguists of very different persuasions and with quite varying interests in the 9 groups in the ScanDiaSyn project (see the network groups at the ScanDiaSyn homepage): some are generative syntacticians, some are sociolinguists, some are specialists in language technology, some are dialectologists, some specialize in corpus linguistics, others in conversational analysis or interactional grammar, etc. The leading idea is that the different types of linguists represented can benefit from working together and learn from each other. Thus the syntacticians can for instance define some of the linguistic variables to be investigated and formulate some of the linguistic hypotheses to be tested, the sociolinguists can provide insights into data elicitation and ways of relating linguistic variation to social variables, the language technologists know how to prepare and handle databases, etc. In this way we hope that the different types of linguists and technologists can broaden their horizons. Thus the syntacticians will realize, for instance, that there is more to (scientific) life than traditional syntactic research and the typical introspective procedures of judging sentences (see also the discussion of "state of the art" in 14 and of methodology in 17 below), the sociolinguists will learn something about linguistic hypotheses and ideas about universal grammar, the language technologists will be able to make better use of linguistic concepts and tools in their work, those who work within the framework of interactional grammar will realize that they can benefit from some of the insights of theoretical syntax, etc.
The make-up of the groups is quite different, however, and influenced by the availability of experts in each field and the relevant work that has already been done. This will be described in some detail for the Icelandic group below (and for the Faroese one, to the extent that it is relevant here).

Most of the groups have access to material that has previously been collected and to some extent even analyzed or tagged. This material will be investigated and made use of. But since many syntactic constructions and phenomena are infrequent in actual conversation and written texts, one generally needs a much bigger corpus for conducting syntactic investigations than is the case with the study of phonological and morphological phenomena. Hence it is in general not enough for the purposes of syntactic investigations to collect spontaneous speech data or consider written texts. Although one will of course get more types and tokens of the different syntactic constructions as the size of the corpus is increased, it is simply impossible to make up a corpus that is big enough to do justice to all conceivable syntactic constructions. Simply put, a large corpus can tell you a lot about which constructions are possible but it does not really tell you which ones are impossible. Hence the syntactician has to rely on other methods too, such as questionnaires, other written or oral tasks, etc. That way it is often possible to get a more direct channel towards revealing the rules governing syntactic phenomena. At the same time, such elicitation runs the risk of becoming artificial to some extent, but there are various ways of trying to minimize that problem (see e.g. Schütze 1996, Bard et al. 1996, Cowart 1997, Cornips and Poletto 2004 and references cited there for relevant discussion – see also Rickford 1987). This problem has been extensively discussed at preparatory meetings for ScanDiaSyn. One method which aims at "getting the best of both worlds" has been developed in the Dutch dialect syntax project SAND (cf. above). When this method is used, the linguists develop a written questionnaire covering the various phenomena that they are interested in. Then interviews are conducted, centering around the questionnaires.
When there are considerable dialectal differences, as was often the case in the Dutch project and will sometimes be in Mainland Scandinavia, it will be preferable to have two informants speaking the same dialect discuss the topics described in the questionnaire. The interview, or discussion, is recorded and the basic idea is that the material will then contain both judgments of syntactic examples and spontaneous speech (see also the discussion of "methodology" in 17 below).
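The point that a corpus cannot show which constructions are impossible can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the project's methodology: it assumes (as a simplification) that each sentence independently contains a given construction with a fixed rate, and the function names and the rate used in the usage note are purely illustrative.

```python
import math


def prob_unattested(rate_per_sentence: float, corpus_sentences: int) -> float:
    """Probability that a construction never occurs in a corpus.

    Assumes each sentence independently contains the construction
    with probability `rate_per_sentence` (an idealization).
    """
    return (1.0 - rate_per_sentence) ** corpus_sentences


def sentences_needed(rate_per_sentence: float, confidence: float = 0.95) -> int:
    """Smallest corpus size (in sentences) at which the construction is
    attested at least once with probability `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate_per_sentence))
```

Under these assumptions, a grammatical construction that occurs in one sentence out of a thousand needs a corpus of roughly three thousand sentences (`sentences_needed(0.001)`) before its absence becomes even moderately surprising, and the required size grows quickly as the rate falls. Absence from the corpus is therefore weak evidence of ungrammaticality, which is why elicitation is needed as well.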

In addition to collection of new material, the plan is to include existing Scandinavian material in the ScanDiaSyn database, such as existing collections of spontaneous speech. Most of the research groups have access to such material, including the SweDia material already mentioned. The SweDia material has already been offered to the ScanDiaSyn project for use. However, these recordings have only been transcribed to a very limited extent and it will be the task of the Lund and Helsinki groups within ScanDiaSyn to transcribe this material. For Danish dialects there already exists a tagged dialect syntax corpus of approx. 1 million words (CorDiale), covering 150 measurement points. The Norwegian groups will also attempt to include existing Norwegian dialect material in the archives to the extent possible and feasible. The availability of Icelandic and Faroese material of this kind will be described below.

2. The Icelandic (and Faroese) project

2.1 Collection of new data in Iceland and the Faroes
The topics that are of particular interest for the Icelandic (and Faroese) groups are to some extent determined by properties of the language and the particular linguistic situation. The points that need to be taken into consideration include the following (Faroese is included in many instances below because of the close connection between the Icelandic and Faroese groups and because of the special comparative interest that Faroese has for Icelandic):

Icelandic and Faroese have richer inflectional morphology than the Mainland Scandinavian languages. Hence constructions involving morphosyntactic phenomena, such as case and agreement for instance, might be of special comparative interest here.
Icelandic has preserved various syntactic features that have disappeared in the Mainland Scandinavian languages. Faroese often occupies a middle ground in this respect, making comparison between Icelandic, Faroese and Mainland Scandinavian especially interesting. This is true of subject case marking and stylistic fronting, for instance.
Although there are some known dialectal differences in Icelandic and Faroese, they are relatively minor and people are in general not aware of dialectal differences in syntax. This does not mean that they do not exist, however, and we fully expect to discover some that have hitherto been unknown.
To the extent that syntactic variation is known in Icelandic (and Faroese), or has been studied, it seems to be connected to age groups and social variables rather than to particular geographical areas.
The linguistic communities in Iceland and the Faroes are much smaller than those of the Mainland Scandinavian countries. While there are some 9 million speakers of Swedish and 4.5 million speakers of Norwegian, for instance, there are fewer than 300,000 speakers of Icelandic and fewer than 50,000 speakers of Faroese.
Although there are certain similarities between the (official) language policies in Iceland and the Faroes, e.g. with respect to emphasis on the creation of new words and opposition to (English) loans, certain syntactic variants have been frowned upon in Iceland but not in the Faroes. In general it seems that there is a greater tolerance with respect to linguistic changes that cannot be traced directly to foreign influence in the Faroes than there is in Iceland. This makes certain predictions about the relationship of certain variants to social class in Iceland on the one hand and in the Faroes on the other (cf. below).
The difference between "dialect" and "standard language" does not really exist in Iceland and the Faroes to the extent that it does in most countries. People do not switch between "speaking dialect" and "speaking the standard language" to the extent that they do in most other countries. This is important because it facilitates data elicitation to some extent: The investigator does in general not have to worry about not speaking the same syntactic dialect as the informant – or speaking the standard language as opposed to some dialect.
It is generally assumed, on the other hand, that there is considerable difference between "spoken language" and "written language", or between different types or styles of written language, or different genres of texts, although systematic investigation of these differences is just beginning.

Based on the linguistic situation described above, and on (preliminary) results of the pilot study described, it is likely that the syntactic constructions that will be of special interest for the Icelandic (and Faroese) group in this connection will include some of the ones listed below. Most of these are constructions that we have reason to believe show variation within Icelandic (and Faroese), but some are mainly included since they may be of particular comparative interest for linguists elsewhere in Scandinavia. First we list some constructions where previous research has indicated interesting variation that can profitably be studied with the aid of written questionnaires, at least to some extent:

1.    Subject case (including the change from oblique to nominative subject, and from accusative to dative ...)
2.    The "new impersonal" construction (also known as "the new passive")
3.    Extended progressive aspect (extension to new semantic classes of verbs; possibly involving some change in the semantics of the construction itself)
4.    Long distance reflexives and their relation to the subjunctive.
5.    Tense and mood in embedded clauses.
6.    Agreement with nominative objects.
7.    Possessive constructions and the structure of the extended NP.
8.    Loss of case in topicalization structures.
9.    Object case.
10.    Complex pronominal constructions (each other ...)
11.    Expletive constructions
12.    Stylistic fronting.
13.    Impersonal verbs in control constructions.
14.    Tough-movement.

Results of a pilot study indicate, however, that constructions like the following are judged differently (usually more positively) in oral interviews than in written responses to questionnaires (partly because here factors like stress and intonation play a role):

15.    Position of adverbs in embedded clauses.
16.    Complementizer deletion.

In addition, it seems that certain constructions that have been frowned upon in schools need to be studied in oral interviews, at least when adult subjects are involved. These include #1 and #2 above.

2.2 The use of available databases and corpora in Iceland
Various projects involving databases, text collections and corpora will be connected to or integrated into the present one in Iceland. This is a part of the general plan for ScanDiaSyn and the situation in Iceland makes this even more feasible than in many other places.

First, in the Icelandic spoken language project, ÍSTAL, a corpus of spoken Icelandic was established, based on some 15 hours of spontaneous natural conversations (31 conversations in all). The material has been transcribed, using conventional orthography and including various symbols for marking conversational features such as hesitations, repetitions, overlapping, interruptions, etc. Methods developed in work on the Swedish Spoken Language Corpus in Gothenburg and the British National Corpus were employed in the transcription to the extent feasible. The corpus has already proved its usefulness in several areas of research and teaching as it contains information on aspects of spoken Icelandic never recorded before. The corpus, or at least parts of it, could obviously be profitably included in the planned database of ScanDiaSyn, as the inclusion of similar databases in Scandinavia is also planned. Before this can be done, however, it needs to be tagged. An Icelandic tagger has been developed (a project supported by the Language Technology Program of the Icelandic Ministry of Education) and it could in principle be used on this material. Since the tagger has been trained exclusively on various kinds of "written" (as opposed to "transcribed spoken") Icelandic, it would have to be retrained on this kind of material. That would in itself be a valuable addition to the tagger project and at the same time this training, or the mistakes that the tagger would make when applied to this corpus, would give important information about the differences between the written and spoken variants of Icelandic. In addition, it is necessary to remove various kinds of personal and sensitive information from the corpus. Finally, the corpus has to be integrated into the system which is being developed at Tekstlaboratoriet in Oslo in connection with ScanDiaSyn.

Second, Finnur Friðriksson, lecturer at the University of Akureyri, has recorded some 30 hours of spontaneous spoken Icelandic (2-4 participants in each conversation, conversations from 9 different places in Iceland, 12 subjects from each place). Finnur has collected this material in connection with his dissertation project at the University of Gothenburg. In his dissertation he has been studying the distribution and frequency of various (recent or famous) syntactic phenomena in Icelandic, including the so-called Dative Sickness (change from accusative to dative case on the subject of certain verbs) and the New Impersonal Construction (or New Passive), cf. items 1 and 2 on the list in 2.1 above. He is a member of the Icelandic ScanDiaSyn group and is willing to have some of his corpus included in the ScanDiaSyn database if his informants agree. Before that can be done, the corpus has to be scanned for sensitive or personal information, tagged etc. It would obviously be an important addition to the database – and the application of the Icelandic tagger to this kind of material could also give important information about the characteristics of spoken vs. written Icelandic.

Third, work is under way on the creation of a large tagged corpus of Icelandic texts of different kinds, ranging from various kinds of books, newspapers, journals and reports to texts from the Internet. The work on this project will be supported by the Language Technology Program of the Icelandic Ministry of Education and it is being hosted by Orðabók Háskólans (The Icelandic Dictionary Project). As there are undoubtedly interesting syntactic differences between spoken and written Icelandic, as well as between different types of texts, access to this corpus will be an important asset to the project at hand. The tagger that has been developed will mark part of speech, case, number, gender, tense, mood, etc. But if the corpus is to be really useful for a syntactic project like the present one, it would have to be parsed syntactically, giving information about subject, object, verb phrase, etc. And this brings us to the fourth project.

Fourth, a syntactic parser is being developed by the private company Friðrik Skúlason. That project is also being supported by the Language Technology Program of the Icelandic Ministry of Education. Until now this parser has been trained almost exclusively on typical newspaper material and it would be very important for the project to get the opportunity to try it out on various kinds of (tagged) texts. Applying this parser to the different kinds of texts described above would at the same time yield interesting information about the differences between various text types, and even between typical written language and transcribed spoken language, since the parser will almost certainly yield interesting but wrong results when applied to a transcribed corpus of spontaneous speech. Hence Friðrik Skúlason is connected to the present project (cf. below), as this promises to be a symbiotic relationship.
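The difference between morphological tagging and syntactic parsing described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Token` class, the tag labels and the search pattern are hypothetical and do not reflect the actual tagset or output format of the Icelandic tagger or the parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    """A morphologically tagged word (hypothetical, simplified tagset)."""
    form: str
    pos: str                    # e.g. "pron", "noun", "verb", "prep"
    case: Optional[str] = None  # "nom", "acc", "dat" or "gen" for nominals
    finite: bool = False        # True for finite verbs


def dative_before_finite_verb(sentence: list[Token]) -> bool:
    """True if a dative nominal is immediately followed by a finite verb."""
    for left, right in zip(sentence, sentence[1:]):
        if (left.pos in {"pron", "noun"} and left.case == "dat"
                and right.pos == "verb" and right.finite):
            return True
    return False


# "Mér langar í ís": dative subject (a Dative Sickness variant).
dative_subject = [Token("Mér", "pron", "dat"), Token("langar", "verb", finite=True),
                  Token("í", "prep"), Token("ís", "noun", "acc")]

# "Honum hjálpaði ég": fronted dative object, nominative subject after the verb.
fronted_object = [Token("Honum", "pron", "dat"), Token("hjálpaði", "verb", finite=True),
                  Token("ég", "pron", "nom")]
```

Both example sentences match the tag-based pattern, although only the first has a dative subject. A search over tags alone therefore overgenerates, and grammatical-function information of the kind a parser supplies is needed to separate the two.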

As should be clear from this, the present project brings together investigators from various areas in Iceland. It attempts to make better use of their expertise and the resources that they have developed, or are developing, by creating a large umbrella project for them to cooperate in and share their work and ideas. Hopefully, the results from the syntactic investigations will shed some light on the syntactic nature of different texts and thus contribute to the improvement of the corpora. Conversely, syntactic data discovered in the different databases and corpora will undoubtedly raise new syntactic questions which can then be investigated further by using questionnaires and interviews.

2.3 The partial inclusion of Faroese
The numerous references to Faroese above are explained by the fact that the "Faroese group" of the ScanDiaSyn project includes some Icelandic researchers who have been working on Faroese in the past, namely Höskuldur Þráinsson, Jóhannes Gísli Jónsson and Þórhallur Eyþórsson. Part of the reason is that comparison with Faroese provides an excellent testing ground because of the similarities between the two languages. In addition, few research funds are available to Faroese linguists, so cooperation with linguists abroad is always welcomed by them. Hence some comparative research on selected topics in Faroese is planned as a part of this Icelandic project (see 17 below), but it is also hoped that other members of the Faroese group will be successful in securing some research funds to facilitate the inclusion of Faroese into the ScanDiaSyn project as a whole. No work on existing databases in the Faroes is planned as a part of the present project, for instance, but samples of Faroese texts that were scanned in connection with a previous project will be cleaned up and made accessible on the Internet as a part of the present project (cf. 17 below).

3. Summary – and some hypotheses

The objectives of this project can then be summed up as follows:

    1.    To collect new data on syntactic constructions in Icelandic in a systematic fashion with the guidelines established by the ScanDiaSyn project in mind. The Icelandic project has thus an important comparative feature with special emphasis on Faroese.
    2.    To develop further and make new use of various resources that have been created by previous and ongoing research projects, including databases of spontaneous spoken Icelandic, a tagged corpus based on a large variety of texts, and a syntactic parser. The leading idea is that the cooperation between the different researchers involved and the symbiotic relationship between the projects in question should lead to an improvement of the resources in question (the databases/corpora/parser ...) and thus make them even more useful and usable in the future.
    3.    To build on previous research on syntactic variation in Icelandic (and Faroese) and thus add to the knowledge already established.
    4.    To contribute to international cooperation between linguists and researchers in related or connected fields.

Because of the complexity of the project and the different types of researchers involved it is not simple to formulate testable research hypotheses for the project in general. They will vary to some extent from one researcher to another, depending on their theoretical persuasions and the type of research they are mostly interested in. The syntacticians will thus be interested in syntactic characteristics of the variation, which kinds of variants go together, what kind of variation (and change) should be possible within the framework of Universal Grammar, which characteristics are likely to get lost and why, when language is passed on from one generation to another, etc. In addition, variation is of interest to theoretical linguists in and of itself since it has often been claimed that there is no such thing as "free variation" of syntactic variants (or linguistic variants in general – see e.g. Höskuldur Þráinsson 2003 and references cited there). The sociolinguists will be interested in the ways that the variants can be linked to sociological differences, including variation between male and female subjects. Still others will be interested in trying to characterize the differences between (different kinds of) written language on the one hand and (transcribed) spontaneous spoken language on the other.

Given the interests, persuasions and frameworks of the researchers involved, we can only give a couple of examples of the kinds of hypotheses that can be formulated and tested (and they can, of course, be either true or false!):

    1.    There is no geographically "conditioned" variation in Icelandic syntax nor in Faroese syntax. When there appears to be geographically conditioned variation, there is always some other explanation behind it, such as a sociolinguistic or sociological one (e.g., differences w.r.t. education or class). (The "New Impersonal" construction in Icelandic might provide an interesting test case.)

    2.    There is no "free" variation between syntactic variants – two variants are never completely equivalent. (Variation in embedded clause word order in Faroese might be a case in point here.)

    3.    There is considerable syntactic variation between the speech of younger and older speakers of Icelandic and Faroese and this variation represents "ongoing changes", i.e. changes that are spreading through the linguistic community. The direction of some of these changes can be predicted on structural grounds and thus we expect the development to be parallel in both languages, although the speed of the spreading may vary. A case in point (no pun intended) would be changes in the case marking of subjects and objects (Nominative Substitution, Dative Substitution – cf. e.g. Jóhannes Gísli Jónsson 2003, Jóhannes Gísli Jónsson and Þórhallur Eyþórsson 2003a,b). Another predictable development could be the relationship between long distance reflexivization and mood in Icelandic: While there is (as far as we know today) a clear relationship between long distance reflexives and subjunctive in the speech of most speakers of Icelandic, we might expect this to change in such a way that some speakers might be able to use long distance reflexives in subjunctive AND indicative clauses, but we would not expect any speakers to be able to use long distance reflexives exclusively in indicative clauses.

    4.    Linguistic variation typically stems from changes that occur when language is passed on from one generation to the next. Hence we do not expect changes to "start out" among the older generations or innovations to be more common in the speech of the older generations. (Comparison of the variation in subject case and the variation found in "the extended progressive" in Icelandic might yield different results here.)

    5.     Socially conditioned variation in syntax will be found in Icelandic and Faroese to the extent that the variants in question have been stigmatized or are considered "bad" by (influential elements in) the linguistic community and hence fought against or corrected in the schools. Thus we expect to find socially conditioned variation in the use of subject case in Icelandic (i.e. with respect to Dative Sickness) but not in Faroese to the same extent since the development has not really caught the attention of the language preservers ("Dative Sickness" is not considered an epidemic in the Faroes).
This should suffice to give an idea of some of the kinds of hypotheses that can be formulated. By and large, the formulation will be left up to the individual researchers involved.
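Hypothesis 3 above contains an implicational prediction that is easy to state as a check on elicited judgments: a speaker may accept long distance reflexives in subjunctive clauses only, in both moods, or in neither, but not in indicative clauses only. The following is a minimal sketch of such a check. The record layout and the sample data are invented for illustration and do not come from the pilot study.

```python
from typing import NamedTuple


class LdrJudgment(NamedTuple):
    """One speaker's acceptance of long distance reflexives (hypothetical layout)."""
    speaker: str
    accepts_in_subjunctive: bool
    accepts_in_indicative: bool


def counterexamples(judgments: list[LdrJudgment]) -> list[str]:
    """Speakers who accept the reflexive in indicative clauses only,
    the one pattern that the hypothesis rules out."""
    return [j.speaker for j in judgments
            if j.accepts_in_indicative and not j.accepts_in_subjunctive]


# Invented sample data covering all four logically possible patterns.
sample = [
    LdrJudgment("A", True, False),   # subjunctive only: expected pattern
    LdrJudgment("B", True, True),    # both moods: predicted innovation
    LdrJudgment("C", False, False),  # neither: compatible with the hypothesis
    LdrJudgment("D", False, True),   # indicative only: would falsify it
]
```

Here `counterexamples(sample)` returns only speaker "D". In real questionnaire data, a non-empty result would count against the hypothesis, subject to the usual caveats about the reliability of individual judgments.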

References
Bard, E.G., Robertson, D. and Antonella Sorace. 1996. Magnitude estimation of linguistic acceptability. Language 72: 32-68.
Cornips, Leonie. 2000. Spontaneous Speech Data Compared to Elicitation Data: The Test Effects. A paper presented at a workshop on Syntactic Microvariation, Meertens Institute, August 30-31, 2000. (Accessible at: http://www.meertens.knaw.nl/projecten/sand/sandworkshop/cornips.html ).
Cowart, Wayne. 1997. Experimental Syntax. Applying Objective Methods to Sentence Judgments. Sage Publications, Thousand Oaks.
Cornips, Leonie, and Cecilia Poletto. 2004. On standardising syntactic elicitation techniques (part 1). Lingua.
Höskuldur Þráinsson. 2003. Syntactic Variation, Historical Development, and Minimalism. In Randall Hendrick (ed.): Minimalist Syntax, pp. 152–191. Blackwell, Oxford.
Jóhannes Gísli Jónsson. 2003. Not so Quirky: On Subject Case in Icelandic. In Ellen Brandner and Heike Zinsmeister (eds.): New Perspectives on Case Theory, pp. 127-163. CSLI Publications, Stanford.
Jóhannes Gísli Jónsson and Þórhallur Eyþórsson. 2003a. The Case of Subject in Faroese. Working Papers in Scandinavian Syntax 72:207-231.
Jóhannes Gísli Jónsson and Þórhallur Eyþórsson. 2003b. Breytingar á frumlagsfalli í íslensku. [Changes in Subject Case in Icelandic.] Íslenskt mál og almenn málfræði 25:7-40.
Rickford, John. 1987. The Haves and Have Nots: Sociolinguistic Surveys and the Assessment of Speaker Competence. Language in Society 16: 149-177.
Schütze, Carson T. 1996. The Empirical Base of Linguistics. Grammaticality judgments and linguistic methodology. Chicago: The University of Chicago Press.