1 Much of language production is concerned with referring back to entities that were introduced at some earlier point in the ongoing discourse. Languages provide a wide variety of referential expressions for this purpose. These expressions can be ordered on a scale of explicitness, ranging from fully reduced pronouns (null pronouns) to full lexical noun phrases (NPs). A simplified version of this referential scale is shown in [1] (see Ariel [2001] for the complete scale).

[1] The referential form scale
null pronoun > pronoun > demonstrative > full noun phrase

2 A large body of cross-linguistic research (for an overview, see Arnold, 2010) has shown that the position of an expression on the referential form scale correlates with discourse properties of its antecedent. As a first approximation, this correlation can be stated as follows: reduced referential forms refer to entities that are highly salient at the current point of the discourse whereas more explicit devices are used when referring to entities that are currently not salient. Intuitively, such a correlation makes sense. After all, when a referent is salient at the current point of discourse, it will be in a highly activated state in working memory and a reduced or even null expression will suffice to refer to this referent. Conversely, a referent that has not been mentioned recently will not be in an active state and more explicit means will be needed to refer to such a referent.

3 Several theoretical approaches build on the idea of a close relationship between referential form and memory states (Givón, 1992; Gundel et al., 1993; Ariel, 1990; Grosz et al., 1995). In the following, we will concentrate on accessibility theory (Ariel, 1990 and 2001), as this is the most comprehensive theory. The major thrust of accessibility theory has been stated succinctly by Ariel (2001: 29):

Accessibility theory offers a procedural analysis of referring expressions, as marking varying degrees of mental accessibility. The basic idea is that referring expressions instruct the addressee to retrieve a certain piece of given information from his memory by indicating to him how accessible this piece of information is to him at the current stage of the discourse.

4 Arnold (2010) lists three properties of a referent that contribute to its accessibility: recency, givenness, and syntactic prominence. Recency refers to the number of sentences that have been produced since the last mention of the referent. Referents that have been mentioned recently are more accessible than referents that have not been mentioned recently. Givenness captures whether a referent was already mentioned before or not. Given referents are more accessible than new referents. Syntactic prominence concerns sentence internal properties of an antecedent. Two different properties can increase a referent’s syntactic prominence and thereby make it more accessible. First, a subject is more prominent than an object. Second, a sentence initial NP is more prominent than a sentence final NP.

5 Another line of research has stressed the importance of world knowledge and coherence relations for the interpretation of pronouns (see Kehler & Rohde, 2013, for a review of this research tradition). Kehler and Rohde (2013) propose a Bayesian model of pronoun interpretation that combines the accessibility and the coherence account in a probabilistic way. In line with other researchers (e.g., Fukumura & Van Gompel, 2010), Kehler and Rohde (2013) review evidence suggesting that in contrast to pronoun interpretation, the choice of a referential form depends on accessibility alone. Since we are concerned with production in this paper, we concentrate on accessibility, but we come back to the issue of coherence relations when presenting our experimental results.

6 Of special interest for the notion of accessibility are referential expressions that differ in form but not in lexical content. In this paper, we focus on the difference between two types of pronouns in German - personal pronouns like er (“he”) and so-called d-pronouns like der (lit. “the”). D-pronouns are often analyzed as a kind of demonstrative pronoun. Alternatively, they have been claimed to be a variant of the personal pronoun (for further discussion, see Ahrenholz, 2007) or definite determiner phrases with an empty NP (Wiltschko, 1998).

7 This paper presents a corpus study and an experiment that investigate the factors determining the choice between personal pronoun (p-pronoun for short) and d-pronoun during written language production. The organization of this paper is as follows. The next section gives an overview of prior research concerned with p- and d-pronouns in German. The corpus study is presented in Section 3. Section 4 presents the production experiment. Section 5 discusses the results presented in Sections 3 and 4 with regard to factors going beyond accessibility. The paper concludes with a final discussion in Section 6.

8 A prototypical example of the use of personal pronouns is given in [2]. In this example, the pronoun’s antecedent is highly accessible in the sense discussed above: it is already given in the discourse, it occurs in sentence-initial position, and it is a subject2.

[2][C -2] Just as the shadow of a tree with large, broad leaves was disappearing, emerged from it Senragor Allagan up and grinned eerily at his half-uncle. [C -1] boy had black, tangled, shoulder-length hair, pale skin and dark but penetrating eyes. [T] He[p-pronoun] was only nine, but still astonishingly tall and well developed for his age.
‘[C -2] Just when the shadow of a tree with large, wide leaves vanished, Senragor Allagan appeared and grinned scarily at his half-uncle. [C -1] The boy had black, fuzzy, shoulder-length hair, a pale skin and dark but shrewd eyes. [T] He[p-pronoun] was just nine, but surprisingly large and well developed for his age.’
(corpus = “DeWaC-9” text = “788291” id = “”)

9 A prototypical example of the use of d-pronouns is given in [3]. In this case, the pronoun’s antecedent is discourse-new, occurs sentence-finally, and is an object.

[3][C -3] Deutsche Bahn AG invites you to the ceremony. [C -2] The AG is ten years old, and the ballroom at the Ritz-Carlton is just good enough for the congratulatory tour. [C -1] The keynote speaker is Hartmut Mehdorn the Chancellor, who is on friendly terms with him involved. [T] The one does not have to be a prophet, will praise the wisdom of the legislature who abolished the authority railway on January 1st 1994 and created the commercial enterprise railway.
‘[C -3] The Deutsche Bahn AG (German Railway Company) is inviting to celebrate. [C -2] The company will be 10 years old and the ball room of the Ritz-Carlton is just sufficient for congratulations. [C -1] The head of the company Hartmut Mehdorn employed the Federal Chancellor that he is on cordial terms with. [T] He[d-pronoun] will - you do not have to be a prophet - praise the wisdom of the legislature, who abolished the state-run enterprise and created the business corporation Bahn on 1st January 1994.’
(corpus = “DeWaC-8” text = “678968” id = “”)

10 With regard to lexical content, the p-pronoun er in [2] and the d-pronoun der in [3] do not differ from each other. Both are specified for the features “masculine” and “singular” 3. In the linguistic literature (Abraham, 2002; Wiemer, 1996; Zifonun et al., 1997), several of the properties discussed above - givenness, syntactic function and linear position - have been considered as candidates for differentiating between p- and d-pronouns. A certain consensus emerging from this literature is that the main functions of p- and d-pronouns must be stated in information-structural terms. Broadly speaking, p-pronouns serve the function of topic continuation whereas d-pronouns signal a topic shift. More recently, the interpretation of p- and d-pronouns has been the subject of several experimental investigations, all concerned with the question of how pronouns are interpreted when the context contains more than a single potential antecedent (Bosch & Umbach, 2007; Bouma & Hopp, 2007; Colonna et al., 2012; Ellert, 2013). As the overview in Ellert (2013: 6) shows, most studies have found that a p-pronoun preferentially takes a subject NP as antecedent, independently of its discourse status or clausal position, whereas the preferred antecedent of a d-pronoun is a discourse-new NP in clause-final position, independently of the NP’s syntactic function. For purposes of illustration, consider the following examples from Ellert (2013).

[4]a.The cabinet is heavier than the table. / The cupboard is heavier than the table.
The cupboard is heavier than the table. / Heavier than the table is the cupboard.
b.He / He comes from a furniture store in Belgium.
{P-pronoun / D-pronoun} comes from a furniture store in Belgium.
‘The cupboard is heavier than the table. It comes from a furniture store in Belgium. ’

11 The context sentence in [4a] is either a subject-initial or a subject-final sentence. The context sentence is followed by the target sentence [4b], which starts either with a p-pronoun or a d-pronoun. In two experiments, Ellert (2013) measured eye-movements while participants simultaneously listened to sentence pairs as in [4] and looked at pictures of the two referents mentioned in the context sentence. The results show a preference for the subject referent when hearing the p-pronoun er. When hearing the d-pronoun der, in contrast, a preference for the clause-final referent was observed.

12 Due to the lack of a preceding context, the discourse status of the two NPs in [4] was not explicitly specified. Ellert (2013) nevertheless interprets her results in terms of information structure and not in terms of surface properties like syntactic function or linear position within the sentence. This interpretation hinges on default associations between syntactic structure and information structure according to which the sentence topic is preferentially realized by the clause-initial NP whereas the sentence focus typically occurs clause finally. Given these default associations, the results can be rephrased as follows. For canonical subject-initial context sentences, the p-pronoun preferentially refers to the topic and the d-pronoun to the focus. For non-canonical subject-final sentences, in contrast, both p- and d-pronoun prefer the focus NP as antecedent.

13 Based on a broad survey of the existing evidence, Bosch (2013: 42) arrives at the generalization in [5], where “DPro” stands for d-pronoun and “PPro” for p-pronoun (see also Hinterwimmer, 2014).

[5] a. In contexts that provide only one grammatically suitable referent for the pronoun, DPro and PPro occur in free variation, and without any semantic difference.
b. Whenever a DPro must choose among several grammatically suitable referents, it avoids the current topic.

14 This generalization seems to capture the interpretation of d-pronouns quite accurately, but some problems remain. Most experiments on German are similar to the experiments of Ellert (2013) in that they do not provide enough context to unambiguously determine the discourse status of the potential antecedents provided in the context sentence. It is therefore necessary to assume that participants compute a particular information structure for the context sentences by default. Even if this assumption were correct, the interpretation of the observed preferences in terms of information structure is by no means obligatory. As information structure and linear position are confounded, the preferred interpretation of the d-pronoun could as well be stated in linear terms - the d-pronoun preferentially refers to the clause-final NP.

15 A study that provided enough context for controlling the discourse status of the potential antecedent NPs is the study of Finnish p- and d-pronouns by Kaiser and Trueswell (2008). However, even in this study information structure and linear position were confounded. Kaiser and Trueswell (2008) obtained the same pattern that was found for German - a subject preference for the Finnish p-pronoun and a preference for the final, new NP for the Finnish d-pronoun. Since given NPs always occurred sentence-initially and new NPs sentence-finally in their experiment, it is not possible to decide whether the preference observed for the d-pronoun is an effect of givenness or an effect of position.

16 In order to provide new evidence on this issue, Bader and Portele (2015) closely followed the experimental design of Kaiser and Trueswell (2008), but went beyond this study by also varying the linear position of the given and the new referent. In the configuration used by all prior studies (clause-initial given referent, clause-final new referent), Bader and Portele (2015) found again that p-pronouns prefer subjects as antecedents whereas d-pronouns preferentially refer back to referents that are new and occur in clause-final position. In order to disentangle givenness and position, Bader and Portele (2015) investigated short texts as in [6]. Here, the discourse-given NP “the clown” occurs clause-finally whereas the discourse-new NP “a man” occurs clause-initially.

[6]Maria was at the circus on Sunday. She saw before the performance a clown walking around. A man hugged the clown. He Has…/The Has…
‘Maria visited a circus on Sunday. Before the show, she saw a clown walking around. A man hugged the clown. He (p-pronoun / d-pronoun) has ... ’

17 Sentence fragments starting with a p-pronoun showed again a subject preference. Sentence fragments with a d-pronoun were most of the time completed in such a way that the d-pronoun was co-referential with the NP “the clown”. Since the referent of this NP has already been introduced in the sentence before by the indefinite NP “a clown”, this means that the d-pronoun prefers a given NP as antecedent. Under the assumption that the referent of the given NP “the clown” in the final context sentence in [6] is the sentence topic, this constitutes an exception to the hypothesis that a d-pronoun preferentially refers to a non-topic. We will not discuss this issue at this point, but come back to it in the general discussion.

18 To sum up so far, experimental investigations of the interpretation of p- and d-pronouns converge on two conclusions. First, p-pronouns preferentially refer to an antecedent in subject position, independently of the givenness and the clausal position of the subject. Second, by itself, neither givenness, nor syntactic function, nor clausal position can account for all of the interpretative preferences found for d-pronouns. From the standpoint of accessibility theory, this is no surprise because a central assumption of this theory is that accessibility is a complex property that cannot be defined in terms of a single feature. In accordance with this assumption, Bader and Portele (2015) propose that d-pronouns prefer as antecedent the NP which is least accessible, where accessibility is defined at least in terms of the givenness, the syntactic function and the clausal position of the competing antecedent NPs. For example, in a sentence with subject-object (SO) order, the subject is always more accessible than the object because it is favored by two of the three defining properties - syntactic function and clausal position. Thus, even when the object is given, as in [6], it is still less accessible than the subject, and thus the d-pronoun prefers a given antecedent in this case.

19 One additional factor that has to be taken into account is the referential form of the competing antecedents. As has been pointed out in the literature (Bosch & Umbach, 2007), when a clause-final object is a pronoun itself, the d-pronoun prefers to refer to the clause-initial subject referent, contrary to its usual preference. An authentic example of this kind is provided in [7].

[7][C -2] Klaus went on. [C -1] And as he had walked a distance, came a guy on him to. [T] The not only looked like the devil, but he it was too.
‘[C -2] Klaus went ahead. [C -1] After walking a while, some guy approached him. [T] He[d-pronoun] did not only look like the devil, he[p-pronoun] was the devil.’
(corpus = “DeWaC-6” text = “494949” id = “”)

20 The pronoun ihn (“him”) in the last context sentence is given (increasing its accessibility), an object (decreasing its accessibility) and in final position (also decreasing its accessibility). This pronoun should therefore be less accessible than the indefinite NP “a guy”, which has only one feature that decreases accessibility (it is new), but two features that increase accessibility (it is a subject in clause-initial position). If d-pronouns always referred to the less accessible of two potential antecedents, a d-pronoun should not have been used here. The fact that a d-pronoun is nevertheless used to refer back to “a guy” thus indicates that pronouns are inherently more accessible than lexical NPs4.

21 From the perspective of language production, the findings from language interpretation raise a range of interesting questions. First of all, a speaker has to make a choice concerning the linguistic form of a referential expression whether there is an ambiguity or not. Most of the existing literature has been concerned with the interpretation of pronouns and has therefore investigated examples that contain two potential antecedents matching the pronoun in morpho-syntactic features. In this case, a relative decision rule can be used. The accessibility of each potential antecedent is determined, and the one with the highest or lowest accessibility value is chosen as antecedent, depending on the particular pronoun. When only a single potential antecedent is available, reference is not ambiguous and therefore no choice must be made.

22 During language production, however, a choice between p- and d-pronoun is necessary whether a competing antecedent is available or not. As in language interpretation, relative accessibility may be decisive when the context contains a competing referent, but relative accessibility will be of no help when there is no competing referent. According to Bosch’s generalization given in [5], in this case p- and d-pronouns are in free variation, and which of the two is used makes no semantic difference. Unless there is a random choice of pronoun form in contexts lacking competing referents, the speaker needs an absolute decision rule, that is, a decision rule that only considers the properties of the single referent under consideration. Such a decision rule could specify some kind of accessibility threshold. When the accessibility of the referent is above this threshold, the p-pronoun is used; when it is below this threshold, the d-pronoun is used. Given that in certain examples both p- and d-pronouns seem to be acceptable, the threshold must be variable or probabilistic in some way. In the current context, the major question is whether a single notion of accessibility can be found that accounts for the choice between p- and d-pronoun in the absence as well as in the presence of a competing referent.
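The contrast between a relative and an absolute decision rule can be made concrete with a small sketch. Everything in it is an illustrative assumption rather than a model proposed in the literature discussed here: the equal weighting of the three features, the tie-breaking in favor of the p-pronoun, and the logistic form and parameter values (`threshold`, `slope`) of the soft threshold are all hypothetical. The sketch also ignores the referential form of the antecedent.

```python
import math
from dataclasses import dataclass


@dataclass
class Referent:
    given: bool    # referent was mentioned earlier in the discourse
    subject: bool  # antecedent NP is a subject
    initial: bool  # antecedent NP is clause-initial


def accessibility(r: Referent) -> int:
    """Toy score: number of accessibility-increasing features (0 to 3).
    Equal weights are an assumption made for illustration only."""
    return int(r.given) + int(r.subject) + int(r.initial)


def relative_rule(target: Referent, competitor: Referent) -> str:
    """Relative rule: use a d-pronoun only if the target is less
    accessible than its competitor (ties go to the p-pronoun)."""
    if accessibility(target) < accessibility(competitor):
        return "d-pronoun"
    return "p-pronoun"


def p_pronoun_probability(r: Referent, threshold: float = 1.5,
                          slope: float = 2.0) -> float:
    """Absolute rule with a probabilistic threshold: the probability of
    choosing a p-pronoun rises smoothly with the accessibility score."""
    return 1.0 / (1.0 + math.exp(-slope * (accessibility(r) - threshold)))


# Configuration of [6]: given object in final position vs. new initial subject
clown = Referent(given=True, subject=False, initial=False)
man = Referent(given=False, subject=True, initial=True)
print(relative_rule(clown, man))

# Single-referent contexts modelled on [2] and [3]
highly_accessible = Referent(given=True, subject=True, initial=True)
barely_accessible = Referent(given=False, subject=False, initial=False)
print(p_pronoun_probability(highly_accessible))
print(p_pronoun_probability(barely_accessible))
```

Under these assumptions the relative rule selects the d-pronoun for the given, clause-final object in the configuration of [6], and the absolute rule never makes a categorical choice, which is one way to accommodate contexts in which both pronoun types are acceptable.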

23 A further question relates to the finding that during language comprehension p-pronouns and d-pronouns seem to be differentially sensitive to the various dimensions defining accessibility. As discussed above, p-pronouns prefer antecedents that have the syntactic function of subject whereas d-pronouns prefer the least accessible referent as antecedent. When comprehending language, the hearer or reader knows which pronoun to interpret. Because the pronoun is provided explicitly as part of the input, item-specific preferences of different pronouns can be retrieved from the mental lexicon and be applied to the task of interpretation. When producing language, in contrast, the speaker or writer must choose a pronoun based on the given state of the referent in the current discourse. Taking item-specific preferences into account when making this choice is thus not as straightforward as it is for language comprehension, because the preferences of several possible referential expressions have to be considered simultaneously. Furthermore, the particular preferences found for interpretation can easily lead to a tie. For example, in order to help the hearer, a speaker who wants to refer back to the subject in an object-before-subject sentence could either use a p-pronoun (because p-pronouns prefer a subject antecedent) or a d-pronoun (because d-pronouns prefer a sentence final antecedent). The question then is how the speaker nevertheless comes to a decision.

24 Because of its focus on pronoun interpretation, the existing literature does not provide much information on these questions. We know of only two studies that have addressed the choice between p-pronoun and d-pronoun during language production. Bosch et al. (2003) present a corpus study based on the “Negra” corpus, a corpus of German newspaper texts. They found 1,436 p-pronouns and 180 d-pronouns. For p-pronouns with an antecedent in the immediately preceding clause, the antecedent was a subject in 86.7% of all cases and a non-subject in the remaining 13.2% of cases. This confirms the strong subject orientation of p-pronouns. For d-pronouns, in contrast, an object bias was found. With 76.4% non-subject and 23.6% subject antecedents, the object bias for d-pronouns was somewhat weaker than the subject bias for p-pronouns. Since Bosch et al. (2003) did not look at other features of the antecedent, their study leaves open whether properties other than the antecedent’s syntactic function influence the choice between p- and d-pronoun, or even make reference to the antecedent’s syntactic function superfluous.

25 Bittner and Dery (2015) had participants narrate short picture stories and found different preferences for German p- and d-pronouns in terms of discourse coherence. In situations not involving anaphoric disambiguation, p-pronouns are used to background the referent whereas d-pronouns serve a forward orientation in the discourse. The authors claim that the two types of pronouns can be described in different terms based on the salience and/or activation of their referents: whereas p-pronouns are chosen to refer to salient and activated referents, d-pronouns are chosen to refer to referents that need to be strengthened in terms of salience/activation. For situations in which pronouns are used for anaphoric disambiguation, however, Bittner and Dery (2015: 67) found that the choice between p- and d-pronouns may not be describable in terms of the information status of the referent in the ongoing discourse.

26 In order to broaden the empirical basis with regard to the choice between p- and d-pronouns during language production, we conducted a corpus study which was backed up by a production experiment testing the influence of givenness and syntactic prominence. Both the corpus study and the production experiment are confined to written language. Possible limitations resulting from this restriction are discussed below.

27 The corpus analyzed in this paper is the “DeWaC” corpus made available by the University of Bologna (see Baroni et al., 2009). The “DeWaC” corpus is a huge part-of-speech tagged corpus of written German built by web crawling. It contains about 1,600,000,000 tokens of text in approx. 92,000,000 sentences. We first discuss the syntactic construction that we chose for analysis. We then describe how the corpus examples were extracted and prepared for later analysis. We finally present various analyses of the extracted examples.

3.1. Choice of syntactic construction

28 In accordance with the experimental literature on this topic, we restrict our analysis to sentences in which the pronoun occurs clause-initially as the subject within a main clause. This was the case for all the examples considered so far. Excluded from the analysis are thus object pronouns in general and subject pronouns occurring in the so-called middlefield of a German sentence5. An initial screening of about a sixth of the “DeWaC” corpus revealed 149,183 hits for the query “Er + finite verb” and 6,518 hits for the query “Der + finite verb”. Thus, sentences with an initial p-pronoun occurred about 23 times more often than sentences with a d-pronoun. This ratio is almost three times higher than the ratio found by Bosch et al. (2003). There are two main differences between the study of Bosch et al. and the current study6. First, two different corpora were investigated. Second, we only consider sentences in which a subject pronoun occurs clause-initially within a main clause whereas the position of the pronouns was not restricted in the study of Bosch et al. How these differences account for the different ratios between p- and d-pronouns is an open question7.
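The two ratios compared in this paragraph can be recomputed from the counts given in the text. The short check below assumes that the ratio attributed to Bosch et al. (2003) is simply the quotient of their overall counts of 1,436 p-pronouns and 180 d-pronouns; the variable names are ours.

```python
# Counts reported in the text
dewac_p, dewac_d = 149_183, 6_518  # initial screening of the DeWaC corpus
negra_p, negra_d = 1_436, 180      # Bosch et al. (2003), Negra corpus

dewac_ratio = dewac_p / dewac_d    # p-pronouns per d-pronoun in DeWaC
negra_ratio = negra_p / negra_d    # p-pronouns per d-pronoun in Negra
ratio_of_ratios = dewac_ratio / negra_ratio

print(f"DeWaC: {dewac_ratio:.1f} : 1")
print(f"Negra: {negra_ratio:.1f} : 1")
print(f"DeWaC ratio / Negra ratio: {ratio_of_ratios:.2f}")
```

On this reading, the DeWaC ratio of roughly 23 : 1 stands against a Negra ratio of roughly 8 : 1, which matches the statement that the former is almost three times higher.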

29 For the corresponding accusative object pronouns, similar searches revealed 582 hits for Ihn (“him”, p-pronoun) followed by a finite verb and 943 hits for Den (“him”, d-pronoun) followed by a finite verb. Thus, in striking contrast to the case of subject pronouns, for pronouns in the function of a direct object the d-pronoun outnumbers the p-pronoun. The reason for this is the well-known fact that personal pronouns in object function are severely restricted with regard to their occurrence in the prefield of a main clause, whereas d-pronouns are not restricted in the same way (see Lenerz, 1992). Since the syntactic constraints that are responsible for the placement of object pronouns are beyond the scope of the current paper, we only consider subject pronouns in the following.

30 In an additional search, we looked for strings of the form “finite verb + er”. This search string corresponds to sentences in which the subject pronoun er is located within the first position after the finite verb in a verb-second sentence. This position is assumed to be the preferred place for topics in general and personal pronouns in particular (Rambow, 1993; Frey, 2004). There were 174,098 corpus hits for er immediately following the finite verb, which contrasts with the 149,183 hits for er immediately preceding the finite verb. Thus, the subject pronoun er occurred more often after than before the finite verb. However, the frequency difference is only moderate, and in absolute terms, both constructions are of very high frequency.

31 In sum, restricting our analysis to subject pronouns in sentence-initial position allows us to concentrate as far as possible on factors that are immediately relevant for defining accessibility and thus for choosing between a p- and a d-pronoun. Extending this line of research to other cases, for example to sentences with object pronouns, must be left as a task for future research.

3.2. Corpus preparation

32 We first retrieved all sentences beginning either with the p-pronoun er or the d-pronoun der immediately followed by a word tagged as a finite verb. For each sentence, the preceding context was also retrieved, limited to five sentences. This resulted in a total number of 940,779 corpus hits. Of these, 901,486 or 95.8% contained the p-pronoun and 39,293 or 4.2% the d-pronoun, resulting in a ratio of 23:1 as in the subset analyzed above.

33 Because the set of corpus hits was too large to be analyzed completely, we drew a random selection of 500 examples for each pronoun. All examples were checked and erroneous examples were removed from the sample. Most of these were false hits because a word that was not a verb had been tagged as verb in the “DeWaC” corpus. In addition, in some cases the preceding context and the target sentence did not form a coherent discourse, reflecting problems with automatically deriving texts from internet sites. Finally, in five examples the d-pronoun der was feminine, thus acting as a dative object. The final sample contained 465 instances for the p-pronoun and 436 instances for the d-pronoun. This means that the proportions of p- and d-pronoun examples in our sample do not match the proportions in the complete “DeWaC” corpus. We will therefore always report separate percentages for the two pronouns in the following analyses.

34 All instances were inspected and all NPs co-referent with the pronoun were coded by hand. The last co-referential NP will be called the antecedent NP, with one exception as explained below. For ease of exposition, we will use the term antecedent both for the antecedent NP and for the referent of the antecedent NP in the following, unless the context requires the more specific term. In each of the prior examples [2] and [3], the antecedent is the co-referential NP in the immediately preceding sentence. There are two cases where identifying the antecedent is not straightforward. In the first one, a reflexive pronoun is the last co-referential element, as in [8].

[8][C -1] After the bird as far as was satisfied with the nest, turned he himself and hopped closer to the window to look at David. [T] He hopped up and down and twittered loudly and only then did David notice the crooked leg!
‘[C -1] After the bird was satisfied with the nest, he[p-pronoun] turned (himself) around and hopped closer to the window to look at David. [T] He[p-pronoun] leaped up and down twittering loudly and it was not until then that David noticed the injured leg.’
(corpus = “DeWaC-6” text = “500797” id = “”)

35 There were 19 cases of this kind, 15 with a following p-pronoun and 4 with a following d-pronoun. Because in most cases the reflexive was an inherent reflexive, we do not take the reflexive as the antecedent of the following pronoun but the reflexive’s antecedent, er (“he”) in the example above.

36 A related case is illustrated in [9]. Here, the last co-referential NP is the possessive pronoun “his”, which itself is co-referent with the subject NP Döring8.

[9][C -1] Döring started his professional career in administration. [T] He worked on the planning staff of the President of the University of Hamburg and subsequently in the Schleswig-Holstein Ministry of Education.
‘[C -1] Döring started his professional career in administration. [T] He[p-pronoun] among other things took part in the planning staff of the president of the Hamburg University and afterwards worked in the Ministry of Education of Schleswig-Holstein.’
(corpus = “DeWaC-9” text = “801409” id = “”)

37 Here, one may again wonder whether the antecedent of er is the proper name Döring or the intervening possessive pronoun “his”. The issue is somewhat more complicated than in the case of reflexives because the antecedent of the possessive pronoun does not necessarily occur within the same sentence. This was the case in 8 out of the 54 corpus texts where a possessive pronoun was the last expression co-referential with the upcoming sentence-initial pronoun. Of these 54 corpus texts, 48 contained the p-pronoun er and 6 the d-pronoun der. We removed these 54 corpus texts from our sample and present a separate analysis for them after we have presented the analysis of the main corpus. This way, we can let our data decide which NP is the antecedent of the p- or d-pronoun - the possessive pronoun or its antecedent. The main corpus thus contains 417 instances of the p-pronoun and 430 instances of the d-pronoun.
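The sample sizes reported in this section can be traced step by step. The following bookkeeping sketch only recombines figures stated in the text; the dictionary names are ours, and the number of discarded examples per pronoun is derived by subtraction rather than reported.

```python
# All sentence-initial hits in the full corpus
full_hits = {"p": 901_486, "d": 39_293}
total_hits = sum(full_hits.values())
shares = {k: round(100 * v / total_hits, 1) for k, v in full_hits.items()}

# Random samples of 500 per pronoun, and what remained after manual checking
drawn = {"p": 500, "d": 500}
checked = {"p": 465, "d": 436}
discarded = {k: drawn[k] - checked[k] for k in drawn}

# Texts whose last co-referential expression was a possessive pronoun
possessive = {"p": 48, "d": 6}
main_corpus = {k: checked[k] - possessive[k] for k in checked}

print(total_hits, shares)
print(discarded)
print(main_corpus)
```

Because equal numbers were drawn for the two pronouns, the sample is close to balanced while the full set of hits is not, which is why percentages are reported separately for each pronoun.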

38 The following properties of the antecedent were coded partly by hand, partly automatically in order to uncover the factors that govern the choice between p- and d-pronoun.

  • Givenness - number of mentions: the number of NPs referring to the antecedent’s referent, including the antecedent NP itself.
  • Givenness - given or new: the antecedent was classified as given if the preceding context contained at least one additional reference to it, that is, when the number of mentions was two or greater. Otherwise the antecedent was classified as new.
  • Syntactic function: all antecedents were classified as either subject or non-subject.
  • Position within clause: when no further referential NP occurred after the antecedent, it was classified as final. Otherwise it was classified as non-final. Non-referential NPs that were not counted when determining the clausal position were predicative NPs in copula constructions and NPs that are non-referential because they are part of an idiomatic expression.
  • Recency: the number of context sentences intervening between the pronoun and the antecedent. When the antecedent was contained in the context sentence immediately preceding the pronoun sentence, this number was zero.
  • Animacy: according to prescriptive grammars of German, it is impolite to refer to a person by a d-pronoun unless one wants to put special emphasis on the pronoun (Dudenredaktion, 2011). In order to test for such an influence, we coded all antecedents as human if they referred to persons or collections of humans like institutions or companies. All other antecedents were coded as non-human.
  • Definiteness: all antecedent NPs were classified into the six definiteness categories shown in [10]. The definitions for proper names, definite NPs and indefinite NPs follow the corpus study of Van Bergen & de Swart (2010).
    [10] a. p-pronoun: personal pronouns including possessive pronouns;
    b. d-pronoun: demonstrative pronouns used without a following noun;
    c. proper name: personal names, place names and names of companies;
    d. definite NP: nouns preceded by a definite article, a demonstrative article, a possessive determiner or a strong quantifier;
    e. indefinite NP: bare nouns, generic nouns and nouns preceded by a weak quantifier or an indefinite article;
    f. w-word: the w-word wer (“who”), either in a question or, more often, in a free relative clause, as in example [11].
    [11] [C -1] who no longer has this goal, there is no need to develop ways or partial steps. [T] The manages the existing.
    ‘[C -1] Who does not have this goal anymore does not need to develop ways and substeps. [T] He[d-pronoun] maintains what is established.’
    (corpus = “DeWaC-6” text = “496753” id = “”)
  • Ambiguity: all masculine singular NPs not co-referent with the pronoun were marked as competitors.
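As an illustration only, the coding scheme listed above could be held as one record per pronoun occurrence. The sketch below is not the authors' tooling: the class name `AntecedentCoding`, its field names, the choice to store competitors as a count, and the example values are all assumptions made for this illustration. It mainly shows how the categorical givenness value follows from the number of mentions.

```python
from dataclasses import dataclass

# Hypothetical value set mirroring the six definiteness categories in [10].
DEFINITENESS = {
    "p-pronoun", "d-pronoun", "proper name",
    "definite NP", "indefinite NP", "w-word",
}


@dataclass
class AntecedentCoding:
    """One coded pronoun occurrence (illustrative names, not from the study)."""
    pronoun_type: str   # "p" (p-pronoun) or "d" (d-pronoun)
    mentions: int       # NPs referring to the referent, antecedent NP included
    is_subject: bool    # syntactic function: subject vs. non-subject
    is_final: bool      # True if no further referential NP follows the antecedent
    recency: int        # intervening context sentences; 0 = immediately preceding
    is_human: bool      # persons or collections of humans
    definiteness: str   # one of DEFINITENESS
    competitors: int    # assumed: number of other masculine singular NPs

    def __post_init__(self):
        # Reject values that the coding scheme does not allow.
        if self.pronoun_type not in {"p", "d"}:
            raise ValueError("pronoun_type must be 'p' or 'd'")
        if self.mentions < 1 or self.recency < 0 or self.competitors < 0:
            raise ValueError("counts out of range")
        if self.definiteness not in DEFINITENESS:
            raise ValueError("unknown definiteness category")

    @property
    def givenness(self) -> str:
        # Given if there is at least one mention besides the antecedent itself.
        return "given" if self.mentions >= 2 else "new"


# Invented example values, for illustration only.
example = AntecedentCoding("d", 1, False, True, 0, False, "indefinite NP", 0)
print(example.givenness)
```

Deriving givenness from the stored number of mentions, rather than coding it separately, keeps the two givenness measures consistent by construction.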

3.3. Descriptive results

39 This section presents the results for the individual properties defined above. An analysis that takes all properties into account simultaneously is presented in the next section. All statistical analyses reported here and later were computed using the statistics software R, version 3.2.3 (R Development Core Team, 2015).

3.3.1. Recency, givenness and syntactic prominence

40 We start by considering three properties that are identified by Arnold (2010) as crucial for defining accessibility: the givenness, the syntactic prominence and the recency of the pronoun’s antecedent.

Table 1. Percentages of number of sentences intervening between pronoun and antecedent, depending on pronoun type


41 In order to determine the influence of recency, Table 1 shows the distance between pronoun and antecedent in terms of the number of intervening sentences. Table 1 reveals that in the vast majority of cases, no sentence intervenes between pronoun and antecedent. In other words, with few exceptions the antecedent occurs in the sentence immediately preceding the sentence containing the pronoun. This holds slightly more strongly for d-pronouns than for p-pronouns. Although the difference is small, it is significant (Fisher’s exact test, p = 0.006).

Figure 1. Number of mentions of the pronoun’s referent in the preceding context for er (p-pronoun) and der (d-pronoun)

42 With regard to givenness in terms of number of mentions, consider first Figure 1, which shows how often the referent of the pronoun was referred to in the preceding context. A single mention means that the antecedent NP was the only referential expression co-referent with the pronoun. When the number of mentions was higher than one, the antecedent was preceded by further mentions of the pronoun’s referent. Figure 1 reveals a clear difference between the p-pronoun er and the d-pronoun der (Fisher’s exact test, p < 0.001). In the majority of all cases (73.0%), the d-pronoun refers to a referent that has been mentioned only once in the preceding context. The number of cases in which the referent of the d-pronoun was mentioned more than once declines rapidly. For the p-pronoun, single-mention referents are the most frequent category too, but with 35.7% of all cases, they occur less often than referents that are mentioned more than once. Among the 64.3% of cases where a referent is mentioned more than once, the highest value is found for examples in which the referent of the p-pronoun is mentioned twice, but cases in which the referent of the p-pronoun is mentioned three times or more also occur with some regularity.

43 As explained above, the numeric variable “number of mentions” was converted into a categorical variable “givenness” with the two values “given” (number of mentions > 1) and “new” (number of mentions = 1). The results for givenness defined in this way and the two syntactic prominence properties of syntactic function and clausal position are shown in Table 2. For each property, this table shows separate percentages for p-pronouns and d-pronouns.

Table 2. Percentages of given vs. new, subject vs. non-subject, and non-final vs. final antecedent NPs, depending on pronoun type

Givenness    Syntactic function    Clausal position

44 For each property shown in Table 2, p-pronouns and d-pronouns behave in opposite ways (givenness: χ² = 119; syntactic function: χ² = 260; clausal position: χ² = 137; all p-values < 0.001). For the p-pronoun, the value that increases accessibility always occurs much more frequently than the value that decreases accessibility. The asymmetry is strongest for the syntactic function of the antecedent (84.7% vs. 15.3%) and weakest for the givenness of the antecedent (64.3% vs. 35.7%). The reverse pattern is found for the d-pronoun: in each case, the accessibility-decreasing value occurs more often than the accessibility-increasing one. Here, all three properties show roughly the same ratio of about 70:30. With regard to the effect of syntactic function, the results in Table 2 are close to those found by Bosch et al. (2003). When the pronoun’s antecedent was contained in the immediately preceding clause, Bosch et al. found that the antecedent for a p-pronoun was a subject in 86.7% of all cases whereas the antecedent for a d-pronoun was an object in 76.4% of all cases. The values found in our study are 84.7% subject antecedents for the p-pronoun and 70.2% object antecedents for the d-pronoun. Both corpus studies thus find that the subject bias for p-pronouns is stronger than the object bias for d-pronouns.
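To make the χ² comparisons above more concrete, the following sketch recomputes an uncorrected Pearson statistic for two of the 2 × 2 tables. It is an illustration under stated assumptions, not the authors' R analysis: the cell counts are back-calculated from the percentages and sample sizes reported in the text (417 p-pronouns, 430 d-pronouns) and may differ from the raw data through rounding, and the function names are invented for this example.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for the table [[a, b], [c, d]].

    Returns (chi2, p) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With df = 1, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


def counts_from_percentage(n, pct):
    """Assumed reconstruction: split n into two cells from one reported percentage."""
    first = round(n * pct / 100)
    return first, n - first


N_P, N_D = 417, 430  # sample sizes of the main corpus, as reported in the text

# Givenness: 64.3% of p-pronoun antecedents given, 73.0% of d-pronoun antecedents new.
p_given, p_new = counts_from_percentage(N_P, 64.3)
d_new, d_given = counts_from_percentage(N_D, 73.0)
chi2_givenness, p_givenness = pearson_chi2_2x2(p_given, p_new, d_given, d_new)

# Syntactic function: 84.7% subjects for the p-pronoun, 70.2% non-subjects for the d-pronoun.
p_subj, p_nonsubj = counts_from_percentage(N_P, 84.7)
d_nonsubj, d_subj = counts_from_percentage(N_D, 70.2)
chi2_function, p_function = pearson_chi2_2x2(p_subj, p_nonsubj, d_subj, d_nonsubj)

print(f"givenness: chi2 = {chi2_givenness:.1f}, p = {p_givenness:.2g}")
print(f"syntactic function: chi2 = {chi2_function:.1f}, p = {p_function:.2g}")
```

With these reconstructed counts, the two statistics come out at roughly 119 and 260, in line with the values reported above. Clausal position is left out because the text does not give the corresponding p-pronoun percentages.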

45 The joint distribution of the three properties included in Table 2 is shown in Figure 2 for both the p-pronoun er and the d-pronoun der. What is most striking is that the two graphs are approximately mirror images of each other. In particular, by far the largest area for the p-pronoun in Figure 2 corresponds to the feature combination “+subject, −final and given”, taking up 43.2% of all cases. For the d-pronoun in Figure 2, in contrast, the opposite feature combination “−subject, +final and new” takes up 45.4% and is thus as dominant as its counterpart for the p-pronoun. A further noteworthy finding revealed by Figure 2 is that for both pronouns, our corpus sample contains examples for all eight feature combinations. Thus, the three major properties defining accessibility discussed so far do not provide a categorical distinction between p-pronoun and d-pronoun, neither alone nor in combination.

Figure 2. Joint distribution of the three properties “syntactic function”, “position” and “givenness” of the antecedent of the p-pronoun er and the d-pronoun der
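A joint distribution over three binary properties, as plotted in Figure 2, can be tabulated by counting how often each of the eight feature combinations occurs. The sketch below is a minimal illustration: the function name `joint_distribution` and the tuple layout (subject, final, given) are assumptions, and the toy records are invented, not corpus data.

```python
from collections import Counter
from itertools import product


def joint_distribution(records):
    """Percentage of records per (subject, final, given) combination.

    All eight combinations are returned, including those with zero cases.
    """
    if not records:
        raise ValueError("no records to tabulate")
    counts = Counter(records)
    total = len(records)
    return {
        combo: 100 * counts[combo] / total
        for combo in product([True, False], repeat=3)
    }


# Invented toy records: (is_subject, is_final, is_given)
toy_records = [
    (True, False, True),
    (True, False, True),
    (False, True, False),
    (True, True, False),
]

for combo, pct in joint_distribution(toy_records).items():
    print(combo, f"{pct:.1f}%")
```

Listing the empty combinations explicitly makes it easy to check whether a sample contains examples of all eight, which is the observation made above for both pronouns.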


46 In sum, of the core properties defining salience according to Arnold (2010), givenness and syntactic prominence (syntactic function and clausal position) strongly differ between p-pronouns and d-pronouns. The antecedent of a p-pronoun is typically a given NP that occurs as subject in a non-final clausal position. The antecedent of a d-pronoun, in contrast, is typically a new NP that occurs as non-subject in clause-final position. Recency, in contrast, only showed a minimal difference between p-pronouns and d-pronouns. Both pronouns seem to require an antecedent that occurred recently and is therefore in an activated state in working memory. In this respect, pronouns differ from lexical NPs, which often take their antecedent over a longer distance (see Arnold, 2010 for discussion). The similar behavior of p- and d-pronouns with regard to recency can be attributed to the fact that both have the same impoverished lexical content (“masculine singular”).

3.3.2. Definiteness

Table 3. Relationship between definiteness and givenness of the antecedent NP

                   P-pronoun   D-pronoun   Proper name     Def   Indef   W-word
N                        165           8           192     335     136        7
Percentage given        97.0       100.0          51.6    30.8     7.4      0.0
Percentage new           3.0         0.0          48.4    69.2    92.7    100.0

47 Before discussing how the definiteness of the antecedent NP influences the choice between a p- and a d-pronoun, we first consider the relationship between the antecedent’s definiteness and its discourse status as given or new. Definiteness and givenness are expected to correlate. For example, when the antecedent is itself a pronoun, it must be given, that is, it must be preceded by some non-pronominal NP in the prior discourse. On the other hand, when the antecedent is an indefinite NP, it has likely been newly introduced to the discourse. In order to assess how strong the correlation between definiteness and givenness of the antecedent is, Table 3 shows the percentages of given and new uses for each definiteness category of the antecedent NP. When the antecedent was a p- or d-pronoun itself, it was almost always given. The fact that p-pronoun antecedents were given only 97% of the time is due to restricting the prior context to five sentences. The opposite behavior is found for indefinite NPs and w-words, which are new in the overwhelming majority of cases. Proper names and definite NPs are in between, with a slight preference for given antecedents for proper names and a moderate preference for new antecedents for definite NPs. The finding of 69.2% new uses for antecedents that are definite NPs may seem surprising if one considers the anaphoric use of definite NPs as basic. However, prior research has demonstrated that the use of definite NPs without a textual antecedent is quite common. For example, Fraurud (1990) found in a corpus study that 60.9% of the definite NPs were new. At 69.2%, the value that we found is only slightly higher. This slight increase is possibly again due to restricting the context to five sentences.

Table 4. Percentages (n) of definiteness categories of antecedents depending on pronoun type

P-pronoun   D-pronoun   Proper name   Def   Indef   W-word

48 Table 4 shows that the p- and the d-pronoun are associated with different distributions of the antecedent’s definiteness (Fisher’s exact test, p < 0.001). For the p-pronoun, the two most frequent definiteness categories are pronoun and definite NP, followed closely by proper name. Taken together, these three categories account for about 90% of all antecedents for the p-pronoun. For the d-pronoun, the three most frequent categories are proper name, definite NP and indefinite NP. These three also account for about 90% of all antecedents for the d-pronoun. There are two major differences between the p- and the d-pronoun. First, the p-pronoun’s antecedent is a p-pronoun itself in a substantial number of cases, whereas p-pronouns are rarely the antecedent for the d-pronoun, although some cases still occur. The second major difference is that indefinite antecedents occur much more often with the d-pronoun than with the p-pronoun.

49 These differences notwithstanding, Table 4 shows a large overlap between the p- and d-pronoun. For both pronouns, definite NPs and proper names together account for the majority of all antecedents (58.5% for the p-pronoun and 66.4% for the d-pronoun). For these two antecedent types, Table 5 shows how often the antecedent was already given in the prior discourse and how often it was newly introduced.

Table 5. Percentages (n) of given vs. new antecedent NPs depending on pronoun type, for proper name antecedents and definite antecedents