























This article examines the use of repetition and redundancy in I-novels and Romantic fiction through various measures, including type-token ratio, entropy, and long-range dependencies between sequences of characters. The authors aim to identify a generically distinctive tendency in these genres that transcends the meaning of the words on the page.
Peer-Reviewed By: Tomi Suzuki
Clusters: Genre
Article DOI: 10.22148/16.
Dataverse DOI: 10.7910/DVN/DAKZHR
Journal ISSN: 2371-
Cite: Hoyt Long, Anatoly Detwyler, and Yuancheng Zhu, “Self-Repetition and East Asian Literary Modernity, 1900-1930,” Cultural Analytics May 21, 2018. DOI: 10.22148/16.
Histories of East Asian literary modernity have often begun as historiographies of the narrative self. For some scholars, the emergence of a decidedly self-referential mode of fiction in the early twentieth century is part and parcel of what defines this modernity.^1 In Japan there was the “I-novel”; and in China, Romantic fiction. The two are recognized as foundational genres that distinguished themselves from prior fiction by the adoption of a narrow autobiographical focus, extended psycho-narration, and a new vernacular writing style. At the same time, others have struggled to define these genres in more precise stylistic or formal terms. Edward Fowler once said of the I-novel that “writing about [it] is not unlike pursuing a desert oasis…How is one to analyze a form that critics have debated for well over half a century but for which they have failed to come up with a workable definition?”^2 The case is similar for China’s Romantic fiction, which, since the work of C.T. Hsia and Leo Ou-fan Lee, has been conventionally defined by its social milieu rather than through a coherent set of generic qualities.^3
(^1) On the China side, see Robert Hegel and Richard Hessney, eds., Expressions of Self in Chinese Literature (New York: Columbia University Press, 1985), in particular the contribution of Leo Ou-fan Lee, “The Solitary Traveler: Images of the Self in Modern Chinese Literature” (282-307); Jaroslav Průšek, The Lyrical and the Epic: Studies of Modern Chinese Literature (Bloomington: Indiana University Press, 1980); and Lydia Liu, Translingual Practice: Literature, National Culture, and Translated Modernity—China, 1900-1937 (Stanford: Stanford University Press, 1995). For Japan, see Karatani Kōjin, Origins of Modern Japanese Literature, ed. Brett de Bary (Durham, NC: Duke University Press, 1993); James Fujii, Complicit Fictions: The Subject in the Modern Japanese Prose Narrative (Berkeley: University of California Press, 1993); and Janet Walker, The Japanese Novel of the Meiji Period and the Ideal of Individualism (Princeton, NJ: Princeton University Press, 1979).
Hoyt Long, Anatoly Detwyler, and Yuancheng Zhu Cultural Analytics
Definitional ambiguity is integral to how literary scholars understand genre: the identity of any one text will always be overdetermined. Equally important is the conceit that groups of texts can cohere in ways that differentiate them from others. In this paper we use computational methods to argue that a heightened tendency toward lexical repetition was a significant point of coherence for the narrative practices now captured under the signs of “I-novel” and Romantic literature. By tendency we mean something less than an essential trait found in all self-referential works but more than a minor feature found in only a few. The presence of this tendency in both cultural contexts prompts us to think about the role of repetition in literary style, but also of repetition as literary style. On the one hand, we suggest that repetition indexes specific formal transformations that I-novels and Romantic literature are commonly associated with: the vernacularization of writing and the adoption of Western grammatical structures. On the other, we propose that repetition also relates to changes at the level of content, in particular an emphasis on narratives of psychological realism and mental aberration. In this way, repetition as style becomes a way to identify, as a kind of surface phenomenon, a deeper and more complex set of interactions taking place between intellectual figurations of self and concrete linguistic strategies. Examining this surface through a computational lens, we propose, opens up a new comparative framework for analyzing the effects of these interactions across the space of East Asian literary modernity.
Our argument is divided into three sections. In the first we establish our rationale for associating repetitiveness with a set of qualitative traits that scholars have previously ascribed to Japanese I-novels and Chinese Romantic fiction. After collecting a set of measurable linguistic features that capture various kinds of repe-
(^2) Edward Fowler, The Rhetoric of Confession: Shishōsetsu in Early Twentieth-Century Japanese Fiction (Berkeley: University of California Press, 1988), 3. (^3) Hsia suggests that, beyond the work of a handful of representative individuals, Romanticism’s only distinguishing quality is a “maudlin sentimentality... completely deficient in restraint and objectivity.” See A History of Modern Chinese Fiction, Second Ed. (New Haven and London: Yale University Press, 1971), 95. Likewise, Lee concludes that Romanticism was definable largely by way of group libraries and clashes of personalities. See The Romantic Generation of Modern Chinese Writers (Cambridge, MA: Harvard University Press, 1973), 22.
rative fiction.^5 Ambiguity as to whether the I-novel and Romantic fiction are meaningful generic labels has even led to extreme relativist claims that deny the existence of a coherent form or genre at all; these labels, so the argument goes, are mere discursive and ideological paradigms through which any text can be read.^6 Some scholars, while not denying there is variation within this literature, have proceeded from the opposite assumption, treating the I-novel and Romantic fiction as genres bound by distinct formal or empirical patterns. Focusing on narrative structure, rhetorical style, or social and media context, they try to isolate a set of features that hold these texts together.^7
Our aim in this paper is not to rule over this long-running genre debate, a foreclosure that would in any case be antithetical to our role as literary critics. There is no single way to resolve such a debate because every attempt to argue for or against the ontological reality of these genre labels is predicated on different assumptions about the unit(s) of comparison. Is it the author? The ideal reader? Some aspect of the text? Here, we explicitly focus on shared linguistic patterns. They afford a scale of comparison that can encompass many hundreds of texts and multiple linguistic contexts. But they also provide a level of granularity through which we can potentially observe the stylistic tendencies that came together to instantiate the modern self as a literary construct. Or, to borrow from Franco Moretti’s analysis of bourgeois style, as a “mentality” made of “unconscious grammatical patterns and semantic associations, more than clear and distinct ideas.”^8 The first question we needed to answer was whether any such mentalities existed in the body of fiction grouped under the “I-novel” and “Romantic” labels.
(^5) For a thorough review of this criticism, particularly the contributions of Ito Sei, Hirano Ken, and Kobayashi Hideo, see Fowler, chapter 3; and Irmela Hijiya-Kirschnereit, Rituals of Self-Revelation: Shishōsetsu as Literary Genre and Socio-cultural Phenomenon (Cambridge, MA: Council on East Asian Studies, Harvard University, 1996), chapter 9. (^6) See Tomi Suzuki, Narrating the Self: Fictions of Japanese Modernity (Stanford, CA: Stanford University Press, 1996), 5-6. (^7) For China, see Edward Gunn, Rewriting Chinese: Style and Innovation in Twentieth-Century Chinese Prose (Stanford: SUP, 1991); Liu, Translingual Practice; Haiyan Lee, Revolution of the Heart: A Genealogy of Love in China, 1900-1950 (Stanford: Stanford University Press, 2007); and Raymond Hsu, The Style of Lu Hsun: Vocabulary and Usage (Hong Kong: Centre of Asian Studies, University of Hong Kong Press, 1979). For Japan, see Fowler, Rhetoric of Confession; Hijiya-Kirschnereit, Rituals of Self-Revelation; and Barbara Mito Reed, “Language, Narrative Structure, and the Shōsetsu” (diss. Princeton University, 1988). (^8) Moretti, The Bourgeois: Between History and Literature (London: Verso, 2013), 19.
Cultural Analytics Self-Repetition and East Asian Literary Modernity
As mentioned at the outset, there are several higher order phenomena that characterize this body of fiction. Scholars have long noted that its rise is intimately tied up with consolidation of the modern written vernacular under the genbun-itchi and baihua movements in Japan and China, respectively. Others point to the widespread experimentation with imported narrative techniques and Europeanized grammar that happened in conjunction with vernacularization, but which was necessarily distinct from it.^9 On the one hand, these imports include things like free indirect discourse, lengthy interior monologue, and a rejection of emplotment.^10 On the other, they include use of personal pronouns, inanimate subjects, Western syntax, and an exaggerated specification of subject/object relations. Indeed, much attention has been given to how Japanese and Chinese, as non-inflected languages that traditionally allow for great flexibility in whether or not to specify the grammatical subject, were simultaneously leveraged and deformed in the creation of new structures of self-narration. In the Japanese case, it has been argued that this flexibility allows for the slippages between narratorial authority and character viewpoint that blur the I-novel’s status as realist fiction.^11
While these complex developments in literary language provide an important foundation for understanding what was unique about self-referential fiction, they do not scale terribly well as features nor do they necessarily separate out such fiction from other contemporary genres that similarly adopted a vernacular style or Western grammatical structures. Our goal was thus to find a set of quantitative measures that allowed us to compare hundreds of texts while potentially singling out linguistic tendencies that indexed these higher order phenomena in self-referential fiction. In practice, this meant creating effective proxies for these phenomena that captured some aspect of their impact on literary language. From the perspective of plot and narrative, we reasoned that the more intense psychological focus of these texts might lend itself to a narrowing of the semantic field and less lexical diversity as compared with plot driven works and their more dynamic narrative focus. In other words, did I-novels and Romantic fiction tend to concentrate their lexical attention on a smaller vocabulary? And, from the perspective of style, we hypothesized that one result of the shift to vernacular writing might be increased repetition or redundancy in the language. The adoption of
(^9) For China, see Liu and Gunn. For Japan, see Kisaka Motoi, Kindai bunshō seiritsu no shosō [Various Aspects of the Formation of Modern Style] (Osaka: Wazumi shoin, 1988), Chapter 3. On the distinction between a vernacular style and shifts in the conceptual and grammatical structure of written Japanese, see also Karatani, 49-51. The two are typically seen as distinct, but interrelated movements in the development of the new literary language known as genbun itchi. (^10) With regard to emplotment, I-novels, for instance, have been described as “tedious” descriptions of “one’s life and nothing else” (Yasuoka Shōtarō, 25); “fragmented and short-winded” (Yokomitsu Ri’ichi, 52) or otherwise “random” accounts of personal experience (Kume Masao, 46); a “medium for intimate expression that would suffer from too much attention to structure” (Ito Sei, 63); and “a string of impressionistic musings” (Uno Koji, 7). All cited in Fowler. On the China side, Yu Dafu’s works have been singled out for emphasizing journeys that are “incomplete, aimless, and marked with uncertainties.” Cited in Liu, 149. And Guo Moruo famously responded to early criticism of one of his works by saying “that it was a mistake to read his story as a straightforward narrative with a beginning, a climax, and an ending—he was trying to present the unconscious in the form of dream symbolism.” Cited in Liu, 131. (^11) See Reed, 144-169; Fowler, Chapter 2; and Liu, 153-54.
particular sample of writing or speech, were more of the same words being repeated at a higher frequency or were many different words being used with less frequency? In 1935, George Zipf developed his eponymous law stating that the distribution of word frequency ranks in a given sample of natural language obey a power law, such that the frequency of any word is inversely proportional to its rank in the frequency table (i.e., the most frequent word will occur twice as many times as the second most frequent word, and so on). In 1938, John B. Carroll developed a diversity measure based on the observation that the growth of word diversity with text size must approach a limit. His measure focused on how often frequent words tended to be repeated in a passage, and he asserted that measures like this could help assess the relative adherence of one’s verbal behavior to linguistic norms.^15 The next year, Wendell Johnson introduced the notion of type-token ratio (TTR): the number of unique words in a text divided by the total number of words. He suspected the ratio might serve as “a measure of degree of frustration, or of disorientation,” and that it could serve to quantify the phenomena of “one-track mind,” or “monomania.”^16 The 1940s saw further attempts to build on these foundational measures in order to assess how repetitive, uniform, or concentrated was the vocabulary of a given segment of text. Some of these measures, to be described shortly, had the advantage of being less sensitive to variation in text length and of being able to dampen or ignore the influence of rare words.
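As a rough illustration of Johnson's type-token ratio, the following Python sketch divides the number of unique words by the total number of words. The function name `type_token_ratio` and the whitespace tokenization are assumptions made for this example only; Japanese and Chinese text would first require word segmentation, which is not shown.

```python
def type_token_ratio(tokens):
    """Number of unique word types divided by the total number of word tokens."""
    if not tokens:
        raise ValueError("need at least one token")
    return len(set(tokens)) / len(tokens)


# Toy English examples (hypothetical, not drawn from the corpora).
repetitive = "the sea the sea the sea".split()
varied = "waves break against grey northern cliffs".split()

print(type_token_ratio(repetitive))  # 2 types over 6 tokens
print(type_token_ratio(varied))      # 6 types over 6 tokens
```

A lower ratio indicates that fewer distinct words are doing more of the work, which is the sense of repetition at issue here.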
Significantly, they also shared a mathematical relation to a measure that grew to become highly influential in the 1950s and represented a different approach to the problem of repetition: entropy. Following the work of Claude Shannon and Warren Weaver at Bell Labs, a number of psycholinguists began to approach repetition in more probabilistic ways, analyzing not just the diversity of words used but the predictability of their sequential order, or what were referred to as “transitional probabilities.” They refocused ideas about repetition through the twin lenses of redundancy and information. In an information theoretic context, the amount of redundancy in a message (its entropy) reflects the amount of “information” in that message. Here, information means the likelihood of a message based on all the units available to constitute it, but also all the ways of combining these units given existing rules or patterns governing their arrangement. In short, information expresses how many different ways a message can be constructed given these initial constraints. An extremely information rich language, then, might be one where any given word is equally likely to appear next to any other. Every message in this artificial language would carry new information because
(^15) John Carroll, “Analysis of Verbal Behavior,” in Psychological Review 51 (March, 1944): 102-119. (^16) Wendell Johnson, Language and Speech Hygiene: An Application of General Semantics, Outline of a Course (Chicago: Chicago Institute of General Semantics, 1939), 11.
each would be as random and unpredictable as the one before it. The messages would also be wholly unintelligible, which is why all natural languages have some redundancy built into them.
While entropy proved a theoretically productive concept for many psycholinguists, it also proved very tricky to measure in any holistic way. Not only does it vary with the length of the text being measured, it also varies with the unit of analysis and the length of the sequence being considered. As a sequence grows longer, so too does the number of possible combinations with which to predict the randomness of the next item in this sequence. Entropy is thus biased by how much of a text or corpus is available to be measured and is also increasingly intractable as the number of units and their possible combinations increases. In practice, this meant that early applications of entropy to text were confined to smaller units of analysis (e.g., letters, syllables), because one could expect to see the fuller range of possible combinations in a given portion of text.^17 It also meant that focus remained on individual words or word pairs, such as in Gustav Herdan’s use of entropy to reason about how writers manipulated the variability of expression in their writing to avoid undue repetition.^18 When confined to the individual word level, entropy simply captures the spread of the total words of a sample amongst the different words available in that sample. The highest entropy passage in this case is one where every word is unique and different; the lowest is a passage where every word is the same, and thus highly redundant.^19
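The two limiting cases just described can be checked with a short Python sketch of word-level Shannon entropy. The function name `word_entropy` and the use of base-2 logarithms (bits) are assumptions of this illustration, not details taken from the study.

```python
import math
from collections import Counter


def word_entropy(tokens):
    """Shannon entropy (in bits) of the relative word frequencies in a sample."""
    if not tokens:
        raise ValueError("need at least one token")
    total = len(tokens)
    counts = Counter(tokens)
    # Sum of p * log2(1/p) over every distinct word in the sample.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


all_same = ["I"] * 8                       # every word identical
all_unique = [f"w{i}" for i in range(8)]   # every word different

print(word_entropy(all_same))    # lowest possible value for this length
print(word_entropy(all_unique))  # highest possible value: log2(8) bits
```

For a sample of n words, the maximum is log2(n), reached only when no word repeats, which is one reason the raw value depends on sample length.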
Despite the limitations of these various measures of lexical diversity and entropy, they do provide a baseline for quantifying the amount of repetition in a text. Using this baseline, we first determined whether I-novels or Romantic fiction show an exaggerated tendency to repeat as compared with other fiction written contemporaneously. Does the combination of a vernacular style, Western grammatical structures, and psychological focus translate into a narrower range of words being repeated more often? To answer this question, we first constructed corpora for each language. For Japan, we collected roughly 65 texts that scholars have specifically designated or read as belonging to the I-novel genre. We also included self-referential or psychological works by authors associated with the genre or by authors who briefly experimented with this mode of writing. The bulk of the works were published in the teens and twenties and represent about
(^17) See, for example, the work of Wilhelm Fucks who, in 1952, tried applying information theory to stylometrics and compared the entropy of syllables in prose versus poetry. “On the Mathematical Analysis of Style,” in Biometrika 39, no. 9 (1952): 122-129. (^18) Gustav Herdan, Language as Choice and Chance (Groningen: P. Noordhoff, 1956), 167. (^19) For additional critiques of entropy as a valid measure of lexical richness or style, see, for example, P. Thoiron, “Diversity Index and Entropy as Measures of Lexical Richness,” in Computers and the Humanities 20, no. 3 (1986): 197-202; and David Hoover, “Another Perspective on Vocabulary Richness,” in Computers and the Humanities, 37, no. 2 (2003): 151-178.
style of vernacular (jiu baihua) markedly different from the vernacular modes developed by Romantic writers. Thus the comparison in this case was carried out along the dimensions of content and linguistic style. These differences notwithstanding, our goal in both cases was to determine whether various measures of repetition and redundancy were sufficient to identify a generically distinctive tendency in I-novels and Romantic fiction that transcended the meaning of the words on the page.
The next step was thus to apply these measures. Because measures like TTR and entropy tend to be highly correlated with the length of the passage being measured, we applied them such that the results would be independent of text length. For these two especially, this meant dividing texts into 1,000 word segments; measuring the TTR and entropy for these segments, including stopwords; and then computing the average, standard deviation, and cumulative sum across all segments of a text (Equation 1).
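The segmenting procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the helper names (`segments`, `ttr`, `entropy`, `segment_summary`) are hypothetical, a trailing partial segment is simply dropped, the population standard deviation is used, and the cumulative sum is omitted because its exact definition is not reproduced in the text above.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def segments(tokens, size=1000):
    """Yield consecutive full-length segments (assumption: a short tail is dropped)."""
    for start in range(0, len(tokens) - size + 1, size):
        yield tokens[start:start + size]


def ttr(segment):
    """Type-token ratio of one segment."""
    return len(set(segment)) / len(segment)


def entropy(segment):
    """Word-level Shannon entropy of one segment, in bits."""
    total = len(segment)
    return sum((c / total) * math.log2(total / c) for c in Counter(segment).values())


def segment_summary(tokens, size=1000):
    """Mean and standard deviation of TTR and entropy over fixed-size segments."""
    segs = list(segments(tokens, size))
    if not segs:
        raise ValueError("text is shorter than one segment")
    ttrs = [ttr(s) for s in segs]
    ents = [entropy(s) for s in segs]
    return {
        "ttr_mean": mean(ttrs),
        "ttr_sd": pstdev(ttrs),
        "entropy_mean": mean(ents),
        "entropy_sd": pstdev(ents),
    }


# Toy usage with a tiny segment size so the arithmetic is easy to follow.
toy = ["a", "b", "a", "b", "c", "c", "c", "c", "x"]
print(segment_summary(toy, size=4))
```

Because every segment has the same length, the per-segment values are directly comparable, which is the point of measuring in fixed windows rather than over whole texts of differing size.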
Standard deviation tells us about the variance of TTR and entropy across all chunks, while cumulative sum tells us how much higher or lower than average the values tend to be. Aware that our entropy measure was tied to the marginal distribution of individual words, we also calculated entropy based on the joint distribution of words, taking their sequential nature into account. This method, borrowed from Ioannis Kontoyiannis, adopts a non-parametric approach that captures long-range dependencies between sequences of words or characters.^22 Here we chose to focus on sequences of individual phonetic and Chinese characters, such that lower entropy means more repetition of the same sequences of characters. While the window size for finding matching sequences is still dependent on the length of our shortest texts, biasing the resulting entropy estimates
(^22) I. Kontoyiannis, “The Complexity and Entropy of Literary Styles,” in NSF Technical Report, no. 97 (June 1996-October 1997): 1-15. In this case, it is non-parametric in the sense that it is not bound to the smaller contexts (unigrams, bigrams, etc.) of Markov-based entropy measures. Thus, for each position i in a text’s sequence of units (in our case, individual characters), the method looks for the longest sequence starting at i that does not occur prior to i. For example, at i = 100, it will scan for the longest sequence of characters that does not occur in the previous 100 characters. These lengths for various i are then used to estimate the entropy of the text as a whole.
to some degree, the estimates themselves are not correlated with text length.
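The match-length idea described in the note on Kontoyiannis can be sketched in Python as below. This is one common form of the estimator, offered as an assumption-laden illustration: the function names are hypothetical, the search window grows with the position rather than being fixed, edge effects near the end of the text are ignored, and the exact variant and window settings used in the study are not reproduced here. The naive search is slow and suited only to short strings.

```python
import math
import random


def match_length(text, i):
    """Length of the shortest substring starting at i that does not start anywhere before i."""
    length = 1
    # text[:i + length - 1] holds every substring of this length that starts before i.
    while i + length <= len(text) and text[i:i + length] in text[:i + length - 1]:
        length += 1
    return length


def match_length_entropy(text):
    """Entropy estimate in bits per character; long repeated sequences push it down."""
    n = len(text)
    if n < 2:
        raise ValueError("need at least two characters")
    total = sum(match_length(text, i) / math.log2(i + 1) for i in range(1, n))
    return (n - 1) / total


# Toy comparison: a strictly repeating string against a seeded pseudo-random one.
rng = random.Random(0)
noisy = "".join(rng.choice("abcdefgh") for _ in range(300))
print(match_length_entropy("ab" * 150))
print(match_length_entropy(noisy))
```

Long matches with earlier material make the summed ratio large and the estimate small, so a text that keeps returning to the same character sequences scores lower than one that does not.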
Concerned that TTR and entropy alone provided too narrow a window onto repetition, we implemented two additional features related to entropy mathematically but originally created as indexes of lexical diversity. The first is George Yule’s “Characteristic K,” developed in 1944 to measure the repetitiveness or uniformity of vocabulary in a text. It relies on word rank and frequency for its calculation, relating the sum of all word frequencies to the number of words with a particular frequency, and was designed by Yule to be independent of sample size.^23 It also assumes that word occurrence in a given sample of text follows a Poisson distribution, treating words as fixed events that occur with a known average rate for any interval (i.e., the length of the sample). Herdan later corrected for this assumption, developing a modified K that was widely adopted in the 1960s as a stylistic measure for the concentration of vocabulary, including attempts to analyze schizophrenic language.^24 Another feature we included is an index of lexical concentration developed, also in 1944, by French linguist Pierre Guiraud. “Guiraud’s C,” as it is known, expresses the proportion of a text’s cumulative word frequency taken up by its 50 most frequent “content” words. A high value of the index implies that “an author concentrates his attention on a relatively narrow range of words with full meaning,” which in turn testifies to “thematic compactness, to the concentration on the main theme, [and] in some cases also to stock phrases.”^25 This measure is more sensitive to text length than Yule’s K, and thus has less explanatory power, but its explanation is more intuitive. Both have the benefit of not requiring the splitting of texts into smaller chunks. And both, importantly, are akin to entropy in that they depend on the sums of relative word frequencies.^26
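Both indexes can be illustrated with a short Python sketch that follows the verbal definitions given in the notes: K as 10,000 x (M₂ - M₁)/(M₁ x M₁), and C as the summed frequency of the 50 most frequent words divided by the total number of words. The function names, the optional `stopwords` argument used to approximate "content" words, and the `top_n` parameter are assumptions of this example rather than details of the study.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's Characteristic K: 10,000 * (M2 - M1) / (M1 * M1)."""
    if not tokens:
        raise ValueError("need at least one token")
    m1 = len(tokens)                                  # M1: number of word tokens
    spectrum = Counter(Counter(tokens).values())      # m -> number of words occurring m times
    m2 = sum(m * m * v for m, v in spectrum.items())  # M2: sum of m^2 * V(m)
    return 10_000 * (m2 - m1) / (m1 * m1)


def guiraud_c(tokens, stopwords=frozenset(), top_n=50):
    """Share of all tokens taken up by the top_n most frequent non-stopword words."""
    if not tokens:
        raise ValueError("need at least one token")
    content = Counter(t for t in tokens if t not in stopwords)
    top_total = sum(count for _, count in content.most_common(top_n))
    return top_total / len(tokens)


toy = ["a", "a", "b"]
print(yules_k(toy))             # M1 = 3, M2 = 1*1 + 4*1 = 5
print(guiraud_c(toy, top_n=1))  # the single most frequent word covers 2 of 3 tokens
```

A text in which no word repeats has M₂ equal to M₁ and therefore K of zero; K rises as a few words absorb more of the total.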
Examining these measures individually, we find that nearly all are good at distinguishing I-novels and Romantic fiction from their popular contemporaries. The distributions of average TTR and entropy for the Japanese corpora indicate
(^23) See George Yule, The Statistical Study of Literary Vocabulary [1944] (Hamden, CT: Archon Books, 1968). The measure is calculated as follows: 10,000 x (M₂ - M₁)/(M₁ x M₁). M₁ is the number of word tokens. M₂ is calculated by multiplying the number of words at a given rank frequency by the square of that rank (e.g., all words occurring 2 times multiplied by 2²) and then summing over all of these values. (^24) Juhan Tuldava, “Stylistics, Author Identification,” in Quantitative Linguistics: An International Handbook, ed. Reinhard Köhler, et al. (Berlin: Walter de Gruyter, 2005), 374. See also Arthur Holstein, “A Statistical Analysis of Schizophrenic Language,” in Statistical Methods in Linguistics 4 (1965): 10:14. (^25) Tuldava, 375. Guiraud’s C is derived by summing the frequencies of the top 50 most frequent words and dividing through by the total number of words. (^26) On the relation of Yule’s K to entropy measures, see Kumiko Tanaka-Ishii and Shunsuke Aihara, “Computational Constancy Measures of Texts,” in Association for Computational Linguistics 41, no. 3 (2015): 481-502.
repetitive ones. A measure that did not show significant difference across categories was Kontoyiannis’s entropy measure, suggesting that no group of texts had significantly more long-range dependencies than the other. It did, however, when analyzed in combination with other features, help to identify some self-referential texts that were repetitive in ways our word-based measures could not capture, a point to which we will return. Overall, we were surprised to find that most measures pointed toward greater repetitiveness in this mode of fiction and that, importantly, this tendency seemed to hold true across languages.
Because these measures alone gave no indication as to the possible reasons for increased repetition, our next step was to triangulate them with finer-grained lexical and grammatical features. That is, we sought additional proxies for the higher order phenomena of vernacular style, grammatical structure, and self-referential content. This included obvious things like the mode of narration (whether first person or not) and the ratio of verbs related to thought and feeling.^29 It also included features likely to be associated with the influence of Western grammar and translated works: ratio of first or third person pronouns; ratio of punctuation; ratio of only periods; and ratio of grammatical function words (stopwords). On their own, all these features, aside from mode of narration, turned out to be reliable indicators of overall generic difference. We assumed this would be true of pronouns and “thought/feeling” verbs given the confessional and solipsistic nature of I-novels and Romantic fiction, but it was not obvious that this would be true for stopwords (which are more frequent in these works) as well as punctuation (which is less frequent). A possible reason for the latter, at least in the Japanese case, is that the works contain less dialogue.^30 Self-contemplation, we can imagine, does not leave time for small talk. Plotting these finer-grained features against our measures for repetitiveness, the most interesting finding was a correlation between entropy and the ratio of verbs signifying acts of contemplation, feeling, and mental attention. This relation holds for both Japan and China regardless of whether the work is narrated in the first or third person, but also holds within each genre (Figure 2).
(^29) For Japanese, the words we included were the following: 思, 感じ, 考え, 心持, 気分, 心配, 気持, 考へ. On the China side, we included the following words: 想, 觉得, 知道, 心里, 晓得, 精神, 想起, 感到, 觉, 感觉, 思想, 感情. (^30) We were not able to confirm this in the Chinese case because of less reliable OCR results for some of the popular texts. Further correction is necessary to ensure that punctuation accurately reflects the original texts.
Figure 2. Plots for the ratio of “thought/feeling” words against average entropy for Japan and China, with linear regression lines fitted by genre. In both cases, we can observe that as the ratio of “thought/feeling” words increases (horizontal axis), the mean entropy of the texts decreases (vertical axis), indicating more lexical repetition.
A comparison of the 100 most redundant passages with the 100 least redundant
Japanese Corpora:

            Popular   I-novel
Popular     5.1       1.
I-novel     0.9       5.

Chinese Corpora:

            Popular   Romantic
Popular     12.3      0.
Romantic    0.1       5.
Table 1. Confusion matrices for our logistic regression classifier. These matrices were produced using ten-fold cross validation and represent how often, on average, the classifier predicted the assigned class label. In the Chinese case, we can see that “Popular” works were almost never classified as “Romantic” works, and vice versa. In the Japanese case, “Popular” works were slightly harder to distinguish from “I-novels.”
Having discerned such a compulsion in early-twentieth-century self-referential fiction, it remains to be seen what this tendency means at the level of style or in terms of generating a new kind of “mentality.” And given the restricted definition of repetition we are using, this tendency needs to be situated against other ways for demarcating and reading the meaning of repetition. What our measures precisely capture is the relative degree to which a writer repeats the same limited set of words within a 1,000 word window. The more he or she does so across many such windows in a given text, the more repetitive is the text overall. Our goal is to understand whether this sustained compression of vocabulary corresponds to particular linguistic modes, narrative situations, or subject matter, but also whether it generates particular aesthetic effects.
Of course, readers do not read a text in discrete, 1,000 word chunks. Repetition as we measure it represents a narrow sliver of the many kinds of repetition that might interest literary scholars. J. Hillis Miller catalogs other alternatives in Fiction and Repetition: “On a small scale, there is repetition of verbal elements: words, figures of speech, shapes or gestures, or, more subtly, covert repetitions that act like metaphors…On a larger scale, events or scenes may be duplicated within the text…A character may repeat previous generations, or historical or mythological characters…Finally, an author may repeat in one novel
motifs, themes, characters, or events from his other novels.”^33 Miller goes on to suggest that we interpret novels in part by noticing these recurrences, for “any novel is a complex tissue of repetitions and of repetitions within repetitions, or of repetitions linked in chain fashion to other repetitions.”^34 The problem, of course, lies in this noticing. As Gilles Deleuze observes, repetition of some thing or event is essential to it acquiring a fixed identity in one’s mind—and so too the mind of the reader—but this identity is always virtual to the extent that repetition itself is posited by way of abstraction. We abstract out the infinite variations that intervene between one occurrence of a thing and the next in order to make the idea of repetition possible.^35 As readers, our noticing of repetition in a literary text is always predicated on some method for delimiting the boundaries of repetition and for holding at bay all the myriad dimensions along which any two instances of a thing or event can differ.
This method is easy to articulate when one is working with individual texts or focusing on smaller units of analysis, like phonemes or words. Studies of alliteration, parallelism, or rhyme in poetry are exemplary in this regard. It becomes harder, however, as these units grow in complexity and as one tries to follow that repetition across more than a handful of texts. To trace the repetition of a theme or motif, for example, requires significant abstractions in order to fix the identity of that theme or motif across many instances. The less consistency there is in these abstractions, the harder it is to assert that the same thing is being repeated, and the harder it is to provide a quantitative interpretation of this repetition, since repetition is meaningful to the extent that something is repeated more (or less) often than might be expected. Linguists who work on repetition are especially attuned to this fact, and thus take great care to clearly articulate both the object that is being counted and the background against which these counts acquire significance. A recent methodological survey, for instance, outlines no fewer than ten forms that repetition might take, including absolute repetition (a simple frequency); positional repetition (an unexpected higher or lower frequency at a given position in a text); associative repetition (two things coinciding more often than expected in a given frame); and repetition in blocks (a thing is repeated according to a lawful distribution over blocks of text).^36 In each case, importantly, it is assumed that repetition makes quantitative sense only relative to existing patterns of usage, whether in terms of the thing itself, its use with re-
(^33) J. Hillis Miller, Fiction and Repetition: Seven English Novels (Cambridge, MA: Harvard University Press, 1982), 1-2.
(^34) Miller, Fiction and Repetition, 2.
(^35) James Williams, Gilles Deleuze’s Difference and Repetition: A Critical Introduction and Guide (Edinburgh: Edinburgh University Press, 2013), 11-12.
(^36) For the full list, see Gabriel Altmann and Reinhard Köhler, Forms and Degrees of Repetition in Texts (Berlin: Walter de Gruyter, 2015), 5-6.
specific acts of repetition in speech or prose, instead focusing on dreams, games, and other forms of acting-out or repression.
The importance of linguistic repetition gained traction with the rise of psycholinguistics in the 1940s and 1950s, whereby the interpretation of repetition as a window into human psychology took a strongly quantitative turn. As William Levelt notes in his comprehensive history of the field, “it had suddenly become possible to quantify the amount of information transmitted between sender and receiver, its redundancy, transmission rate and noise in the channel, and so on.”^40 George Zipf’s research on word frequencies was an early precursor to this transformation, and his now famous law was motivated by his belief in a deep property of mind he called “the principle of least effort.” He derived this property from a model of communication in which speakers benefit from reducing “the size of [their] vocabulary to a single word” while listeners prefer to “increase the size of a vocabulary to a point where there will be a distinctly different word for each different meaning.”^41 It is the balancing out of these two forces in communication that generates the smooth rank-frequency relationship described by his law. Yet this norm was defined by observed deviations from it. Specifically, Zipf analyzed the recorded speech of autistic and schizophrenic patients and argued that a sharper negative slope in the rank-frequency relation meant a smaller set of words being overloaded with a greater set of meanings, suggesting that such patients were less inclined to adjust their private languages to a common cultural vocabulary.^42
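The "slope in the rank-frequency relation" can be illustrated with a short Python sketch. It ranks words by frequency and fits a straight line to log-frequency against log-rank by ordinary least squares. This fitting procedure is an assumption made for illustration; it is a crude estimator and is not claimed to be Zipf's own method. The function name `zipf_slope` is hypothetical.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a plain log-log regression, used here only to show what
    a "sharper negative slope" means numerically.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A more negative result indicates that usage is concentrated in the top-ranked words, the pattern Zipf read as a small set of words carrying a larger load of meanings.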
So, too, were other early psycholinguists like John Carroll and Wendell Johnson drawn to lexical repetition and diversity as indexes of deviation from social norms. Johnson participated in several studies in the early 1940s that used his TTR measure, among others, to compare speech and writing between adults and children, age groups, IQ groups, sexes, schizophrenics, and normal adults.^43 These studies found that higher IQ correlates with higher lexical diversity and higher TTR; that the college freshman’s TTR is slightly higher than the schizophrenic’s; and that speech on the telephone is more repetitive than schizophrenic speech. The notion that lower diversity in word use, and greater repetition, signaled abnormal conditions of some sort (e.g., less education, less ability to relate to others, or extreme orality) played an important part in early
(^40) Levelt, A History of Psycholinguistics: The Pre-Chomskyan Era (Oxford: Oxford University Press, 2013), 5.
(^41) George Zipf, Human Behavior and the Principle of Least Effort (Cambridge, MA: Addison-Wesley Press, 1949), 21. The thesis was originally formulated in The Psycho-Biology of Language: An Introduction to Dynamic Philology (Boston: Houghton Mifflin Company, 1935). His theories are summarized in Levelt, 453.
(^42) Zipf (1949): 285-87.
(^43) Levelt, A History of Psycholinguistics, 456.
psycholinguists’ ideas about language and cognition. Later on, entropy too, and its companion redundancy, would become compelling frameworks for thinking about the psychology of language, whether in Roman Jakobson’s musings on language as a code whose conventions differ between inner, affective language (which tends to be more redundant) and exteriorized, intellectual language; or Anthony Wilden’s use of redundancy to reinterpret Freud’s description of psychic symptoms as revealed in multiple, over-determined ways. The compulsion to repeat, he argues, is really a safeguard against inner mental noise.^44
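For readers unfamiliar with the information-theoretic terms, the sketch below computes Shannon entropy over a text's word frequencies and the corresponding redundancy, defined as one minus the ratio of observed entropy to the maximum possible for that vocabulary size. These are the standard textbook definitions at the level of single words; whether they match the exact entropy measure used in our model is not established in this section, so treat the code as an assumption-laden illustration. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(tokens):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(tokens):
    """1 - H / H_max, where H_max = log2(number of distinct words).

    Assumption: redundancy is measured relative to a uniform distribution
    over the text's own vocabulary.
    """
    vocab_size = len(set(tokens))
    if vocab_size < 2:
        raise ValueError("need at least two distinct words")
    return 1.0 - entropy_bits(tokens) / math.log2(vocab_size)
```

A text that uses all its words equally often has redundancy 0; the more usage is skewed toward a few words, the closer redundancy moves toward 1.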
Thus do forms of repetition help define and even construct the modern, psychological subject. This brief history adds another essential dimension to the rich hermeneutic space through which repetition can be read. As we have seen, it has offered a scale along which to imagine differences between orality and writing; between inner language and exteriorized speech; between isolating psychological conditions like schizophrenia and normative, socially-aware subjectivity. By quantifying the repetitive tendencies of I-novels and Romantic fiction, we gain access to this space at the scale of hundreds of texts. Our measures also help us to orient texts within this space along a continuum. We can do so in terms of their relative redundancy, but also by considering the extent to which their measured features, as a composite, cohere with the features observed in one genre and not the other. The following plot shows the Japanese texts most likely to be “I-novels” as judged by our classifier and the features in our model (Figure 3). The higher a work appears in the plot, the more confident the classifier is that the work shares the quantitative tendencies observed in other “I-novels” in our corpus.
(^44) Roman Jakobson, “Langue and Parole: Code and Message,” in On Language, eds. Linda R. Waugh and Monique Monville-Burston (Cambridge, MA: Harvard University Press, 1990): 97-98; and Anthony Wilden, System and Structure: Essays in Communication and Exchange (London: Tavistock Publications Limited, 1972), 35-37. More recently, the Stanford Literary Lab, in a study of the differences between popular and canonical novels, has hinted at a potential link between repetition, as measured by TTR, and narratives of trauma. See Mark Algee-Hewitt, et al., “Canon/Archive” (2015), 9-10.