  E Bruce Brooks
    Feb 28, 2012
      To: Synoptic
      Cc: GPG, WSW
      In Response To: Ron
      On: Word Statistics and Text Relations
      From: Bruce

      I think I may have mentioned this before, so my responses are probably
      superfluous. But the details in question seem to me important as matters of
      technique (not quite at the level of methodology, but methodology assumes
      technique), so I take the risk of repeating myself.

      RON: Suppose we have two documents, A and B, and we have reason to suspect
      that parts of one were copied from the other.

      BRUCE: What reason? That is certainly part of the question, and it would be
      better to have the whole question before us. Specific example perhaps
      preferable. Sometimes these things are cumulative, in which case a decision
      reached on only one component may be faulty. The whole squirrel may be able
      to run faster than the separated leg of the squirrel.

      [I will also enter a plea against the word "copied." Authors are not
      mediaeval scribes, trying to reproduce a Vorlage exactly. They are
      independent writers, aware of and sometimes influenced by something out
      there in the circumambient textual world. The situation is different, and
      the vocabulary should also be different. Luke is not a failed scribal copy
      of Mark. It is an independent, and at points hostile, recreation and
      supplementation of Mark, along certain well-defined doctrinal lines, and
      with a different stylistic ambition].

      RON: We observe that there are (say) 6 instances of an unusual word or
      phrase in document A and one instance of the same unusual word or phrase in
      document B having the same context as that of one of the 6 in document A.

      BRUCE: Unusual to whom? NT scholarship (as it presents itself to this
      outsider) is plagued with a fascination for hapax legomena as such. That
      term divides the wordstock of the language into two piles, remarkable and
      ordinary. I find
      that it is more powerful, in applying word tests, to use general language
      frequency as a base expectation. Why? Because the wordstock is not divided
      into two piles, it is graded into a continually ascending series of
      occurrence expectations. The remarkableness of occurrence of any word in any
      text thus depends on the general frequency of the word in question and on
      the length of the piece, since the number of occurrences we should expect
      grows with both. It is accordingly dubious to say "6
      occurrences" and leave out of account whether the passage in question is 60
      or 6000 words long. For such reasons, percentages are usually preferable in
      careful work. Not only do they give "unusual" a more precise meaning, and
      allow a more nuanced application, but they permit us to notice cases where
      a word, so common that we expect it to occur in a piece of a given size, is
      missing from such a piece. Not an argument from silence, but rather one from
      nonoccurrence. I think that a complete technique should include the
      possibility of these negative observations.
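
      A minimal sketch of the arithmetic behind this point, assuming a simple
      Poisson model in which a word with a known general-language rate occurs
      independently at each position. That independence is an idealization
      (literary texts repeat words for literary reasons), and the rate of one
      occurrence per thousand words used in the usage note is a made-up figure,
      not a measured one. The function names are illustrative only.

```python
import math


def expected_count(rate_per_word: float, passage_length: int) -> float:
    """Expected occurrences of a word, given its general-language rate
    (occurrences per running word) and the passage length in words."""
    return rate_per_word * passage_length


def prob_absent(rate_per_word: float, passage_length: int) -> float:
    """Probability of zero occurrences under the assumed Poisson model.
    A small value means the word's absence is itself noteworthy."""
    return math.exp(-expected_count(rate_per_word, passage_length))


def prob_at_least(k: int, rate_per_word: float, passage_length: int) -> float:
    """Probability of k or more occurrences under the same assumed model."""
    lam = expected_count(rate_per_word, passage_length)
    below_k = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - below_k
```

      With the hypothetical rate of 0.001, prob_at_least(6, 0.001, 60) is
      vanishingly small, while prob_at_least(6, 0.001, 6000) is better than
      even: the same "6 occurrences" is remarkable in one passage and ordinary
      in the other. Conversely, prob_absent(0.001, 6000) is about 0.0025, so the
      absence of that word from the longer passage would call for explanation.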

      [There exists a short piece on how to estimate the weight of a word absence.
      I attach it here, though it may not make it through the attachment filter of
      one or more of the lists here addressed. If anyone cares to see the piece
      but did not receive it in distribution, they are welcome to write me].

      RON: Other things being equal, common sense suggests that it is more likely
      that the writer of document B was copying from document A rather than the
      other way round. This is because both the alternative explanations suggest a
      rather strange coincidence. Either the writer of A just happened to develop
      a fondness for this unusual word or phrase as a result of noticing it in
      document B.

      BRUCE: Or for some other reason. The problem with rare words is that, on
      first occurrence, they still count as rare, but repeated occurrences (say, 6
      of them) in the same passage do not count as "6 times as unusual." They
      count as only a little more unusual than 1 occurrence would be: perhaps
      nearer 1.8 than 6. Why? Because texts are literary objects, not random
      number generators or beta decay experiments, and a writer who uses a rare
      word once (say, in a story about a Gerbil, or a sketch of a person named
      Geronimo) will tend to use it repeatedly because the situation invites, or
      even requires, such repetition. These are the literary facts, and they
      preclude taking the numbers simply as numerical facts. The numbers are a
      penumbra cast by the literature; they have no independent statistical reality of
      their own.
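
      One way to picture this discounting of repeats, offered only as a sketch:
      the post gives no formula for its "nearer 1.8 than 6," and the
      logarithmic damping below is an assumption borrowed from common
      term-weighting practice, chosen because it happens to land near that
      figure. The function name is illustrative.

```python
import math


def damped_weight(occurrences: int) -> float:
    """Evidential weight of repeated occurrences of one rare word in one passage.

    Assumed rule (not taken from the post): the first occurrence counts fully,
    and each further occurrence adds progressively less, via 1 + log10(n).
    """
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    if occurrences == 0:
        return 0.0
    return 1.0 + math.log10(occurrences)
```

      Under this assumed rule, damped_weight(1) is 1.0 and damped_weight(6) is
      roughly 1.78, so six occurrences weigh well under twice as much as one.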

      RON: Or the writer of document B just happened to make use of an unusual
      word or phrase of which the writer of document A was already particularly
      fond.

      BRUCE: Again, we would do better to have the word in question exhibited. Is
      it one whose horizon of normal occurrence is stories of a particular type
      (that is, a generic word, associated with a genre rather than with the
      language as a whole), and do both A and B belong to that type? In that case,
      both the occurrence statistics may be genre-induced, and thus not
      significant of any lateral relation between the two. A story that begins
      "Once upon a time" is not necessarily copying a story that also begins "once
      upon a time." It may be merely following the conventions of that kind of
      story.

      Numbers are good once we establish the zone in which they, or the
      statistical expectation that often accompanies them, properly apply. We can
      then turn on our calculator and proceed. But up to then, we need to use our
      literary sensibility. Is one 4c martyrdom tale indebted to another, or are
      they both simply 4c martyrdom stories? Does 1 Clement quote Ephesians here,
      or do both texts invoke an established prayer or creed or other ritually
      fixed form of words?

      For a good textbook example of the weighing of text relationships, largely
      centering on, but not analytically limited to, the weighing of word
      relationships, I can still commend the so-called Oxford Committee's
      collaborative 1905 work, The New Testament in the Apostolic Fathers
      (recently and adequately reprinted by Kessinger). The purpose is to see what
      texts the chosen Church Fathers (one of whom, amusingly, is the Didachist)
      were aware of. The final decisions tend to be severe (some relations which a
      given author deems unmistakable are coded in the final table only as Grade
      C), but immersing oneself in the discussion seems to me a good way to get a
      sense of the kind of contextual considerations that properly apply to these
      judgements. Also relevant, and also considered in the prefatory matter to
      each section, is the style of a given writer in quoting texts, which (for
      example) often differs between OT and NT material. This is what we might
      call calibration: establishing, before we investigate one instance, what we
      can reasonably expect of that instance in the way of verbal precision or
      explicit acknowledgement. (In the NT case, but not the OT case, few or

      And as a bonus, there are some very secure final judgements about who knew
      what when. The case for 1 Clement's knowledge of Ephesians, for example,
      though it gets only a D on the final scorecard (whereas Romans gets an A),
      seems to me incontrovertible. Of particular moment, in my estimation, is
      that the possible resemblances to Ephesians are not scattered over 1
      Clement, but in some cases occur bunched (see p53). So here is another point
      of consequence: not simply word count, but also word distribution.
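
      A rough sketch of how bunching might be told apart from scattering,
      assuming the resemblances have been located as word offsets within the
      text. The measure used here (variance of per-block counts divided by
      their mean) is one conventional choice and not something proposed in the
      Oxford volume; the block count and the positions in the usage note are
      hypothetical.

```python
from statistics import mean, pvariance


def block_counts(positions, text_length, n_blocks):
    """Count how many of the given word offsets fall in each equal-sized block."""
    counts = [0] * n_blocks
    for pos in positions:
        if not 0 <= pos < text_length:
            raise ValueError(f"position {pos} lies outside the text")
        counts[pos * n_blocks // text_length] += 1
    return counts


def dispersion_index(positions, text_length, n_blocks=10):
    """Variance-to-mean ratio of the block counts.

    Values well above 1 suggest bunching; values near or below 1 suggest
    occurrences spread evenly through the text.
    """
    counts = block_counts(positions, text_length, n_blocks)
    average = mean(counts)
    if average == 0:
        raise ValueError("no occurrences to measure")
    return pvariance(counts) / average
```

      For a 1000-word text, six resemblances at offsets 500, 510, 520, 530,
      540, 550 give an index of 5.4, while six at 50, 250, 450, 650, 850, 950
      give 0.4: the same count, but a very different distribution.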

      I don't think that practical use of word data in estimating text
      relationships can well afford to leave these things out.

      Respectfully suggested,


      E Bruce Brooks
      University of Massachusetts at Amherst

      Modest Reminder: Those without the Oxford monograph can easily order it via:


      with small but not negligible incidental benefit to the cause of analytical

