Literary Metrics


Feb. 2005



Consider the problem of quickly and efficiently determining the nature of a body of text. A human can usually decide with ease whether a text is a news article, a poem, a full-length novel, an essay, or an enumeration of some kind, such as a food recipe. In the computational realm, however, the problem becomes more difficult. We are interested here, among other things, in evaluating the ability of relatively simple algorithms to produce simplified numeric profiles which can be used to group and compare bodies of text.

Our first experiment examines the insight given by a short list of numbers representing the weighted frequency of word lengths. The coefficients are obtained by calculating, for each of a set of word lengths, the percentage of words in the text that are of that length, and multiplying it by a weight factor that grows exponentially with word length. This compensates for the exponential decrease in percentage at larger lengths, giving us more descriptive curves. We will examine five bodies of text: three classic novels (Daniel Defoe's Robinson Crusoe, Charles Dickens' A Tale of Two Cities and H. G. Wells' The Invisible Man), a collection of randomly chosen news articles from CNN.com, and an essay by Herbert Marcuse (Aggressiveness in Advanced Industrial Society).
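The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the program used for the experiment: the function name weighted_length_profile, the tokenizing rule (runs of letters), and the weight base of 1.5 are all assumptions, since the article does not state the weights it used. The numbers it produces will therefore not match the table.

```python
import re
from collections import Counter

def weighted_length_profile(text, lengths=range(5, 16), base=1.5):
    """Return {word length: weighted percentage} for the given lengths.

    ASSUMPTION: the weight is base ** (length - shortest length). The
    article only says the weight grows exponentially with word length.
    """
    # ASSUMPTION: a "word" is any unbroken run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {n: 0.0 for n in lengths}
    counts = Counter(len(w) for w in words)
    shortest = min(lengths)
    profile = {}
    for n in lengths:
        # Percentage of all words in the text that have exactly n letters.
        percentage = 100.0 * counts[n] / len(words)
        profile[n] = percentage * base ** (n - shortest)
    return profile
```

Calling weighted_length_profile(open("novel.txt").read()) on a text file (the file name is hypothetical) would return one coefficient per word length from 5 to 15, which can then be plotted as a curve.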



Table of weighted word coefficients by length:

Word Length   R. Crusoe   Tale of 2 Cities   Invisible Man   CNN.com articles   Marcuse
 5             4           3                  3               5                  3
 6             5           5                  5               6                  5
 7             6           8                  9              12                 10
 8             7          11                 11              19                 15
 9            10          15                 16              26                 28
10            13          15                 18              29                 54
11            13          16                 18              25                 67
12            10          21                 25              34                 87
13            12          22                 29              39                 61
14             6          17                 22              53                 91
15             5          13                 18              31                 66


The differences between the three classic novels and the denser, more compact recent writing are apparent in the curves. This isn't surprising, as the types of writing in question differ substantially in many ways, but it is encouraging to see the potential of relatively simple numeric evaluations in categorizing qualitatively different texts.
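One simple way to make the grouping explicit is to treat each column of the table above as a vector and measure the distance between vectors. The sketch below uses plain Euclidean distance, which is our choice for illustration and not something the experiment itself prescribes; the names PROFILES, distance and nearest are likewise hypothetical. The values are copied from the table.

```python
import math

# Coefficients copied from the table, for word lengths 5 through 15.
PROFILES = {
    "R. Crusoe":        [4, 5, 6, 7, 10, 13, 13, 10, 12, 6, 5],
    "Tale of 2 Cities": [3, 5, 8, 11, 15, 15, 16, 21, 22, 17, 13],
    "Invisible Man":    [3, 5, 9, 11, 16, 18, 18, 25, 29, 22, 18],
    "CNN.com articles": [5, 6, 12, 19, 26, 29, 25, 34, 39, 53, 31],
    "Marcuse":          [3, 5, 10, 15, 28, 54, 67, 87, 61, 91, 66],
}

def distance(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name, profiles=PROFILES):
    """Return the name of the profile closest to the one called `name`."""
    others = (k for k in profiles if k != name)
    return min(others, key=lambda k: distance(profiles[name], profiles[k]))
```

With these figures, every pair of novels is closer together than any novel is to the CNN.com articles or to the Marcuse essay, which matches what the curves suggest. A distance over raw coefficients is dominated by the longer word lengths, where the values are largest; whether that is desirable depends on the comparison being made.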

However, without any understanding of language, can simple (this qualification is important) algorithms help a human user understand the general idea of a body of writing? We can quickly envision a number of very rudimentary tools that could be included in a software package whose purpose is to help its users "get the gist" of large texts.

In the following examples, we'll run some simple logic on The Invisible Man. For instance, we might start by checking which phrases of a given length (in words) occur most often; the number of occurrences is given in parentheses:


Length 3
THE INVISIBLE MAN (60)
OUT OF THE (47)
THERE WAS A (29)
SAID THE INVISIBLE (26)
FOR A MOMENT (23)
IN ANOTHER MOMENT (21)
SAID THE VOICE (19)
ONE OF THE (18)
AND THERE WAS (17)
BACK TO THE (14)

Length 4
SAID THE INVISIBLE MAN (14)
THE MAN WITH THE (12)
THE DOOR OF THE (10)
AND THERE WAS A (10)
FOR THE MOST PART (9)
THE COACH AND HORSES (9)
FOR A MOMENT AND (9)
OUT OF THE WINDOW (8)
IN ANOTHER MOMENT HE (7)
THE CORNER OF THE (7)

Length 5
THE MAN WITH THE BLACK (7)
SAID THE MAN WITH THE (6)
IN THE MIDDLE OF THE (6)
MAKE SO BOLD AS TO (5)
MAN WITH THE BLACK BEARD (5)
IN ANOTHER MOMENT HE WAS (4)
FOR THE MOST PART IN (4)
THE BURGLARY AT THE VICARAGE (4)
PUT IT DOWN IN THE (4)
FROM THE DIRECTION OF THE (4)

Length 6
THE MAN WITH THE BLACK BEARD (5)
SAID THE MAN WITH THE BLACK (5)
MAN WITH THE BLACK BEARD AND (4)
IS TO THE MIND OF A (3)
HE WAS STRUCK IN THE MOUTH (3)
OUT HIS CHEEKS AND HIS EYES (3)
MAKE SO BOLD AS TO SAY (3)
WENT TO THE WINDOW AND STARED (3)
IT IS KILLING WE MUST DO (3)
WAS THE STRANGEST THING IN THE (3)
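
Phrase counts like those listed above can be produced by sliding a window of n words over the text and tallying each window. The following is a minimal sketch under our own assumptions: the function name common_phrases is hypothetical, words are taken to be runs of letters with an optional internal apostrophe, and punctuation and sentence boundaries are ignored, so a phrase may span two sentences.

```python
import re
from collections import Counter

def common_phrases(text, n, top=10):
    """Return the `top` most frequent n-word phrases as (phrase, count) pairs."""
    # ASSUMPTION: case is ignored and punctuation is discarded.
    words = re.findall(r"[A-Z]+(?:'[A-Z]+)?", text.upper())
    # zip over n staggered copies of the word list yields every n-word window.
    windows = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(w) for w in windows)
    return counts.most_common(top)
```

For example, common_phrases(text, 3) would return up to ten three-word phrases with their counts, in the same form as the lists above. Very common function-word phrases such as OUT OF THE rank high under this approach; filtering them out would require a stop-word list, which is beyond this sketch.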


From these outputs, we see that "the invisible man" and "the man with the black beard" are major characters. We also see that there is violence and crime at some point in the novel, revealed by the phrases "it is killing we must do", "he was struck in the mouth" and "the burglary at the vicarage". We may, furthermore, wonder how widespread in the text some of these phrases are. For instance, we can easily produce a graphic representation of the occurrence of certain phrases in the text. If we check for "invisible man" and "man with the black beard", we get the following outputs:







We can see that while "the invisible man" seems to be involved in the entire body of the book, "the man with the black beard" is present only for a brief and relatively intense (he registers among the most common phrases in the novel) episode around the book's middle.
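
A distribution of this kind can be approximated by dividing the text into equal segments and counting how often the phrase begins in each one. The sketch below is an assumption about how such a plot might be built, not a description of the tool that produced the original graphics; the names phrase_distribution and render and the default of 20 segments are hypothetical.

```python
import re

def phrase_distribution(text, phrase, bins=20):
    """Count occurrences of `phrase` in each of `bins` equal slices of the text."""
    # ASSUMPTION: case-insensitive matching on whole words, punctuation ignored.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    target = phrase.lower().split()
    counts = [0] * bins
    if not target:
        return counts
    n, total = len(target), len(words)
    for i in range(total - n + 1):
        if words[i:i + n] == target:
            # Map the word position i to a slice index in 0 .. bins - 1.
            counts[i * bins // total] += 1
    return counts

def render(counts):
    """Draw one row per slice, with one '#' per occurrence."""
    return "\n".join(f"{i:3d} {'#' * c}" for i, c in enumerate(counts))
```

Printing render(phrase_distribution(text, "invisible man")) gives a crude text histogram: a phrase used throughout the book fills most rows, while one confined to a single episode shows a short cluster of marks.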

Continuing in this style, we can see that grammar-based language parsers and other relatively complex natural language processing mechanisms are not required here: a large collection of simple tools for quick textual queries would already make a useful application, one that enables rapid insight into specific aspects of a large body of text.