Macbeth and XML


Shakespeare's plays are available in XML form at, e.g. Macbeth. The DTD that defines the structure of these documents can be found at

Why XML?

XML is designed to be readable by both human and machine, but Macbeth in raw XML code is not my idea of a good read. The strength of XML is that it's easy to transform it into other forms.

Different purposes

Someone handling the props for a Macbeth production will have different needs from someone reading the play by the fireplace. The latter would have little interest in a list of the props that each persona needs.

Different presentation formats

In an HTML version, it makes sense to supply the title attribute "king of Scotland" each time the content of the speaker element is "DUNCAN". In print, presenting the speaker as "DUNCAN, king of Scotland" each time would be silly, and a waste of paper. In a WAP version, the play would have to be split into many WML pages, if for no other reason than the memory limitations of WAP devices.
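A minimal sketch of the HTML idea, written in Python rather than XSLT. It assumes lowercase element names (speech, speaker, line); the DESCRIPTIONS table and the speaker_to_html function are hypothetical names of mine, not part of any existing style sheet.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical fragment; lowercase element names are an assumption.
SOURCE = """
<speech>
  <speaker>DUNCAN</speaker>
  <line>What bloody man is that? He can report,</line>
</speech>
"""

# Hypothetical lookup table from speaker name to description.
DESCRIPTIONS = {"DUNCAN": "king of Scotland"}


def speaker_to_html(speaker_name):
    """Render a speaker as an HTML span, with a title attribute if a description is known."""
    description = DESCRIPTIONS.get(speaker_name)
    if description is None:
        return '<span class="speaker">%s</span>' % escape(speaker_name)
    return '<span class="speaker" title="%s">%s</span>' % (
        escape(description, quote=True),
        escape(speaker_name),
    )


speech = ET.fromstring(SOURCE)
html_speaker = speaker_to_html(speech.findtext("speaker"))
print(html_speaker)
```

A print-oriented transformation would simply skip the lookup and emit the bare name.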


The markup doesn't represent all relations between personae, speakers, and persons who are spoken of or mentioned in stagedir elements. "The merciless Macdonwald" is not found in the personae element. Shakespeare wrote for the production team, and they don't need information on Macdonwald, because he requires no actor, no costume, no props. The literary reader might still want to know who the man is.

Likewise, the stagedir element containing "Enter DUNCAN" does not, in any way visible to the XML parser, connect with the speaker who says "What bloody man is that? He can report", or with the persona "DUNCAN, king of Scotland".

Not all speakers are represented as a persona. "First Witch", for example, is not. "Three Witches" is a persona; a human can figure out that the first witch is one of the three, but an XML parser can't.

For the pgroup element, there is a nice distinction between the persona names and the grpdescr description. But for non-grouped persona elements, the name and the description are combined in one string, which breaks the link between persona and speaker elements.
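To make the broken link concrete, here is a small sketch in Python. The personae fragment is a hypothetical, abbreviated one with assumed lowercase element names, and has_persona is a helper name I made up. A naive exact match between speaker and persona content works only for the grouped persona elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated personae list; lowercase element names are an assumption.
PERSONAE = """
<personae>
  <persona>DUNCAN, king of Scotland.</persona>
  <pgroup>
    <persona>MALCOLM</persona>
    <persona>DONALBAIN</persona>
    <grpdescr>his sons.</grpdescr>
  </pgroup>
  <persona>Three Witches.</persona>
</personae>
"""

root = ET.fromstring(PERSONAE)

# Collect the text of every persona element, grouped or not.
persona_texts = {p.text for p in root.iter("persona")}


def has_persona(speaker):
    """Return True if a speaker name matches some persona content exactly."""
    return speaker in persona_texts


for name in ("MALCOLM", "DUNCAN", "First Witch"):
    print(name, has_persona(name))
```

Only MALCOLM matches: DUNCAN is buried in a longer string, and First Witch has no persona of her own.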

Shakespeare's descriptions rely on context. Reading about "DUNCAN king of Scotland and his sons MALCOLM and DONALBAIN" makes sense, but when the description is removed from its context by an XML transformation, being told that MALCOLM is "his sons" is just ungrammatical nonsense.


"Enter" and "Exit" are quite special stage directions, probably worthy of their own elements.

It seems that an unspecified "Exit" means that the previous speaker exits. From an XML point of view, it would be prettier if the exit were a child or attribute of the speech, mainly because child/parent relations yield more mainstream XSLT code than previous/next relations.
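A rough sketch of the previous/next problem, in Python rather than XSLT. It assumes that the stagedir is a sibling following the speech, that element names are lowercase, and that the element is spelled speech; who_exits is a hypothetical helper. The code has to walk the scene in document order and remember the last speaker, which is exactly the bookkeeping a child or attribute would make unnecessary.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated scene; the sibling layout is an assumption.
SCENE = """
<scene>
  <speech>
    <speaker>MACBETH</speaker>
    <line>That summons thee to heaven or to hell.</line>
  </speech>
  <stagedir>Exit</stagedir>
</scene>
"""


def who_exits(scene):
    """Resolve each bare "Exit" to the speaker of the closest preceding speech."""
    exits = []
    last_speaker = None
    for child in scene:  # direct children, in document order
        if child.tag == "speech":
            last_speaker = child.findtext("speaker")
        elif child.tag == "stagedir" and (child.text or "").strip() == "Exit":
            exits.append(last_speaker)
    return exits


print(who_exits(ET.fromstring(SCENE)))
```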


I think it would be a good thing to let persona, act, scene, speech and line have ID attributes. A persona ID is useful for establishing a connection when speaker and persona content don't match, as when "First Witch" speaks, who belongs to the "Three Witches" persona.
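One way this could look, sketched in Python. The id attribute on persona and the persona reference attribute on speech are hypothetical names for illustration only, as is the persona_for helper; the source markup has no such attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: "id" and "persona" attributes are my own invention.
PLAY = """
<play>
  <personae>
    <persona id="witches">Three Witches.</persona>
  </personae>
  <speech id="s1" persona="witches">
    <speaker>First Witch</speaker>
    <line>When shall we three meet again</line>
  </speech>
</play>
"""

play = ET.fromstring(PLAY)

# Index persona elements by their ID so a speech can look its persona up directly.
personae_by_id = {p.get("id"): p for p in play.iter("persona")}


def persona_for(speech):
    """Return the persona element a speech refers to, or None if there is no reference."""
    return personae_by_id.get(speech.get("persona"))


first_speech = play.find("speech")
print(first_speech.findtext("speaker"), "->", persona_for(first_speech).text)
```

With the reference in place, the speaker text is free to differ from the persona text.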


<speaker>ALL</speaker> is a problem. It may refer to all three witches, or all apparitions, or sometimes maybe all six witches and apparitions. Worse: the difference is probably not even important. What's important is that lots of scary things are howling at Macbeth.

Apparitions not described in personae.

The specific descriptions of the three apparitions ("an armed Head", "A bloody Child" and "a Child crowned") are placed in stagedir elements, not in personae. William was a playwright, not a data analyst.

Changing speaker in the middle of the line.

In my paper version, you may see text like this:

Began a fresh assault.
   DUNCAN             Dismay'd not this
Our captains, Macbeth and Banquo?
   Sergeant                       Yes;

which has no representation in the XML source.


The 3 Witches speaking

The style sheet 3witches.xsl yields 3witches.html, which is a compilation of all lines spoken by the witches.
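This is not the 3witches.xsl style sheet itself, just a sketch of the same idea in Python. It assumes lowercase speech, speaker and line elements, and that a witch can be recognised by a speaker name ending in "Witch"; speeches by "ALL" are left out because of the ambiguity described above. The witch_lines function and the sample scene are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical scene mixing speakers, to show the filtering.
SCENE = """
<scene>
  <speech>
    <speaker>First Witch</speaker>
    <line>When shall we three meet again</line>
    <line>In thunder, lightning, or in rain?</line>
  </speech>
  <speech>
    <speaker>DUNCAN</speaker>
    <line>What bloody man is that? He can report,</line>
  </speech>
  <speech>
    <speaker>Second Witch</speaker>
    <line>When the hurlyburly's done,</line>
  </speech>
</scene>
"""


def witch_lines(root):
    """Collect (speaker, line) pairs for every speech whose speaker name ends in "Witch"."""
    result = []
    for speech in root.iter("speech"):
        speaker = speech.findtext("speaker", default="")
        if speaker.endswith("Witch"):
            for line in speech.findall("line"):
                result.append((speaker, line.text))
    return result


for speaker, text in witch_lines(ET.fromstring(SCENE)):
    print(speaker + ": " + text)
```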

Modifications to the XML

I have made some changes to the original XML, to make macbeth.xml more suitable to my purposes. I try not to modify Shakespeare's work, only its XML markup.