Encoders of text often find it useful to indicate that some aspects
of the encoded text are problematic or uncertain, and to indicate who is
responsible for various aspects of the markup of the electronic text.
These Guidelines provide three methods of recording uncertainty about the
text or its markup:
- the note element, defined in the core tag set, may be used with a
  value of certainty for its type attribute.
- the certainty element defined in this chapter may be used to record
  the nature and degree of the uncertainty in a more structured way.
- the alt element defined in the additional tag set for linking and
  segmentation may be used to provide alternative encodings for parts
  of a text, as described in the documentation of that tag set.
There are three methods of indicating responsibility for different
aspects of the electronic text:
- the TEI header records who is responsible for an electronic text
  by means of the respStmt element and other more specific elements
  (author, sponsor, funder, principal, etc.) used within the
  titleStmt, editionStmt, and revisionDesc elements.
- the note element may be used with a value of resp or
  responsibility in its type attribute.
- the respons element defined in this chapter may be used to record
  fine-grained structured information about responsibility for
  individual tags in the text.
To use the note and respStmt elements, no special steps
are needed, since they are defined in the core tag set and header
respectively. The alt element is only available when the
additional tag set for linking has been selected, as described in
the chapter on linking. To use the certainty and
respons elements, the additional tag set for certainty and
responsibility must be selected; this is done by defining the
parameter entity TEI.certainty with the value
INCLUDE, as shown in the example below:
<!DOCTYPE ... [
<!ENTITY % TEI.certainty 'INCLUDE'>
]>
...
Levels of Certainty
Many types of uncertainty may be distinguished. The
certainty element is designed to encode the following sorts:
- a given tag may or may not correctly apply (e.g. a given word may
  be a personal name, or perhaps not)
- the precise point at which an element begins or ends is uncertain
- the value to be given for an attribute is uncertain
- content supplied by the encoder (such as the expansion of an
  abbreviation marked by the abbr tag) is uncertain
- the transcription of a source text is uncertain, perhaps because it
  is hard to read or hard to hear; this sort of uncertainty is also
  handled by the unclear element
The following types of uncertainty are not indicated
with the certainty element:
- a number or date is imprecise
- the text is ambiguous, so a given passage has several possible
  interpretations
- a transcriber, editor, or author wishes to indicate a level of
  confidence in a factual assertion made in the text
- an author is not sure if the sentence she has chosen to start a
  paragraph is really the one she wants to retain in the final version
Precision of numbers and dates is discussed elsewhere in these
Guidelines; well-defined ambiguity is handled with alternations in
feature-structure values, described in the chapter on feature structures.
Uncertainty about the truth of assertions in the text and other sorts of
authorial and editorial uncertainty about whether the content is
satisfactory are not handled by the certainty element,
though they may be expressed using note.
Using Notes to Record Uncertainty
The simplest way of recording uncertainty about markup is to attach a
note to the element or location about which one is unsure. In the
following (invented) paragraph, for example, an encoder might be
uncertain whether to mark Essex as a place name or a personal
name, since both might be plausible in the given context:
Elizabeth went to Essex. She had always liked Essex.
Using note, the uncertainty here may be recorded quite simply:
Elizabeth went to Essex. She had always liked Essex.
It is not clear here whether Essex refers to the place or to the
nobleman. -MSM
Using the normal mechanisms, the note may be associated
unambiguously with specific elements of the text, thus:
Elizabeth went to Essex. She had always liked Essex.
It is not clear here whether Essex refers to the place or to the
nobleman. If the latter, it should be tagged as a personal name. -MSM
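The link between a note and the elements it concerns can be followed mechanically. The Python sketch below is illustrative only and is not the Guidelines' own example: the id values p1 and p2, the placeName element name, and the use of a target attribute on note to carry the link are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the ids "p1" and "p2", the placeName element, and the
# target attribute on note are assumptions made for this sketch.
SAMPLE = """
<p><persName>Elizabeth</persName> went to
<placeName id="p1">Essex</placeName>.
She had always liked <placeName id="p2">Essex</placeName>.
<note type="certainty" target="p1 p2">It is not clear here whether Essex
refers to the place or to the nobleman. -MSM</note></p>
"""


def note_targets(xml_text):
    """Return, for each note, the (id, element type, content) of its targets."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    return [
        [(ref, by_id[ref].tag, by_id[ref].text)
         for ref in note.get("target", "").split()]
        for note in root.iter("note")
    ]


links = note_targets(SAMPLE)
print(links)
```

The program can find which elements the note is about, but the doubt itself remains prose: nothing in the markup says what is uncertain or how strongly.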
The advantage of this technique is its relative simplicity. Its
disadvantage is that the nature and degree of uncertainty are not
conveyed in any systematic way and thus are not susceptible to any sort
of automatic processing.
Structured Indications of Uncertainty
To record uncertainty in a more structured way, susceptible of at
least simple automatic processing, the certainty element may be
used:
certainty: indicates the degree of certainty or uncertainty associated
with some aspect of the text markup. Attributes include:
- target: points at the elements whose markup is uncertain.
- locus: indicates the precise location of the uncertainty in the
  markup: applicability of the element, precise position of the
  start- or end-tag, value of a specific attribute, etc. Suggested
  values include:
  - the value given for a named attribute is uncertain (the value of
    locus is the attribute name).
  - the content of the element may not have been correctly supplied
    by the encoder, e.g. as in the cases of corr and abbr elements
    (#suppliedContent).
  - both the start-tag and the end-tag may not be correctly located.
  - the start-tag may not be correctly located.
  - the end-tag may not be correctly located.
  - the content of the element may not be a correct transcription of
    the source text (#transcribedContent).
  - it is uncertain whether the element used actually applies to the
    passage.
- given: indicates conditions assumed in the assignment of a degree
  of confidence.
- degree: indicates the degree of confidence assigned to the aspect
  of the markup named by the locus attribute.
- assertedValue: provides an alternative value for the aspect of the
  markup in question: an alternative generic identifier,
  transcription, or attribute value, or the ID of an anchor element
  (to indicate an alternative starting or ending location). If an
  assertedValue is given, the confidence level specified by degree
  applies to the alternative markup specified by assertedValue; if
  none is given, it applies to the markup in the text.
- a final attribute further describes the uncertainty in prose,
  perhaps indicating its nature, cause, or the justification for the
  degree of confidence asserted.
The certainty element may be used to record doubts about
the proper encoding of Essex in several ways of varying
precision. To record merely that we are not certain that Essex
is in fact a place name, as it is tagged, we use the target
attribute to identify the element in question, and the locus
attribute to indicate what aspect of the markup we are uncertain about
(here: whether we have used the correct element type):
Essex.
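As a rough illustration of the markup this paragraph describes, the following Python sketch embeds a small fragment and resolves the certainty element's target. It is a sketch under stated assumptions, not the Guidelines' own example: the id p1, the placeName element name, and the locus value "#gi" (used here to stand for "the element type") are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the id "p1" and the locus value "#gi" (standing for
# "the element type") are assumptions made for this sketch.
SAMPLE = """
<p>Elizabeth went to <placeName id="p1">Essex</placeName>.
<certainty target="p1" locus="#gi"/></p>
"""


def doubts(xml_text):
    """List (target id, element type, locus) for every certainty element."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    found = []
    for cert in root.iter("certainty"):
        for ref in cert.get("target").split():
            found.append((ref, by_id[ref].tag, cert.get("locus")))
    return found


report = doubts(SAMPLE)
print(report)
```

Because the doubt is carried by attributes rather than prose, a program can list every questioned element without reading any free text.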
Because it is linked to the location of the uncertainty by an IDREF, the
certainty element will typically be included in the same SGML
document as its target. It may be placed adjacent to the target
element, or elsewhere in the document.
To record the further information that we estimate, subjectively,
that there is a 60 percent chance of Essex being a place name here, we
can add a value for our degree of confidence (usually a
number between 0 and 1, representing the estimated probability):
Essex.
According to one expert, there is a 60 percent chance of
Essex being a place name here, and a 40 percent chance of its being a
personal name. We use two certainty elements to indicate the
two probabilities independently. Both elements indicate
the same location in the text, but the second provides an alternative
choice of generic identifier (here: persName), given as the
value of the assertedValue attribute:
Essex.
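The interplay of degree and assertedValue can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Guidelines' own example: the id p1, the placeName element name, and the locus value "#gi" are hypothetical, and the degrees are the 60 and 40 percent estimates from the text.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical markup: the id "p1" and the locus value "#gi" are assumptions.
SAMPLE = """
<p>Elizabeth went to <placeName id="p1">Essex</placeName>.
<certainty target="p1" locus="#gi" degree="0.6"/>
<certainty target="p1" locus="#gi" degree="0.4" assertedValue="persName"/></p>
"""


def element_type_estimates(xml_text, target_id):
    """Map each candidate element type for target_id to its stated degree."""
    root = ET.fromstring(xml_text)
    tagged_as = next(el.tag for el in root.iter() if el.get("id") == target_id)
    estimates = {}
    for cert in root.iter("certainty"):
        if target_id in cert.get("target").split() and cert.get("locus") == "#gi":
            # With assertedValue, the degree applies to the alternative;
            # without it, the degree applies to the markup in the text.
            candidate = cert.get("assertedValue", tagged_as)
            estimates[candidate] = Decimal(cert.get("degree"))
    return estimates


estimates = element_type_estimates(SAMPLE, "p1")
print(estimates)
```

Decimal is used instead of float so that the stated degrees are compared exactly as written.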
Finally, we may wish to make our probability estimates contingent
on some condition. In the passage Elizabeth went to Essex; she had
always liked Essex, for example, we may feel there is a 60 percent chance
that the county is meant, and a 40 percent chance that the earl is meant. But
the two occurrences of the word are not independent: there is (we may
feel) no chance at all that one occurrence refers to the county and one
to the earl. We can express this by using the given
attribute to list the SGML identifiers of certainty elements.
Essex.
She had always liked Essex.
When given conditions are listed, the certainty
element is interpreted as claiming a given degree of confidence in a
particular markup given the assertional content of the
certainty elements indicated; that is, if the markup
described in the indicated certainty elements is
correct.
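One way to read such conditional estimates is to multiply a degree by the degree of each condition it depends on. The Python sketch below does this for the two occurrences of Essex. It is illustrative only: the ids p1, p2, c1, and c2 and the locus value "#gi" are hypothetical, and treating multiple given conditions as independent of one another is a simplifying assumption of the sketch.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical markup: all id values and the locus value "#gi" are assumptions.
SAMPLE = """
<p>Elizabeth went to <placeName id="p1">Essex</placeName>.
She had always liked <placeName id="p2">Essex</placeName>.
<certainty id="c1" target="p1" locus="#gi" degree="0.6"/>
<certainty id="c2" target="p2" locus="#gi" given="c1" degree="1.0"/></p>
"""


def joint_degree(root, cert_id):
    """Multiply a certainty element's degree by that of each given condition."""
    certs = {c.get("id"): c for c in root.iter("certainty")}
    cert = certs[cert_id]
    result = Decimal(cert.get("degree"))
    # Assumption of this sketch: conditions are combined by multiplication.
    for condition in cert.get("given", "").split():
        result *= joint_degree(root, condition)
    return result


root = ET.fromstring(SAMPLE)
both_places = joint_degree(root, "c2")
print(both_places)
```

Here the second occurrence is certain to be a place name if the first is, so the chance that both are place names equals the chance that the first is.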
Conditional confidence may be less than 100 percent: given the sentence
Ernest went to old Saybrook, we may interpret Saybrook as
a personal name or a place name, assigning a 60 percent probability to the
former. If it is a place name, there may be a 50 percent chance that the
place name actually in question is Old Saybrook rather than
Saybrook, while if it is correctly tagged as a personal name, it
is much more likely (say, 90 percent certain) that the name is Saybrook.
This state of affairs can be expressed using the
certainty element thus:
Ernest went to old Saybrook.
In this case, the assertedValue on certainty element
c3 is an IDREF to an anchor element at the alternate starting
point for the element.
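A possible shape for this markup is sketched below in Python, which also resolves the anchor named by c3. It is a hedged illustration, not the Guidelines' own example: the ids a1, p3, c1, and c2, the locus values "#gi" and "#startLoc", and the exact arrangement of the four certainty elements are assumptions. Only the id c3 and its reference to an anchor come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: ids other than "c3" and both locus values are assumptions.
SAMPLE = """
<p>Ernest went to <anchor id="a1"/>old
<persName id="p3">Saybrook</persName>.
<certainty id="c1" target="p3" locus="#gi" degree="0.6"/>
<certainty target="p3" locus="#startLoc" given="c1" degree="0.9"/>
<certainty id="c2" target="p3" locus="#gi" assertedValue="placeName"
           degree="0.4"/>
<certainty id="c3" target="p3" locus="#startLoc" given="c2"
           assertedValue="a1" degree="0.5"/></p>
"""


def alternate_start(xml_text, cert_id):
    """Resolve a certainty element's assertedValue to the anchor it names."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    anchor = by_id[by_id[cert_id].get("assertedValue")]
    # The text right after the empty anchor is where the alternative
    # element would begin.
    return anchor.tag, (anchor.tail or "").strip()


start = alternate_start(SAMPLE, "c3")
print(start)
```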
Multiplying the numeric values out, this markup may be interpreted as
assigning specific probabilities to three different ways of
marking up the sentence:
Ernest went to old <persName>Saybrook</persName>. (0.6 * 0.9, or 0.54)
Ernest went to old <placeName>Saybrook</placeName>. (0.4 * 0.5, or 0.20)
Ernest went to <placeName>old Saybrook</placeName>. (0.4 * 0.5, or 0.20)
The probabilities do not add up to 1.00 because the markup indicates
that if Saybrook is (part of) a personal name, there is a
10 percent likelihood that the element should start somewhere other than the
place indicated, without however giving an alternative location; there
is thus a 6 percent chance (0.1 * 0.6) that none of the alternatives given is
correct.
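The arithmetic can be checked with a few lines of Python. The variable names are invented for this sketch; the figures are those given in the text.

```python
from decimal import Decimal

# Figures from the text; the variable names are illustrative only.
p_person = Decimal("0.6")                    # Saybrook is a personal name
p_place = 1 - p_person                       # otherwise a place name
p_start_ok_given_person = Decimal("0.9")     # element starts where tagged
p_old_saybrook_given_place = Decimal("0.5")  # name is "old Saybrook"

alternatives = {
    "persName: Saybrook": p_person * p_start_ok_given_person,
    "placeName: Saybrook": p_place * (1 - p_old_saybrook_given_place),
    "placeName: old Saybrook": p_place * p_old_saybrook_given_place,
}

# Probability mass not covered by any of the three listed markups.
unaccounted = 1 - sum(alternatives.values())
print(alternatives, unaccounted)
```

The remainder equals the 10 percent chance of a different starting point multiplied by the 60 percent chance of a personal name, which is the 6 percent mentioned above.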
If an attribute value is uncertain, the locus attribute
takes as its value the name of the attribute in question. In this
example, there is only a 50 percent chance that the question was spoken by
participant A:
Have you heard the election results?
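A hedged Python sketch of this case follows. The u element, its who attribute, the id u1, and the enclosing div are assumptions made for illustration. The sketch also assumes that a locus value not beginning with "#" is an attribute name, which matches the usage described in the text.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical markup: the u element, its who attribute, and the id "u1"
# are assumptions made for this sketch.
SAMPLE = """
<div><u id="u1" who="A">Have you heard the election results?</u>
<certainty target="u1" locus="who" degree="0.5"/></div>
"""


def uncertain_attributes(xml_text):
    """List (target id, attribute name, current value, degree) tuples."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    found = []
    for cert in root.iter("certainty"):
        locus = cert.get("locus")
        if locus.startswith("#"):
            continue  # values beginning with "#" name other aspects of markup
        for ref in cert.get("target").split():
            value = by_id[ref].get(locus)
            found.append((ref, locus, value, Decimal(cert.get("degree"))))
    return found


attrs = uncertain_attributes(SAMPLE)
print(attrs)
```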
Doubts about whether the transcription is correct may be expressed
by assigning to locus the value
#transcribedContent. For example, if the source is
hard to read and so the transcription is uncertain:
gub.
Degrees of confidence in the proper expansion of abbreviations may
also be expressed, by using the value #suppliedContent:
Standard
Generalized Markup Language ...
The assertedValue attribute should be used to provide an
alternative value for whatever aspect of the markup is in doubt: an
alternative generic identifier, or the ID of an alternative starting or
ending point, as already shown, an alternative attribute value, or
alternative element content, as in this example:
gub.
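The following Python sketch shows how an alternative transcription supplied through assertedValue might be read back. Everything except the transcribed word gub and the locus value #transcribedContent is hypothetical: the seg element, the id s1, the alternative reading "gun", and the degree of 0.8 are invented for illustration.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical markup: the seg element, the id "s1", the alternative reading
# "gun", and the degree 0.8 are assumptions made for this sketch.
SAMPLE = """
<p><seg id="s1">gub</seg>.
<certainty target="s1" locus="#transcribedContent"
           assertedValue="gun" degree="0.8"/></p>
"""


def alternative_readings(xml_text):
    """List (transcribed text, alternative reading, degree of alternative)."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    found = []
    for cert in root.iter("certainty"):
        if cert.get("locus") != "#transcribedContent":
            continue
        for ref in cert.get("target").split():
            # When assertedValue is present, degree applies to the alternative.
            found.append((by_id[ref].text, cert.get("assertedValue"),
                          Decimal(cert.get("degree"))))
    return found


readings = alternative_readings(SAMPLE)
print(readings)
```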
Since attribute values have no internal substructure, the
assertedValue attribute is useful for specifying alternative
transcriptions only in relatively restricted circumstances
(specifically, when the alternate reading has no elements nested within
it). More robust methods of handling uncertainties of transcription are
the unclear element and the app and rdg
elements described elsewhere in these Guidelines.
The
certainty element allows for indications of uncertainty to
be structured with at least as much detail and clarity as appears to be
currently required in most ongoing text projects.
It is expected that in the future more adequate systems for expressing
uncertainty will be developed. These may extend the certainty
element or they may make use of the feature-structure encoding
mechanisms described in the chapter on feature structures.
The certainty element and the other TEI mechanisms for
indicating uncertainty provide a range of methods of graduated
complexity. Simple expressions of uncertainty may be made by using the
note element. This is simple and convenient, and can
accommodate either discursive unstructured indication of uncertainty, or
complex structured project-specific expressions of uncertainty. In
general, however, unless special steps are taken, the note
element does not provide as much expressive power as the
certainty element, and in cases where highly structured
certainty information must be given, it is recommended that the
certainty element be used.
The certainty element may be used for simple unqualified
indications of uncertainty, in which case only the locus
and target might be specified. In more complex cases, the
other attributes may be used to provide fuller information. While
these attributes may take any string of characters as value, the
recommended values should be used wherever possible; if they are not
appropriate in a given situation, encoders should provide their own
controlled vocabulary and document it in the encodingDesc or
tagUsage elements of the TEI header.
The certainty element has the following formal declaration:
Attribution of Responsibility
In general, attribution of responsibility for the transcription and
markup of an electronic text is made by respStmt elements
within the header: specifically, within the title statement, the
edition statement(s), and the revision history.
In some cases, however, more detailed element-by-element information
may be desired, in order to distinguish, for example, between the
individuals responsible for transcribing the content and those
responsible for determining that a given word or phrase constitutes a
proper noun. Where such fine-grained attribution of responsibility is
required, the respons element may be used:
respons: identifies the individual(s) responsible for some aspect of
the markup of some particular element(s). Attributes include:
- target: gives the SGML identifier(s) of the element(s) for which
  some aspect of the responsibility is being assigned.
- locus: indicates the specific aspect of the markup for which
  responsibility is being assigned. Suggested values include:
  - responsibility for the claim that a named attribute has the value
    given in the markup (the value of locus is the attribute name)
  - responsibility for the contents supplied by the encoder
    (corrections, expansions of abbreviations, etc.)
  - responsibility for the claim that the element begins and ends
    where indicated
  - responsibility for the claim that the element begins where
    indicated
  - responsibility for the claim that the element ends where indicated
  - responsibility for the transcription of the element content
  - responsibility for the claim that the element is of the type
    indicated by the markup
- a further attribute identifies the individual or agency responsible
  for the indicated aspect of the electronic text.
- a final attribute gives a brief prose note supplying any additional
  information which should be recorded.
This element allows one or more aspects of the markup to be
attributed to a given individual. The target and
locus attributes function as they do on the
certainty element described above:
the target attribute points at a particular SGML element (or
set of elements), and locus indicates the particular aspect
of the encoding of those elements for which responsibility is to be
assigned. The suggested values may be combined as appropriate: to
indicate, for example, that RC is responsible for transcribing an
illegible word, and that AR is responsible for identifying that word
as a proper noun, the text might be encoded thus:
Saybrook.
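A hedged Python sketch of such an encoding, and of how the attributions might be read back, is given below. The attribute name resp, the id p4, the persName tagging of Saybrook, and the locus value "#gi" are assumptions made for illustration. The initials RC and AR and the locus value #transcribedContent come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the resp attribute name, the id "p4", and the locus
# value "#gi" are assumptions made for this sketch.
SAMPLE = """
<p><persName id="p4">Saybrook</persName>.
<respons target="p4" locus="#transcribedContent" resp="RC"/>
<respons target="p4" locus="#gi" resp="AR"/></p>
"""


def responsibility(xml_text, target_id):
    """Map each aspect of target_id's markup to whoever is responsible."""
    root = ET.fromstring(xml_text)
    return {
        r.get("locus"): r.get("resp")
        for r in root.iter("respons")
        if target_id in r.get("target").split()
    }


who = responsibility(SAMPLE, "p4")
print(who)
```

Using one respons element per aspect keeps each attribution separate, so the transcriber and the person who identified the name can be reported independently.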
Some elements bear specialized resp or agent
attributes, whose specific meaning varies from element to
element; the respons element should be reserved for the general
aspects of responsibility common to all text transcription and SGML
markup, and should not be confused with the more specific attributes on
individual elements.
The formal declaration of the respons element is this: