Reference Guide for the British National Corpus (XML Edition)
edited by Lou Burnard
Published for the British National Corpus Consortium by the Research Technologies Service at Oxford University Computing Services
February 2007
Contents
Introduction
Overview
Acknowledgments
BNC 1.0
BNC World
BNC XML
1 Design of the corpus
1.1 Purpose
1.2 General definitions
1.3 Composition
1.4 Design of the written component
1.4.1 Sampling basis: production and reception
1.4.2 Selection features
1.4.3 Descriptive features
1.4.4 Selection procedures employed
1.5 Design of the spoken component
1.5.1 The demographically sampled part of the corpus
1.5.2 The context-governed part of the corpus
2 Basic structure
2.1 Markup conventions
2.2 An example
2.3 Corpus and text elements
2.4 Segments and words
2.5 Editorial indications
3 Written texts
3.1 Divisions of written texts
3.2 Paragraph-level elements and chunks
3.2.1 Headings and captions
3.2.2 Quotations
3.2.3 Spoken paragraphs
3.2.4 Poetry
3.2.5 Lists
3.2.6 Notes and citations
3.2.7 Bibliographic references
3.3 Phrase-level elements
3.3.1 Page breaks
3.3.2 Highlighted phrases
4 Spoken texts
4.1 Basic structure: spoken texts
4.2 Utterances
4.3 Paralinguistic phenomena
4.4 Alignment of overlapping speech
5 The header
5.1 The file description
5.1.1 The title statement
5.1.2 The edition statement
5.1.3 The extent statement
5.1.4 The publication statement
5.1.5 The source description
5.2 The encoding description
5.2.1 Documentary components of the encoding description
5.2.2 The tagging declaration
5.2.3 The reference and classification declarations
5.2.4 The Xaira Specification
5.3 The profile description
5.3.1 The creation element
5.3.2 The <langUsage> element
5.3.3 The participant description
5.3.4 The setting description
5.3.5 Text classification
5.4 The revision description
6 Wordclass Tagging in BNC XML
6.1 Introduction
6.2 Tokenization: splitting the text into words
6.3 Tagging Guidelines and Borderline Cases
6.4 Ambiguity tags, and the principle of asymmetry
6.5 Guidelines to the Wordclass Tagging
6.5.1 Preliminaries
6.5.2 Introduction to Word Classes
6.5.3 Adverbs
6.5.4 Articles, determiners & pronouns
6.5.5 Prepositions and prepositional adverbs
6.5.6 Conjunctions
6.5.7 Numerals
6.5.8 Miscellaneous other tags
6.5.9 Disambiguation Guide
6.5.10 Features of spoken corpus tagging
6.6 POS-tagging Error Rates
6.6.1 Levels of estimation
6.6.2 Presentation of Ambiguity Rates and Error Rates (fine-grained mode of calculation)
6.6.3 A further mode of calculation: ignoring subcategories of the same part of speech
6.6.4 POS-Tagging Workflow
7 Software for the BNC
7.1 Why XML?
7.2 The BNC delivery format
7.3 XML components
8 References
9 Miscellaneous tables
9.1 XML tag usage by text type
9.2 Voice quality codes
9.3 Gap descriptions
9.4 Event descriptions
9.5 Speaker relationships
9.6 Text and genre classification codes
9.7 Contracted forms and multiwords
9.7.1 Contracted forms
9.7.2 Multiwords
9.8 Simplified Wordclass Tags
10 List of Sources
11 The Xaira Specification
11.1 Element specification
11.2 Key specification
11.3 Lemma Scheme Specification
11.4 Region Specification
11.5 Reference specification
11.6 Indexing Policies
11.6.1 Index policies NONE and MARKUP
11.6.2 Index policies JOINFROM and JOINTO
11.6.3 Index policy taxonomy
11.7 Language specification
12 Formal Specification of the BNC XML schema