ConcGram List Builder (1) ©
You may know the terms 'n-gram' or 'cluster', which refer to adjacent words that constitute a pattern of use and recur in a corpus. N-grams come in the form of bi-grams, tri-grams, and so on, the name indicating the number of words in the pattern. The terms 'skipgram' or 'phrase frame' describe non-contiguous word associations of limited membership that occur in a fixed pattern of use, for example "a lot of people" in instances such as "a lot of business people" or "a lot of different types of people". The term 'skipgram' also includes all contiguous patterns of co-occurrence and so subsumes n-grams. All these searches require that the words are in the same order (Cheng, Greaves and Warren, 2006).
But there are many word associations which do not occur in one fixed grammatical pattern. The relationships between verbs and adverbs, verbs and nouns, nouns and adverbs, noun phrase constituents, and quantifiers and nouns, to name but a few, are flexible and may occur in non-fixed patterns. For instance, most adjectives can be used both attributively and predicatively. The bi-gram “challenging exercise” would show in an n-gram search, but when the adjective is used predicatively, as in “the exercise turned out to be quite challenging”, it would not. The positions of “challenging” relative to “exercise” in these two cases are -1 and +6 respectively. In both cases, however, the word association is significant.
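The positions quoted above can be checked with a minimal sketch. The function name `relative_positions` and the whitespace tokenisation are assumptions made for illustration; they are not part of ConcGram.

```python
def relative_positions(tokens, node, collocate):
    """Return the offset of every `collocate` relative to every `node`.

    A negative offset means the collocate comes before the node,
    a positive offset means it comes after.
    """
    node_indexes = [i for i, tok in enumerate(tokens) if tok == node]
    collocate_indexes = [i for i, tok in enumerate(tokens) if tok == collocate]
    return [c - n for n in node_indexes for c in collocate_indexes]


attributive = "challenging exercise".split()
predicative = "the exercise turned out to be quite challenging".split()

print(relative_positions(attributive, "exercise", "challenging"))  # [-1]
print(relative_positions(predicative, "exercise", "challenging"))  # [6]
```

A bi-gram search only ever inspects offsets of -1 or +1, which is why the predicative use is missed.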
This means that many instances of patterns of co-occurrence that typically occur in non-contiguous sequences may not be discovered. User-nominated searches for associated words are also limited by the requirement that the user must enter (and therefore know) the items for the search to take place. The automated concgram search provided by ConcGram is able to reveal all word associations in a corpus, both contiguous and non-contiguous, with both positional (AB, BA) and constituent (ACB) variation, and, since it is automated, the user does not have to first enter one or more search items.
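The idea of an automated search that tolerates both positional and constituent variation can be sketched as follows. This is only an illustration of the principle, not ConcGram's own algorithm: the function name `two_word_concgrams`, the default span and the sample sentence are all assumptions.

```python
from collections import Counter


def two_word_concgrams(tokens, span=3):
    """Count unordered word pairs that co-occur within `span` words.

    Sorting each pair merges AB and BA (positional variation), and a
    span greater than 1 allows intervening words (constituent variation).
    """
    counts = Counter()
    for i, origin in enumerate(tokens):
        # Look only to the right so each co-occurrence is counted once.
        for other in tokens[i + 1 : i + 1 + span]:
            if other != origin:
                counts[tuple(sorted((origin, other)))] += 1
    return counts


tokens = "hard work pays and work is hard".split()
counts = two_word_concgrams(tokens, span=3)
print(counts[("hard", "work")])  # 2
```

Here "hard work" (AB) and "work is hard" (B, an intervening word, then A) are counted as two instances of the same pair, and no search word had to be entered in advance.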
An example of how a concgram search can reveal all the potential patterns of associated words is given below, showing both positional and constituent variation.
Figure 1: A concgram search for "Asia / world / city" (3-word concgram)
Figure 1 illustrates a 3-word concgram from the search results which had "Asia / city / world" as the 3 associated words. On lines 3 - 7 we find the pattern "world city of Asia" and from line 8 this becomes "Asia / Asia's world city". A tri-gram cluster search would not have found the first pattern because the words are not in sequence, and it would not have presented all of the patterns together as possible combinations of these three associated words. The concgram search also works when there are common features of spoken language such as repetition, pauses or fillers (e.g. "er", "um"). On line 2 we can see that the concgram configuration is still listed when the speaker says "world city of of of Asia" (Cheng, Greaves and Warren, 2006).
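A rough sketch of why repetition does not hide the configuration: if the search only asks whether all three words fall within a span of the centred word, intervening words are simply skipped. The function `concgram_configurations`, the span of 5 and the simplified tokenisation (treating "'s" as a separate token) are assumptions for illustration, not a description of ConcGram's internals.

```python
def concgram_configurations(tokens, words, span=5):
    """List the word order of each instance of a concgram.

    `words[0]` is treated as the centred word; an instance is reported
    when every other word occurs within `span` tokens of it.
    """
    tokens = [tok.lower() for tok in tokens]
    words = [w.lower() for w in words]
    centred, others = words[0], set(words[1:])
    found = []
    for i, tok in enumerate(tokens):
        if tok != centred:
            continue
        window = tokens[max(0, i - span) : i + span + 1]
        if others <= set(window):
            # Keep only the concgram words, in order of first appearance.
            order = []
            for w in window:
                if w in words and w not in order:
                    order.append(w)
            found.append(" / ".join(order))
    return found


query = ["Asia", "world", "city"]
print(concgram_configurations("world city of Asia".split(), query))
# ['world / city / asia']
print(concgram_configurations("world city of of of Asia".split(), query))
# ['world / city / asia']
print(concgram_configurations("Asia 's world city".split(), query))
# ['asia / world / city']
```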
Although the concgram search will find all of the word associations within a given span, the search will also list co-occurrences that may not prove to be related when examined in context. For this reason ConcGram also provides statistical tests that can provide an indication as to which word associations are likely to prove to be significant and which ones the user can reasonably afford to ignore.
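This page does not say which statistical tests are provided, so the following is only a sketch of one association measure widely used in corpus linguistics, the mutual information (MI) score, and should not be read as ConcGram's own calculation. The function name and all of the frequencies are invented for illustration.

```python
import math


def mutual_information(pair_freq, freq_a, freq_b, corpus_size):
    """MI score: log2 of observed over expected co-occurrence frequency.

    Expected frequency assumes the two words are independent:
    freq_a * freq_b / corpus_size.
    """
    expected = freq_a * freq_b / corpus_size
    return math.log2(pair_freq / expected)


# Hypothetical counts: the pair occurs 8 times, the words 16 and 32 times,
# in a corpus of 1024 words.
print(mutual_information(8, 16, 32, 1024))  # 4.0

# A pair that co-occurs exactly as often as chance predicts scores 0.
print(mutual_information(1, 32, 32, 1024))  # 0.0
```

A higher score suggests the pair co-occurs more often than chance would predict, which is the kind of indication that helps separate likely associations from incidental ones.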
By creating automated concgram lists, ConcGram can be used to identify all word associations that may occur in a corpus within a certain span, and this span can be tailored to suit the needs of the user. The searches can create 2-, 3-, 4- and 5-word concgrams, but if you are using the automated search you must start with a 2-word concgram list. From this initial 2-word list you can go on to build a 3-word concgram list, then a 4-word list and finally a 5-word concgram list, all derived from fully automated searches.
John Sinclair has often spoken of "collocational frameworks" or "grammatical frames", and these illustrate another kind of concgram. The terms refer to the fact that common grammatical words such as 'A' and 'OF' often combine with each other to form 'collocational frameworks' (Renouf and Sinclair, 1991). For example, the most common configuration pattern for ‘THE / TO’ is:
TO + INF VERB + THE + NOUN GROUP
e.g. to read the letter, to make the decision, to have the opportunity, to distinguish between the concept
When they form such collocational frameworks, THE / TO enter into a collocational relationship with the words which they frame. A collocational framework can thus be described as a set of co-occurring grammatical words which provide paradigmatic slots into which similar words can be fitted, and which form a collocational relationship with the words framed.
Many groups of words which recur in a corpus and are sometimes listed as individual ‘clusters’ are formed from collocational frameworks. For instance, ‘IN TERMS OF’, ‘IN CASE OF’, ‘IN LIEU OF’ etc are all instances of the collocational framework ‘IN * OF’ where the * represents a slot into which may be put one or more words. The lexical items ‘terms’, ‘case’, ‘lieu’ etc are all members of the paradigmatic set from which they have been selected.
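The ‘IN * OF’ framework can be illustrated with a short sketch that collects whatever fills the slot. The function name `framework_fillers`, the limit of two words per slot and the sample text are assumptions for illustration only.

```python
from collections import Counter


def framework_fillers(tokens, first, last, max_slot=2):
    """Count the words that fill the slot in a `first * last` framework.

    The slot may hold between 1 and `max_slot` words; the shortest
    match is taken for each occurrence of `first`.
    """
    fillers = Counter()
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        for width in range(1, max_slot + 1):
            j = i + width + 1  # index where `last` would have to appear
            if j < len(tokens) and tokens[j] == last:
                fillers[" ".join(tokens[i + 1 : j])] += 1
                break
    return fillers


text = "in terms of cost and in case of fire or in lieu of payment in terms of time"
fillers = framework_fillers(text.split(), "in", "of")
print(fillers.most_common())
# [('terms', 2), ('case', 1), ('lieu', 1)]
```

The keys of the resulting counter are the members of the paradigmatic set for that framework in the sample text.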
It is important to understand the difference between "origins" and "associated words". "Origins" may be single, double, triple or quadruple origins, and these terms describe the process of creating automated 2, 3, 4 and 5-word concgram lists. Associated words are found as a result of the origin searches. Origins are indicated in the search progress dialog box as words separated by slashes, while associated words are enclosed in parentheses, as in Figure 2, which shows the search string resulting from an origin 'Kong / been' together with an associated word 'plus'. Together they make a 3-word concgram.
Figure 2: Origins and associated words
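The relationship between an origin and its associated words can be sketched as follows: a double origin such as 'Kong / been' is located first, and every other word found within the span becomes an associated word, giving a 3-word concgram. The function `extend_concgrams`, the span of 3 and the sample sentence are hypothetical; this is not ConcGram's actual procedure.

```python
from collections import Counter


def extend_concgrams(tokens, origins, span=3):
    """Count (origin, associated word) combinations.

    Each origin is a tuple of words. Wherever all of its words occur
    within `span` tokens of its first word, every other word in that
    window is counted as an associated word.
    """
    counts = Counter()
    for origin in origins:
        for i, tok in enumerate(tokens):
            if tok != origin[0]:
                continue
            window = tokens[max(0, i - span) : i + span + 1]
            if not set(origin) <= set(window):
                continue
            for associated in set(window) - set(origin):
                counts[(origin, associated)] += 1
    return counts


tokens = "hong kong has been a world city and hong kong has been busy".split()
counts = extend_concgrams(tokens, [("kong", "been")], span=3)
print(counts[(("kong", "been"), "has")])   # 2
print(counts[(("kong", "been"), "busy")])  # 1
```

Feeding the resulting 3-word combinations back in as triple origins would, on the same assumptions, yield 4-word concgrams, and so on up to 5 words.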
The first word in the origin is always centred, but any word in the concgram can be centred by highlighting the word and pressing the "Switch Centred Word" button shown in Figure 3.
As well as performing automated searches, which will be described in more detail later, the user can specify the words to be searched for in the Concgram Search Dialog Box. This allows up to 5 words to be entered, and is most easily reached from the toolbar. The toolbar features 3 buttons which initiate searches. They look similar except for the colors used.
Figure 3: the Search Buttons on the Toolbar
The left button is for single word searches, the centre button for concgram searches, and the right button is for switching the centred word by performing a new search, as shown in Figures 5 and 6. Pressing the centre button to perform a concgram search produces the Concgram Search Dialog Box as shown in Figure 4:
Figure 4: the Concgram Search Dialog Box
For user-specified concgrams, the words of the concgram are first sorted alphabetically, so that by default the word which is earliest in the alphabet is centred first. The resulting concgram search list will thus look like Figure 5:
Figure 5: the initial concgram search
The resulting concgram is 'a / can / you' with 'a' centred, as it is the earliest word alphabetically in the group. Figure 5 also shows the use of light blue to show words which are repeated in the concgram. To make another word in the concgram the centred word, simply select the word (the easiest way is to double click the left mouse button over the word) and then select the 'Switch Centred Word' button from the toolbar as shown in Figure 6:
Figure 6: changing the centred word
Selecting the indicated button will result in a new search centred on 'can' as in Figure 7:
Figure 7: the same concgram with 'can' centred
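The default centring rule and the effect of switching the centred word can be imitated in a few lines. Both function names, the column width and the sample line are assumptions for illustration; ConcGram's own display logic is not described on this page.

```python
def default_centred_word(words):
    """Return the word that is earliest in the alphabet (case-insensitive)."""
    return sorted(words, key=str.lower)[0]


def centred_line(tokens, centred, width=10):
    """Format one concordance line with `centred` in a fixed column."""
    i = tokens.index(centred)  # first occurrence of the centred word
    left = " ".join(tokens[:i])
    right = " ".join(tokens[i + 1 :])
    return f"{left[-width:]:>{width}}  {centred}  {right[:width]}"


concgram = ["you", "can", "a"]
print(default_centred_word(concgram))  # a

line = "you can have a look".split()
print(centred_line(line, default_centred_word(concgram)))
print(centred_line(line, "can"))  # same line, re-centred on 'can'
```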
REFERENCES
Cheng, W., Greaves, C. and Warren, M. (2006) ‘From n-gram to skipgram to concgram’, in International Journal of Corpus Linguistics.
Renouf, A. J. and Sinclair, J. M. (1991) ‘Collocational Frameworks in English’, in Aijmer and Altenberg (eds) English Corpus Linguistics, pp 128-43.