Concgram List Builder (7) ©


Statistical techniques for determining significance 

Statistical tests are run to estimate the significance of word associations in context. While the fully automated concgram search will find all of the contiguous and non-contiguous collocations that constitute 2-word, 3-word, 4-word and 5-word concgrams, including both constituent (AB, ACB) and positional (AB, BA) variations, the search will also list word co-occurrences that may not prove to be meaningfully associated when examined in context. For this reason, ConcGram also runs statistical tests to generate t-scores and MI (Mutual Information) values. These values are used to set statistically motivated cut-offs for concgram lists and to give the user an indication of which word associations are more likely to prove meaningful and which ones the user can reasonably afford to ignore.

These tests are only available for 2-word concgrams, as the formulas for calculating both t-score and MI values only measure the association between 2 words. The formulas used for calculating both t-scores and MI values, and the cut-offs suggested, are those given by Geoff Barnbrook (Language and Computers: A Practical Introduction to the Computer Analysis of Language). Two steps are necessary before the test results can be listed:

  1. You must first create an ordinary 2-word concgram list and save it.

  2. A list of all the unique words for the corpus you are using must be created and saved.

To create the list of unique words for the corpus, use the Unique Words Dialog (STATISTICS >> UNIQUE WORDS):

Figure 30: the saved list of unique words

The list gives the total number of instances for each word as well as the total number of words in the file, both of which figures are needed for the calculations.
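As a rough illustration of why both figures are needed, the sketch below computes a t-score and an MI value for one word pair from a word-frequency table and the corpus size. It uses a common textbook formulation (expected frequency E = f(a) x f(b) x span / N, t = (O - E) / sqrt(O), MI = log2(O / E)); whether ConcGram uses exactly this form, and which span it uses, are assumptions here. All function names and the sample text are hypothetical and are not part of ConcGram.

```python
import math
from collections import Counter


def unique_word_counts(text):
    """Return (per-word counts, total number of words) for a text."""
    words = text.lower().split()
    return Counter(words), len(words)


def expected_cooccurrences(freq_a, freq_b, total_words, span):
    """Co-occurrences expected by chance if the two words were independent."""
    return freq_a * freq_b * span / total_words


def t_score(observed, expected):
    """t-score in the common (O - E) / sqrt(O) form (assumed formulation)."""
    return (observed - expected) / math.sqrt(observed)


def mutual_information(observed, expected):
    """MI in the common log2(O / E) form (assumed formulation)."""
    return math.log2(observed / expected)


# Toy "corpus" invented for illustration only.
counts, total = unique_word_counts("the cat sat on the mat and the cat slept")

observed = 2  # "the cat" occurs twice in the toy text
expected = expected_cooccurrences(counts["the"], counts["cat"], total, span=1)

t_value = t_score(observed, expected)
mi_value = mutual_information(observed, expected)
print(f"t-score = {t_value:.3f}, MI = {mi_value:.3f}")
```

The per-word counts supply f(a) and f(b), and the total word count supplies N; without the saved unique words list neither the expected frequency nor the two scores could be derived.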

The list with t-scores and MI values can then be created by selecting the corresponding option from the Concgrams Menu; the figures for both t-score and MI are listed next to each concgram.

Figure 31: the Concgram Menu option to create a list with t-scores and MI values

After selecting this menu item and selecting the saved 2-word concgram list to operate on, the user will be prompted to select from the options in the T-scores List Preferences Dialog:

Figure 32: the T-scores List Preferences Dialog

You can choose either t-score or MI cut-offs, use both or have no cut-off at all.  Using cut-offs greatly reduces the size of the resulting lists.  After selecting these options and the list for unique words to use for the figures, a search for all the 2-word concgrams listed will be performed and the t-score and MI values for each will be calculated. Finally the T-scores List Dialog will be displayed as in Figure 33 below:

Figure 33: the T-scores List Dialog showing t-score and MI values listed for each 2-word concgram
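The four cut-off choices described above (t-score only, MI only, both, or none) can be pictured as a simple filter over the scored list. The sketch below is only an illustration: the function name, the sample rows and the threshold values 2.0 and 3.0 are assumptions rather than values taken from ConcGram, and it assumes that choosing both cut-offs means a concgram must pass both.

```python
def apply_cutoffs(rows, t_cutoff=None, mi_cutoff=None):
    """Keep rows that meet every cut-off that is set; None disables a cut-off."""
    kept = []
    for row in rows:
        if t_cutoff is not None and row["t"] < t_cutoff:
            continue
        if mi_cutoff is not None and row["mi"] < mi_cutoff:
            continue
        kept.append(row)
    return kept


# Invented scores for four placeholder concgrams.
rows = [
    {"pair": "A/B", "t": 5.1, "mi": 1.2},
    {"pair": "C/D", "t": 1.4, "mi": 7.8},
    {"pair": "E/F", "t": 3.0, "mi": 4.5},
    {"pair": "G/H", "t": 0.9, "mi": 0.4},
]

t_only = apply_cutoffs(rows, t_cutoff=2.0)
mi_only = apply_cutoffs(rows, mi_cutoff=3.0)
both = apply_cutoffs(rows, t_cutoff=2.0, mi_cutoff=3.0)
no_cutoff = apply_cutoffs(rows)

for label, result in [("t only", t_only), ("MI only", mi_only),
                      ("both", both), ("none", no_cutoff)]:
    print(label, [row["pair"] for row in result])
```

Each cut-off removes a different part of the list, which is why the resulting lists shrink most when both are applied.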

This is not the place to discuss the usefulness of these tests or the differences in the results they give; they are simply provided as options for the user if required. Figure 33 shows a display which has been sorted by t-score. A sort by MI would produce a very different ordering:

Figure 34: the T-scores List Dialog sorted by MI values

Figure 34 shows the top of the list produced by sorting according to MI values. No grammatical words are included, only lexical items. If you want to exclude grammatical words from the list, an MI cut-off will prove most useful. If, however, you want to study grammatical items such as 'can' or 'you', a t-score cut-off is the better choice.
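The contrast between the two sort orders can be reproduced with a small sketch. The frequencies below are invented purely for illustration (they do not come from any corpus or from ConcGram), the formulas are the common (O - E) / sqrt(O) and log2(O / E) forms with a span of 1, and the function name is hypothetical.

```python
import math

TOTAL_WORDS = 1_000_000  # invented corpus size


def score(freq_a, freq_b, observed, total=TOTAL_WORDS, span=1):
    """Return (t-score, MI) for a pair, using assumed textbook formulas."""
    expected = freq_a * freq_b * span / total
    t = (observed - expected) / math.sqrt(observed)
    mi = math.log2(observed / expected)
    return t, mi


# (pair, frequency of word 1, frequency of word 2, observed co-occurrences)
# Two frequent grammatical pairs and one rare lexical pair, all invented.
pairs = [
    ("of/the", 30_000, 60_000, 9_000),
    ("you/can", 20_000, 10_000, 800),
    ("sodium/chloride", 30, 25, 20),
]

scored = [(name, *score(fa, fb, obs)) for name, fa, fb, obs in pairs]

by_t = [name for name, t, mi in sorted(scored, key=lambda r: r[1], reverse=True)]
by_mi = [name for name, t, mi in sorted(scored, key=lambda r: r[2], reverse=True)]

print("sorted by t-score:", by_t)
print("sorted by MI:     ", by_mi)
```

With these invented figures the frequent grammatical pairs head the t-score ranking, while the rare lexical pair heads the MI ranking and the grammatical pairs fall to the bottom, which is the pattern described for Figures 33 and 34.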