Word Frequency Text Profiler

Word frequency text profiling can be used in many ways to support teaching, learning and research. The Word Profiler compares all the words in a text with two word frequency lists and provides a visual profile of how these words are distributed in the text by printing each frequency band in a different colour.
Words which are contained in the first list of most frequent words are left in the default text colour. Words which are found in the second word list (see below) are printed in red and words which are not in either of the lists are printed in blue.

The off-list words are listed separately, and this list will contain new or unfamiliar words, as well as genre-specific words. The analysed texts and wordlists are saved and can be viewed here.
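The banding step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the Word Profiler's actual code: the function name profile_text, the tokenising regular expression and the tiny sample lists are all assumptions made for the example.

```python
import re
from collections import Counter

def profile_text(text, first_list, second_list):
    """Assign every word in `text` to the first list, the second list or off-list.

    Hypothetical helper: `first_list` and `second_list` are plain sets of
    lower-case words standing in for the two frequency lists.
    """
    # Assumed tokenisation: runs of letters, optionally with an apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter()
    off_list = []
    for token in tokens:
        if token in first_list:
            counts["first"] += 1       # left in the default text colour
        elif token in second_list:
            counts["second"] += 1      # printed in red
        else:
            counts["off"] += 1         # printed in blue
            if token not in off_list:
                off_list.append(token) # off-list words are listed separately
    total = len(tokens)
    percentages = {}
    if total:
        percentages = {band: round(100 * n / total, 2) for band, n in counts.items()}
    return counts, percentages, off_list

# Toy lists for illustration only; the real lists hold thousands of words.
first = {"the", "was", "a", "in", "of"}
second = {"goose", "hat"}
counts, percentages, off_list = profile_text(
    "The goose was in the hat of a commissionaire.", first, second
)
print(counts)
print(percentages)
print(off_list)
```

The first list is checked before the second, so a word that happened to appear in both would be counted only once, in the more frequent band.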
The Word Profiler provides three ways to analyse texts.

1. Profile a text by comparing its words with the MFWL 1-2k and MFWL 2-5K
2. Profile a text against the first 1000 Most Frequent Word Families and the second 1000 Most Frequent Word Families in Academic English (MFWL K1 - MFWL K2)
3. Profile a text with the MFWL K1 + K2 and the Academic Word List

Sample Profiles for The Blue Carbuncle by Arthur Conan Doyle

A comparison of the profiles for The Blue Carbuncle

Total number of words parsed in this text = 7819

Profile  Word list                                        No. of words  % of total
1        2000 Most Frequent Words                         6331          80.97
         Next 2-5K Most Frequent Words                    579           7.41
         off list                                         909           11.63
2        1000 Most Frequent Word Families (K1)            6196          79.24
         Next 1001-2000 Most Frequent Word Families (K2)  473           6.05
         off list                                         1150          14.71
3        2000 Most Frequent Word Families (K1 + K2)       6669          85.29
         Academic Word List                               112           1.43
         off list                                         1038          13.28
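
Each percentage in the table is simply the number of words in that band divided by the 7819 words parsed. The short check below reproduces the figures; the variable and function names are only illustrative.

```python
TOTAL_WORDS = 7819  # total number of words parsed, as stated above

# (profile, band) -> number of words, copied from the table
counts = {
    (1, "first list"): 6331, (1, "second list"): 579, (1, "off list"): 909,
    (2, "first list"): 6196, (2, "second list"): 473, (2, "off list"): 1150,
    (3, "first list"): 6669, (3, "second list"): 112, (3, "off list"): 1038,
}

def percent(count, total=TOTAL_WORDS):
    """Share of the whole text, rounded to two decimal places."""
    return round(100 * count / total, 2)

for (profile, band), n in counts.items():
    print(profile, band, n, percent(n))

# Within each profile the three bands account for every word in the text.
for profile in (1, 2, 3):
    band_total = sum(n for (p, _), n in counts.items() if p == profile)
    print("Profile", profile, "total:", band_total)
```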

1. Profile a text by comparing its words with the MFWL 1-2k and MFWL 2-5K

This analysis compares the words of a text with the Most Frequent Word Lists, which were built from the Brown Corpus using Concapp for Windows. The lists are based solely on word counts produced by the Unique Words Profiler, which lists the number of instances of each word (the Brown Corpus comprises 1,015,945 words, of which 47,198 are unique). The start of the list is as follows:

    Word  Instances  % Frequency
1.  The   69970      6.8872
2.  of    36410      3.5839
3.  and   28854      2.8401
4.  to    26154      2.5744
5.  a     23363      2.2996
6.  in    21345      2.1010
7.  that  10594      1.0428
8.  is    10102      0.9943

The first list contains the first 2000 most frequent words (1-2000) and the second list contains the next three thousand most frequent words (2001-5000). These lists reflect general non-academic English as it is used in newspapers, magazines and books.
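
A ranked list of this kind, and its split into a 1-2000 band and a 2001-5000 band, can be sketched as follows. This is not how Concapp works internally; it is a minimal illustration using Python's collections.Counter, with made-up helper names and a toy corpus.

```python
import re
from collections import Counter

def frequency_list(corpus_text):
    """Return (word, instances, % frequency) rows, most frequent first."""
    tokens = re.findall(r"[a-z]+", corpus_text.lower())  # assumed tokenisation
    total = len(tokens)
    rows = []
    for word, n in Counter(tokens).most_common():
        rows.append((word, n, round(100 * n / total, 4)))
    return rows

def split_bands(rows, first_size=2000, second_size=3000):
    """Cut the ranked list into the first band and the band that follows it."""
    words = [word for word, _, _ in rows]
    first = set(words[:first_size])
    second = set(words[first_size:first_size + second_size])
    return first, second

# Toy corpus and toy band sizes, for illustration only.
toy_corpus = "the cat and the dog and the bird"
rows = frequency_list(toy_corpus)
first, second = split_bands(rows, first_size=2, second_size=2)
print(rows)
print(first, second)
```

The same division gives the percentages quoted above: 69970 instances out of 1,015,945 words is 6.8872%.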

Profile 1 gives the lowest number of off-list words (909).

2. Profile a text against the Most Frequent Word Families in Academic English (MFWL K1 - MFWL K2)

This analysis compares the words of a text with the first 1000 (K1) and the second 1000 (1001 - 2000, K2) Most Frequent Word Family lists for Academic English, which were developed by Paul Nation of the School of Linguistics and Applied Language Studies at Victoria University of Wellington, New Zealand. These lists are more sophisticated than the lists created from the Brown Corpus, as they contain not only the high frequency words themselves but also derived words which may in fact not be used so frequently. For instance, the word ACCEPT is listed in the K1 list of the first 1000 most frequent word families, together with its derived words:

ACCEPT
  • acceptability
  • acceptable
  • unacceptable
  • acceptance
  • accepted
  • accepting
  • accepts
Because derived words such as unacceptable and acceptability are also included, the total number of words in the K1 list is actually 4105. Similarly, list K2 contains 3711 words. Comparing the first two ways of profiling a text, there is in fact not much difference in the number of words which appear in black, red or blue (off list). Profile 1 compares the words against the two MFW lists, one of 2000 words and the other of 3000 words. The figures in the table show that the results of Profile 1 and Profile 2 are not greatly different.

           black          red          blue
Profile 1  6331 (80.97%)  579 (7.41%)  909 (11.63%)
Profile 2  6196 (79.24%)  473 (6.05%)  1150 (14.71%)

To see the differences in the off-list wordlists found in the three profiles see the comparison of lists page. The figures show a bigger difference in the number of words in red, that is, words found in the second list. In fact, more words are found in the second MFW list in Profile 1 than in the K2 list used in Profile 2, although K2 contains more words (3711) than the second list in Profile 1 (3000). The combined total number of words in the two lists in Profile 1 is 5000, compared with 7816 in the two lists used in Profile 2. In spite of this, the proportion of off-list words in Profile 1 is about 3 percentage points lower than in Profile 2.
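
The gap between the number of word families and the number of words in a list comes from expanding each headword into its derived forms. A minimal sketch, using only the ACCEPT family shown earlier (the dictionary layout and function name are assumptions, not the format of the actual list files):

```python
# One word family, taken from the ACCEPT example above.
families = {
    "accept": [
        "acceptability", "acceptable", "unacceptable", "acceptance",
        "accepted", "accepting", "accepts",
    ],
}

def expand_families(families):
    """Map every word form (headword or derived word) to its headword."""
    form_to_head = {}
    for head, derived in families.items():
        form_to_head[head] = head
        for form in derived:
            form_to_head[form] = head
    return form_to_head

lookup = expand_families(families)
print(len(families), "family gives", len(lookup), "word forms")
print(lookup.get("unacceptable"))  # headword of a derived form
print(lookup.get("carbuncle"))     # None, because it is not in this toy list
```

Profiling against a family list then amounts to checking each word of the text against the expanded set of forms rather than against the headwords alone.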

3. Profile a text with the MFWL K1 + K2 and the Academic Word List

This analysis compares the words of a text with the 2000 Most Frequent Word Families and the Academic Word List as compiled by Paul Nation. These lists reflect academic English as it is used in universities.

Academic Word List

The Academic Word List is listed in the Net Dictionary Index and contains 570 word families, comprising 3,110 words, which were selected according to their frequency of use in academic texts. The list does not include words that are in the most frequent 2000 words of English. The AWL was primarily made so that it could be used by teachers as part of a programme preparing learners for tertiary level study or used by students working alone to learn the words most needed to study at tertiary institutions.

Using the Word Frequency Profiler to provide an objective test of readability

You can use these text analysis functions to measure text readability by comparing texts that you use at different levels against the word frequency lists. Examples of two texts which have been profiled in this way can be seen here:

Both of these examples have been profiled using two word lists:

  1. The 2000 most frequently used words (based on the Brown Corpus)

  2. The next 3000 most frequently used words, 2001 - 5000 (Brown Corpus)

The first example text, "American History", is a simplified text which I wrote specifically for lower intermediate EFL students. About 90% of the words are in the first 2K MFW list, and only 4 words are not in either list. 

The second text, "The good language learner", was not written for language students at all, and only about 75% of the words are found in the first 2K MFW list. There are 81 words not in either list, over 14% of the total, so the second text clearly presents much more difficulty for the learner.

These percentages could therefore be used as an objective way of gauging readability.
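
One way to turn such percentages into a comparison between texts is sketched below. The function name, the toy word list and the two sample sentences are invented for the example; they are not the texts profiled above.

```python
import re

def first_list_coverage(text, first_list):
    """Percentage of the words in `text` that appear in `first_list`."""
    tokens = re.findall(r"[a-z]+", text.lower())  # assumed tokenisation
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in first_list)
    return round(100 * hits / len(tokens), 2)

# Toy stand-in for the first 2K list, for illustration only.
first_list = {"the", "people", "came", "to", "a", "new", "land", "of", "is"}
texts = {
    "simplified": "The people came to a new land.",
    "unsimplified": "The acquisition of a language is idiosyncratic.",
}

# Higher coverage by the most frequent words suggests an easier text.
coverage = {name: first_list_coverage(text, first_list) for name, text in texts.items()}
ranked = sorted(coverage, key=coverage.get, reverse=True)
print(coverage)
print(ranked)
```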

edict virtual language centre.
All Rights Reserved.