In my sixth posting, I discussed the article “Beyond Concordance Lines: Using Concordances to Investigate Language Development” by Arshad Abd Samad (page 70 in OTL) with my partner, Lim Yang Sing.
This is a summary of the article. A language corpus is very useful as a basis for dictionaries and teaching materials, and concordance software helps in analysing language data. Corpora often inform us about how words and grammatical constructions are actually used. One benefit of using a corpus in language teaching and learning is that it helps students see the systematicity of language as an interesting linguistic puzzle rather than a set of boring rules to be memorized. Corpora in Malaysia have been compiled by researchers at Universiti Teknologi Malaysia (UTM) and Universiti Malaya (UM), and the English of Malaysian School Students (EMAS) corpus was compiled by researchers at Universiti Putra Malaysia (UPM). According to Arshad, the EMAS corpus used in the study consists of close to half a million words. About 800 students from Year 5, Form 1 and Form 4 contributed written data in the form of three essays each. The students who contributed to the corpus were considered above average in English language proficiency. The major criterion in selecting the essay topics was the amount of language they could elicit: students were expected to write more if they were familiar with the topic. There are various methods for determining language development. For example, many language acquisition studies focus on specific target structures and examine how those structures are acquired. Developmental patterns can be inferred by comparing the language used by the three different age groups; in particular, language productivity and vocabulary can be studied to reveal such patterns. When the language productivity of the three age groups was compared, the findings showed increasing cognitive maturity in the older students, who were able to produce longer essays with more complex sentences.
The diversity of the vocabulary used in a corpus can be determined by calculating the type-token ratio. According to Laufer and Nation, the ratio is calculated by dividing the number of separate words in a text (types) by the total number of words in the text (tokens). The research shows that the older respondents use a wider range of vocabulary in their essays, so vocabulary diversity increases steadily from the lower to the higher age groups. The nature of the written texts themselves may partly account for the low ratios obtained; even so, the average number of word types per student still increases from the lower to the higher age groups. EMAS is a learner corpus that retains the students’ spelling and grammatical errors. The sophistication of the vocabulary, on the other hand, can be determined using software such as RANGE, a vocabulary analysis program that gives an indication of the kind of vocabulary used. The observations show that the older age groups tend to use a wider range of words, and that the words they use are more sophisticated. In conclusion, concordance software helps a great deal in analysing language data. The article has attempted to show the relevance of corpus data for investigating language development without having to analyse concordance lines.
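To make the Laufer and Nation calculation concrete, here is a minimal Python sketch of the type-token ratio. It is only an illustration, not the procedure used in the study: it splits on whitespace and ignores punctuation and spelling normalization, which a real analysis of the EMAS corpus would have to handle.

```python
def type_token_ratio(text):
    """Divide the number of distinct word forms (types)
    by the total number of words (tokens)."""
    tokens = text.lower().split()
    types = set(tokens)
    return len(types) / len(tokens)

# A made-up example text: 13 tokens, 8 types.
essay = "the cat sat on the mat and the dog sat near the cat"
print(type_token_ratio(essay))  # 8 / 13, roughly 0.615
```

A longer text naturally repeats more words, which is one reason the article warns that the nature of the written text can depress the ratio.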
We read a text in order to understand what it says; we analyse it to discover how it says what it says to us. Analysis focuses on details such as individual words and phrases, the patterns they form, and the contexts required to make sense of the text as a whole. It is concerned with what we do as a matter of course when we read a text but pay little or no direct attention to. It seeks to explain our impressions and trace them back to their causes in the language, or perhaps to show us that we were mistaken. Not all texts are straightforwardly about what they seem or profess to be about. People commonly say one thing but mean another because they are lying, unaware, or confused, or because they are dealing with a subject too complex for direct treatment. Analysis may therefore uncover contrary or contradictory meanings in a text, or show how a subject is being avoided or indicated indirectly. It can show how an author manages, as it were, to speak the unsayable. Among the most basic tools of text analysis is the concordance.
A concordance is an alphabetical list of the principal words used in a book or body of work, with their immediate contexts. The first concordance was made in the late 12th or early 13th century as a means of marshalling evidence from the Bible for teaching and preaching; concordances for works of secular literature followed much later. A concordance derives its power for analysis from the fact that it allows us to see every place in a text where a particular word is used, and so to detect patterns of usage and, again, to marshal evidence for an argument. Since words express ideas, themes and motifs, a concordance is highly useful in detecting patterns of meaning as well. The concordance focuses on word-forms, however: not on what may be meant but on what is actually said. It is an empirical tool of textual research. A concordance is one kind of rearrangement to which a researcher might wish to subject a text in order to tease out its meaning. One might also wish to list all groups of contiguous words repeated in a text two or more times, or rank the words in a text in order of their frequency of occurrence, or chart the distribution of specific words across a text. Such transformations of a text, and any others we might devise, are known as “text-analysis”. A simple concordance program has several features: selection, lemmatization, collocation, display, sorting and frequency lists.
The first feature is selection, which lets you specify the word(s) for which you want to see a concordance. There are two main possibilities: a wordlist and a query. With a wordlist, the concordance program provides a complete list of words from which to select; MonoConc, for example, lists all the words in a corpus alphabetically or by frequency of occurrence. Most if not all concordance programs also offer a means of generating a concordance from a query, in which you specify the form you want together with optional symbols, called wildcards, that stand for any other letters. A query may also allow proximity searching, in which you ask for a specific word-form only when it is found within a certain number of words of another word-form. For example, if we were interested in finding where someone is said to possess a bag, we might select all passages in which the word “bag” is found within five words of “have”, “has” or “had”. Besides proximity searching, a query may allow phrase searching, in which you specify a fixed sequence of words to be found, such as “in case of”.
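The “bag within five words of have/has/had” example can be sketched in a few lines of Python. This is a toy illustration of the proximity-search idea, not how MonoConc or any other concordance program implements it; the sentence and function name are invented for the example.

```python
import re

def proximity_search(text, target, anchors, span=5):
    """Return the positions of `target` tokens that fall within
    `span` words of any of the `anchors` (proximity searching)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Window of `span` words on either side of the target.
        window = tokens[max(0, i - span): i + span + 1]
        if any(anchor in window for anchor in anchors):
            hits.append(i)
    return hits

sentence = "She has a large bag but he had no bag at all yesterday indeed truly"
print(proximity_search(sentence, "bag", {"have", "has", "had"}))  # [4, 9]
```

Wildcards in a query work similarly but match letter patterns within a word (a regular expression such as `r"bag\w*"` would match “bag”, “bags”, “baggage”).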
The second feature of a concordance program is lemmatization, which groups word-forms together under a single headword or lemma. Even with a powerful query language, one cannot easily group together all the related forms of highly inflected words such as “go” (“goes”, “gone”, “went”). One may also need to handle variations in spelling, such as between British and American forms, or accommodate other differences, such as between hyphenated and non-hyphenated forms. MonoConc, unfortunately, does not provide a means of grouping words under a common lemma. Concordance, a more sophisticated program, allows the user to create his or her own groups, and thus to lemmatize variant forms manually or to define a group of synonyms, e.g. “bag”, “luggage”, “back-pack”, “carry-all”.
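Manual lemmatization of the kind Concordance supports amounts to a user-defined mapping from variant forms to a headword. A small Python sketch (the table and names are my own, for illustration only):

```python
from collections import Counter

# User-defined lemma table, like the manual groups in Concordance.
LEMMAS = {"goes": "go", "gone": "go", "went": "go", "going": "go"}

def lemmatize(token):
    """Map a word-form to its headword; unknown forms map to themselves."""
    return LEMMAS.get(token, token)

def lemma_counts(text):
    """Count occurrences per lemma rather than per surface form."""
    return Counter(lemmatize(t) for t in text.lower().split())

print(lemma_counts("he goes she went they go"))  # counts "go" three times
```

The same mechanism handles spelling variants (“colour”/“color”) or synonym groups (“bag”, “luggage”, “carry-all”): only the table changes.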
The next feature is collocation, which discovers what words are found in close proximity to a given word. Interest in collocation is based on the idea that meaning tends to be communicated not so much by single words as by combinations of words within a specified distance known as the span. The span varies by language; for English, meaningful connections are likely to be found within five words on either side of the target word. Thus, to cite a trivial example, the fact that “the” collocates very frequently with “bag” in a given text, especially to the left of the word, suggests quite strongly that a particular bag is the object of interest.
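Counting collocates within a span is straightforward to sketch. Again this is only an illustration of the idea, with an invented sentence; real collocation tools also apply statistical measures rather than raw counts.

```python
from collections import Counter

def collocates(tokens, node, span=5):
    """Count the words found within `span` positions on either
    side of each occurrence of the node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + span + 1]
            counts.update(window)
    return counts

tokens = "she put the bag down and the bag fell".split()
print(collocates(tokens, "bag")["the"])  # "the" appears 4 times near "bag"
```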
The next feature is display, on screen with the option to print. The most popular and highly influential format for concordances is KWIC, or “keyword in context”, in which the target word is centred and an arbitrary amount of context is given on either side.
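A minimal sketch of a KWIC display in Python, assuming fixed-width context columns (the function name and widths are my own choices, not any particular program's):

```python
def kwic(tokens, keyword, width=20):
    """Produce keyword-in-context lines: the keyword centred,
    with up to `width` characters of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[:i])[-width:].rjust(width)
            right = " ".join(tokens[i + 1:])[:width].ljust(width)
            lines.append(f"{left} | {tok} | {right}")
    return lines

for line in kwic("the cat in the hat came back".split(), "the"):
    print(line)
```

Because every keyword sits in the same column, repeated phrasing around the word becomes visible at a glance, which is exactly what makes the format so effective.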
The fifth feature of a concordance program is sorting. The KWIC format is made much more effective if the lines can be sorted according to the words that occur before and after the selected word, as well as in the order in which the occurrences appear in the text.
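Sorting KWIC lines by the word immediately after the keyword (a "first-right" sort) groups recurring phrases together. A sketch, assuming occurrences are stored as (left context, keyword, right context) tuples as in the KWIC idea above:

```python
def sort_by_right(occurrences):
    """Sort (left, keyword, right) tuples by the word that
    immediately follows the keyword (first-right sort)."""
    return sorted(occurrences, key=lambda occ: occ[2].split()[:1])

occurrences = [
    ("she took a", "bag", "was heavy"),
    ("and then my", "bag", "fell down"),
]
print(sort_by_right(occurrences))  # "fell" sorts before "was"
```

A "first-left" sort would key on `occ[0].split()[-1:]` instead; good programs let the user choose either, or keep the original text order.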
The last feature is the frequency list. The frequency list is the simplest example of statistical information that may be gained by counting features of the text and then subjecting these counts to mathematical transformations. Frequency lists were included with alphabetic concordances long before concordances were first produced with the help of computing. The basic idea behind a frequency list is that the more frequently a word is used, the more likely it is to be important to the meaning of a text and to its stylistics. A frequency list is therefore sometimes useful in detecting the basic preoccupations of a text, especially when these do not coincide with its apparent subject, and for characterising the linguistic habits of the author.
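The frequency-list idea is a one-liner in Python; here is a small sketch with an invented example text.

```python
from collections import Counter

def frequency_list(text, top=5):
    """Rank the words of a text by frequency of occurrence,
    most frequent first."""
    tokens = text.lower().split()
    return Counter(tokens).most_common(top)

print(frequency_list("the bag and the cat and the dog"))
# "the" (3 times) ranks first, then "and" (2 times)
```

In practice the top of such a list is dominated by function words like “the” and “and”, so analysts often compare the list against a reference corpus before drawing conclusions about a text's preoccupations.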
“Concordance analysis of microbial genomes” by R. E. Bruccoleri, T. J. Dougherty and D. B. Davison (abstract): The set of proteins which are conserved across families of microbes contains important targets of new anti-microbial agents. We have developed a simple and efficient computational tool which determines concordances of putative gene products that show sets of proteins conserved across one set of user-specified genomes and not present in another set of user-specified genomes. The thresholds and the homology scoring criterion are selectable to allow the user to decide the stringency of the homologies. The system uses a relational database to store protein coding regions from different genomes, and to store the results of a complete comparison of all sequences against all sequences using the FASTA program. Using Web technology, the display of all the related proteins for a given sequence and calculation of multiple sequence alignments (using CLUSTALW) can be performed with the click of a button. The current database holds 97,365 sequences from 19 complete or partial genomes and 8,798,905 FASTA comparison results. An example concordance is presented which demonstrates that the target of the quinolone antibiotics could have been identified using this tool.
Above is the abstract of an article on the application of concordance analysis in context analysis. The word “of” was selected, and every occurrence of “of” in the passage was highlighted in red.
REFERENCES:
1. http://en.wikipedia.org/wiki/Concordance_(publishing)
2. http://en.wikipedia.org/wiki/concordance
3. www.concordancesoftware.co.uk
4. http://lexisnexis.com.concordance-35k-
5. http://nar.oxfordjournals.org/ogi/content/abstract/26/19/4482
6. http://en.wikipedia.org/wiki/KWIC