Databanks and Knowledge Databases: A Critical Study of Cabré and Melby

posted by: Ravi Kumar

The research papers “From Terminological Data Banks to Knowledge Databases: The Text as the Starting Point” by M. Teresa Cabré (2006) and “Terminology in the Age of Multilingual Corpora” by Alan K. Melby (2012) confirm the opinion of Pierre Auger, who presented his futuristic view of terminology in 1989 and predicted that knowledge banks would play a major role in the coming years.

Both Cabré and Melby describe how advances in computer science have influenced all walks of life, including language processing and terminology. The advancement of language technologies and artificial intelligence has brought dynamic changes to the translation process and terminology management.

Cabré first discusses the evolution of electronic linguistic resources: terminological and lexical data banks from which the nomenclature of thematic glossaries and dictionaries could be extracted. These were used as reference resources to answer linguistic queries arising in translation and standardization contexts. They followed a database record format in which each lexical or terminological unit is described in terms of grammatical information, domain and definition. In multilingual banks, similar fields are applied to the equivalents in each language.

In the second stage, text banks became useful tools for translators: they include not only lexical units and the fields described above, but also multiple non-fragmented authentic contexts of a given lexical unit, giving translators the option of checking usage as well as meaning. Text banks thus became test beds for describing units in discourse. This was followed by the evolution of large monolingual corpora, often called reference corpora, whose texts are selected on criteria of representativeness and balance. Corpora such as the COBUILD corpus (Bank of English), the Corpus de Referencia del Español Actual (CREA) and the European Union's LE-PAROLE project have played an important role in the compilation of large linguistic resources, at times containing many millions of words, yet another step forward in the advancement of electronic linguistic resources.

Cabré then describes a third stage, which involves moving from large general-language corpora to the compilation of smaller corpora with more focused content. She identifies the following three developments in this direction, in which information technology plays a key role in storing, updating and accessing data in a targeted, user-friendly and efficient way.

1.      Development of text banks on specialized subjects. They help:

  • a)      to create standardized terminology
  • b)      to designate new concepts (neologisms) that have come into use on account of exponential growth of science and technology, not only at national level but also in international communication. Hence, they facilitate translation, multiculturalism, specialized training, etc.
  • c)      to develop tools for the automatic identification and extraction of candidate terms, text summaries, terminological data banks and ontologies.

2.      Development of text banks on pragmatic and communicative criteria

  • a)     A text bank classified by genre provides explicit access to the various genres and includes text-type information about each text. This facilitates the descriptive linguistic analysis of specialized texts and makes it possible to compare them on the basis of the selection and frequency of the grammatical devices used. An example is the genre-based text bank compiled at the Faculty of Translation and Interpretation at Universitat Jaume I, Castelló.

3.      Tagged Banks

  • a)     Text banks have been enriched by adding information to linguistic units through tags. The encoded information may be morphological, syntactic, semantic or pragmatic in nature. This gives users various options for searching the text bank and retrieving texts, terms, lexical units or related forms belonging to a given lemma.
  • b)   Among the many available tagging options, part-of-speech tagging is a critical one: it allows data to be processed automatically on the basis of linguistic criteria, and not merely on superficial recognition of character strings (as in tools such as TACT or WordSmith). To avoid superficial recognition of this type, texts need to be tagged at least for parts of speech.
  • c)      In addition to part-of-speech tagging, morphological tagging breaks each lexical unit down into its constituent morphemes. This makes it possible to retrieve groups of units sharing the same morphological configuration.
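The difference between string matching and tag-based querying can be sketched in a few lines. The data below is hypothetical and purely illustrative: a tagged text bank stored as (surface form, lemma, part-of-speech) triples, as a tagger might produce, which lets us search by lemma or category instead of by raw character strings.

```python
# Hypothetical tagged text bank: (surface form, lemma, POS tag) triples.
TAGGED_BANK = [
    ("genes", "gene", "NOUN"),
    ("gene", "gene", "NOUN"),
    ("genetic", "genetic", "ADJ"),
    ("sequenced", "sequence", "VERB"),
    ("sequences", "sequence", "NOUN"),
]

def forms_of_lemma(bank, lemma):
    """Return every surface form in the bank belonging to a given lemma."""
    return sorted({form for form, lem, _ in bank if lem == lemma})

def by_pos(bank, pos):
    """Return all tokens carrying a given part-of-speech tag."""
    return [form for form, _, tag in bank if tag == pos]

print(forms_of_lemma(TAGGED_BANK, "sequence"))  # ['sequenced', 'sequences']
print(by_pos(TAGGED_BANK, "NOUN"))              # ['genes', 'gene', 'sequences']
```

A plain string search for "sequence" would miss "sequenced" or lump noun and verb uses together; the tags make the linguistic distinctions explicit.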

Cabré links these developments to integrated digital resources termed knowledge databases, which are more comprehensive but also more expensive and labor-intensive. She explains that knowledge databases are superior to terminological data banks for several reasons: they act as knowledge repositories of a specialized field and provide semantic knowledge, not just the contexts in which a unit appears. In general, a knowledge database helps:

  • 1.      to retrieve all the contexts in which a unit appears,
  • 2.      to retrieve semantic knowledge of each unit,
  • 3.      to retrieve the distinctive contexts of each unit through ontologies,
  • 4.      to retrieve concepts designated by each term,
  • 5.      to retrieve other concepts in the same field connected by different relations (hypernymy, a word more generic than a given word; hyponymy, a word more specific than a given word; meronymy, a part-whole relation; causality, etc.), and
  • 6.      to retrieve the set of relationships between the content of a term, its concept and the other concepts in the specialized field.
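The relational queries in the list above can be illustrated with a minimal concept graph. The concepts and relation names below are hypothetical examples, not taken from any actual knowledge database; each edge is a (concept, relation, related concept) triple.

```python
# Hypothetical ontology fragment: (concept, relation, related concept).
RELATIONS = [
    ("chromosome", "hypernym", "cell structure"),  # more generic concept
    ("gene", "meronym_of", "chromosome"),          # part-whole relation
    ("mutation", "causes", "genetic disease"),     # causality
]

def related(edges, source, relation):
    """All concepts linked to `source` by a given relation type."""
    return [tgt for src, rel, tgt in edges if src == source and rel == relation]

print(related(RELATIONS, "gene", "meronym_of"))  # ['chromosome']
print(related(RELATIONS, "mutation", "causes"))  # ['genetic disease']
```

A terminological data bank stores only records and contexts; adding an explicit relation layer like this is what lets a knowledge database answer conceptual queries of the kind Cabré lists.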

On the utility of knowledge databases, Cabré explains that they can be useful to members of various professions involved in the communication of specialized knowledge:

  • a)      Translators and interpreters can find answers to linguistic and conceptual queries.
  • b)      Terminologists and lexicographers can use them to facilitate the elaboration of general and specialized dictionaries.
  • c)      Documentation professionals can use them to elaborate thesauri and classifications to index documents and to facilitate information retrieval.
  • d)      Technical writers can use them for increasing their knowledge in specialized domains.

Cabré further demonstrates that it is possible to develop an integrated knowledge database. She describes in detail “The Genome Project” (2004), which integrated the following four types of resources into a single platform:

a)      a text bank of a specialized field

b)      a documentary bank on bibliographic information and other relevant information on texts

c)       a terminological bank of the term records/units belonging to the field

d)      an ontology that represents the concept structure of the field

 

The main characteristics of the above-mentioned four modules are summed up in the table below:

| Name | Content | Explanation/comments |
| --- | --- | --- |
| Documentary bank | The GENDOFAC database contains all bibliographical references of the sources of the contexts related to the terms recorded: monographs, magazines, journal articles, references, etc. | Covers subfields of the Human Genome Project: internal structure, genetic engineering, diseases, genetic research, immunology, biotechnology, etc. |
| Text bank | A set of texts on the human genome, thematically arranged by the subfields above, in Catalan, Spanish and English. | Texts were morphologically tagged by the research team. |
| Terminological bank | Developed in parallel with the ontology, following the single-concept principle: each term has one associated concept. | Follows the theoretical framework of the Communicative Theory of Terminology (CTT) described in Cabré (1999, 2003), which gives scope to explain the situation and understand the usage of a term. |
| Ontology | Functional tags are added to concepts, yielding a set of relationships between each concept and the other concepts in the field. Built with the OntoTerm system, which includes an ontology editor, a browser and an HTML code generator. | The relations belong to a closed list: similarity (positive, negative), inclusion (class inclusion or hyponymy), sequentiality (place, time), causality (causal), instrument, meronymy, association (general, specialized). |

Thus, we see that it is possible to create an integrated knowledge database. Cabré demonstrates how three viewpoints can converge: the cognitive (the concept), the linguistic (the term) and the communicative (the situation). She proposed her “theory of doors” in her 2003 research on theories of terminology and applied it in the Genome project. By combining a text bank, a documentary bank, a terminological bank and an ontology on a common platform, she opened a bigger, integrated door for the terminological unit, which remains at the core of terminology.

Along similar lines, Alan K. Melby, in his research paper “Terminology in the Age of Multilingual Corpora” (2012), deals with the importance of terminology management; termbases versus glossaries; multilingual corpora and translation technology; an optimistic view of the human role in terminology; and problems of standardization and interoperability in TermBase eXchange (TBX) files. He calls for collaborative efforts by translation and language technology providers to reach a consensus on a unified TBX, so that knowledge sharing can advance without technological hindrances.

Melby’s approach coincides with the second and third stages described above by Cabré. He discusses multilingual corpora and their growing importance in machine translation platforms such as Google Translate, Bing Translator and the open-source Moses project. In this context, Bowker (2011) opines that termbases will be replaced by online multilingual corpora. Melby refutes this idea and emphasizes the need for specialized, domain-specific termbases.

While explaining what a termbase is, Melby, along with his colleagues Inge Karsch and Jost Zetzsche, refers to the principles of concept-oriented terminology work (ISO 704:2009) and sides with the traditional school of terminology: the onomasiological approach, which treats the concept as prior to the terms that designate it and follows a monosemous relationship, the single-concept principle. He differentiates a termbase from an electronic general-purpose lexicographical dictionary, which records general meaning and usage. He also affirms that concept orientation in a termbase does not imply that concepts are universal across all cultures and time periods.

Melby differentiates between termbases and glossaries. A monolingual glossary containing a collection of terms and definitions relevant to a particular domain or project, generally used to maintain consistency, may be considered a monolingual termbase. On the other hand, a two-column bilingual list of terms created from a source text without following terminology principles, generally prepared by translators for their own reference and use in translation, does not meet even the minimal requirement of a termbase, the single-concept principle, and hence does not qualify as termbase content. Melby goes on to describe the basic structure of a termbase and the information needed to create one, referring to the LISA Terminology SIG (2008), ISO 12620 (1999) and ISOcat (2009). He also refers to Marshman et al. (2012) on concept relations such as generic-specific and part-whole.
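The single-concept objection to flat bilingual lists can be made concrete. The glossary below is a hypothetical example: a flat two-column list cannot express concept distinctions, so one surface term may silently cover several concepts. The check flags source terms with more than one recorded equivalent, which a concept-oriented termbase would need to split into separate entries.

```python
# Hypothetical two-column glossary: (source term, target term) pairs.
GLOSSARY = [
    ("bank", "banco"),   # financial institution
    ("bank", "orilla"),  # river bank: a different concept, same term
    ("gene", "gen"),
]

def ambiguous_terms(pairs):
    """Source terms with multiple distinct equivalents in the glossary."""
    targets = {}
    for src, tgt in pairs:
        targets.setdefault(src, set()).add(tgt)
    return sorted(src for src, tgts in targets.items() if len(tgts) > 1)

print(ambiguous_terms(GLOSSARY))  # ['bank']
```

Nothing in the two-column format tells us whether "banco" and "orilla" are synonyms or designate different concepts; that missing distinction is precisely why such lists fail the single-concept principle.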

Melby presents a critical view of the impact of multilingual corpora and translation technology, which tend to lessen the need for termbases, and notes that new techniques combining multilingual corpora, machine translation and translation memory features can produce better results. He also stresses the importance of ‘clean’ data for creating automated glossaries or termbases. Although the new technologies give access to large multilingual corpora, most of them do not contain clean data, so human intervention becomes mandatory and may slow down or impede the translation process. There remains, therefore, a need for domain-specific corpora, high-quality termbases, and human intervention for quality control, consistency and other linguistic validation once the automated processing of translation or terminology is done. This in turn implies new roles for translators and terminologists in the translation process.

Based on the optimistic view of translation industry leaders, Melby emphasizes the growing importance of termbases that are bound to become more and more automated. As the automation of MT and termbases evolve, the need for professional translators and professional terminologists who can enhance these automatically generated resources will also increase.

Having made the case for the continued demand for termbases, Melby describes the TermBase eXchange (TBX) file format, which represents the information in a high-end termbase in a neutral intermediate format compliant with the Terminological Markup Framework (TMF, ISO 16642:2003), and calls for all termbases to be standardized on the TMF metamodel so that they can be fully represented in TBX.

Melby explains the structure of the TMF metamodel as defined in ISO 30042:2008, which is simple and flexible and makes it possible to manage conceptual relations between entries. This opens wider avenues for the Semantic Web: TBX provides a solid foundation for representing such relations in termbases, and hence greater possibilities of interaction between TBX and OWL (the Web Ontology Language).
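To make the concept-oriented layout of TBX tangible, here is a simplified sketch using only Python's standard library. The fragment below is a toy example modeled on the termEntry/langSet/tig/term nesting of TBX (ISO 30042:2008); it is trimmed of the required wrapper elements and metadata a real TBX file carries, so it illustrates the structure only.

```python
import xml.etree.ElementTree as ET

# Toy fragment modeled on TBX nesting: one concept entry, terms per language.
SAMPLE = """
<termEntry id="c1">
  <langSet xml:lang="en"><tig><term>gene</term></tig></langSet>
  <langSet xml:lang="es"><tig><term>gen</term></tig></langSet>
</termEntry>
"""

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def terms_by_language(entry_xml):
    """Map language code -> terms for one concept entry (single-concept principle)."""
    entry = ET.fromstring(entry_xml)
    out = {}
    for lang_set in entry.findall("langSet"):
        out[lang_set.get(XML_LANG)] = [t.text for t in lang_set.iter("term")]
    return out

print(terms_by_language(SAMPLE))  # {'en': ['gene'], 'es': ['gen']}
```

Because all terms hang off one concept entry, equivalence across languages is carried by the structure itself, which is what makes TBX a neutral interchange format between tools.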

Melby further notes the growing importance of TBX: it has been accepted by major providers of terminology management tools and is also used by Termium, the upcoming NATO terminology management system, and the Microsoft glossaries.

Finally, Melby sketches a future scenario of widespread TBX use that would allow terminology to be shared freely within a supply chain. End users could change service providers without having to invest heavily in new tools; the document supply chain within an organization would improve through quality control of content by various authors; and coordination and collaboration would become easier.

Conclusion

Given the technological development of various disciplines and the need to adapt to new social realities, Cabré's approach seems a realistic one. Without entering into conflict with the various approaches to terminology, she recognizes the need to converge them on a common platform, develops an integrated approach combining a documentary bank, a text bank, a terminological bank and an ontology, and moves away from general corpora toward smaller but more specialized knowledge databases.

Similarly, Melby's observation on the dynamic nature of terminology management gives us insight into the need to standardize termbases in the TBX file format, defined according to the Terminological Markup Framework (TMF). This would help terminologists, translators, technology providers, language policy planners and end users to increase the efficiency of workflows and supply chains, and would make databases portable across the globe. This would, in turn, facilitate the advancement of knowledge and create new avenues of specialized work in the language, translation and terminology domains.

 Some concerns

  1. With the growing automation of translation and translation processes, what is the future of terminology and of terminologists?
  2. Has Cabré been able to resolve the conflict between the traditional approach and the sociolinguistic approach?
  3. Cabré, Melby and many national and international organizations continue to follow the single-concept principle. How far can they address the concerns raised by newer approaches to terminology, which question the communicative aspect of languages and the prototypicality of concepts?
  4. With the rise of hybrid machine translation and translation memory tools, and of specialized knowledge banks, what is the future of termbases and the TBX file format?

 REFERENCES

Cabré, M. Teresa (2006). “From Terminological Data Banks to Knowledge Databases: The Text as the Starting Point.” Lynne Bowker (ed.), Lexicography, Terminology and Translation: Text-based Studies in Honour of Ingrid Meyer. Ottawa: University of Ottawa Press, 93-105.

Melby, Alan K. (2012). “Terminology in the Age of Multilingual Corpora.” The Journal of Specialised Translation 18, 7-27.

Bowker, Lynne (2011). “Off the record and on the fly.” Alet Kruger, Kim Wallmach and Jeremy Munday (eds), Corpus-based Translation Studies: Research and Applications. London: Continuum, 211-236.

Marshman, Elizabeth, Julie L. Gariépy and Charissa Harms (2012). “Helping Language Professionals Relate to Terms: Terminological Relations and Termbases.” The Journal of Specialised Translation 18, 45-71.
