Tracking English and Translated Arabic News using GHSOM


By Ali Selamat, Hanadi Hassen Ismail Mohammed.


After the September 11 attacks, the media contributed to clarifying the discrepancies between Eastern and Western cultures. As stated by Michael Binyon, a journalist at The Times newspaper, all journalists, whether they work for newspapers, television, or radio programs, are bound by their professional codes of ethics to provide truthful and accurate news. Unfortunately, very few media outlets can claim that everything they have reported is everything that really happened (Michael Binyon, 2008). Take, for example, the violence in Gaza: a journalist was asked by the publisher of a Middle Eastern newspaper why the BBC gave so much coverage to the rockets aimed by Hamas at Israeli towns but so little coverage to the Israeli air strikes in Gaza. His answer was that there were no BBC correspondents based in Gaza. This shows that there is a need to track similar news from different sources, because following the news from a single source may not always give the audience the real picture of an event. There are many advantages to finding similar content across different languages. For example, it can form the basis for multilingual summarization and for question answering support for web pages that provide questions and answers. It also facilitates comparative studies across national, ethnic, and cultural groups (Jin & Barrière 2005). Dittenbach et al. (2001) have stated that human categorization is based on grouping similar objects into a number of categories, in order to understand the differences between objects that belong to each defined category. Many approaches have been used for finding similarity across multilingual documents, such as neural networks, fuzzy clustering, genetic algorithms, and support vector machines. Most of these techniques are based on the assumption that a clean corpus is available, but the majority of these pages are written independently of each other (Jin & Barrière 2005).
Therefore, two versions of the same topic written in two different languages cannot simply be taken as parallel corpora. One of the techniques used to address this problem is the Self-Organizing Map (SOM) (Kohonen, 1990). It is an unsupervised neural network algorithm that provides a topology-preserving mapping from a high-dimensional document space onto a two-dimensional map space. Documents on similar topics are located in neighboring regions of the map (Dittenbach et al., 2001). SOM produces additional information about the affinity or similarity between the clusters themselves by arranging them on a 2D rectangular or
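
As a rough illustration of the topology-preserving mapping described above, the following is a minimal sketch of a plain SOM. It is not the growing hierarchical variant named in the title, and it is not the authors' implementation. The function names `train_som` and `best_matching_unit`, the 3x3 grid, the linear decay schedules, and the four-term toy document vectors are all assumptions made for illustration only.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the unit whose weight vector is closest to x."""
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows=3, cols=3, epochs=200, lr0=0.5, seed=0):
    """Train a small rectangular SOM with a Gaussian neighbourhood function."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Every map unit starts with a random weight vector in document space.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays toward 0
            sigma = max(sigma0 * (1.0 - frac), 0.5)  # neighbourhood radius shrinks
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Units near the winner on the grid are pulled harder toward x.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Made-up term weights over a hypothetical vocabulary [goal, match, election, vote].
docs = {
    "sports_1": [1.0, 0.9, 0.0, 0.0],
    "sports_2": [0.9, 1.0, 0.1, 0.0],
    "politics_1": [0.0, 0.0, 1.0, 0.9],
    "politics_2": [0.0, 0.1, 0.9, 1.0],
}
som = train_som(list(docs.values()))
for name, vec in docs.items():
    print(name, "->", best_matching_unit(som, vec))
```

After training, documents on the same topic would typically land on the same or neighbouring map units, while documents on different topics land on different units; the exact grid coordinates depend on the seed and the schedules chosen.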


  • Aalarab (2007). Aalarab news. 2007.
  • Adafre, Rijke (2006). Finding Similar Sentences across Multiple Languages in Wikipedia. 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 62-69, Trento, Italy, April 2006, Association for Computational Linguistics, Morristown, NJ, USA
  • Aljazeera (2007). Aljazeera news. 2007.
  • Asharqalawsat (2007). Asharqalawsat news. www.asharqalawsat. 2007.
  • BBC (2007). BBC news. 2007
  • Chen, R. Chau, C. Yeh (2004). Discovering Parallel Text from the World Wide Web. Proceedings of the second workshop on Australasian information security, Data Mining and Web Intelligence, and Software Internationalisation, pp. 157-161, Dunedin, New Zealand, 2004, Australian Computer Society, Inc., Darlinghurst, Australia
  • CNN (2007). CNN news. 2007
  • Dittenbach, Rauber, D. Merkl (2001). Business, Culture, Politics, and Sports - How to Find Your Way Through a Bulk of News? On Content-Based Hierarchical Structuring and Organization of Large Document Archives. Proceedings of the 12th International Conference on Database and Expert Systems Applications, pp. 200-210, 3-540-42527-6, Munich, Germany, September 2001, Springer-Verlag, London, UK
  • Evans (2005). Identifying Similarity in Text: Multi-Lingual Analysis for Summarization. Ph.D. Thesis. Columbia University. 2005
  • Helmut, Dieter (2005). A comparison of support vector machines and self-organizing maps for e-mail categorization. Proceedings of the 4th Australasian Data Mining Conference (AusDM’05), pp. 189-204, 1-86365-716-9, Sydney, Australia, December 2005, University of Technology Sydney, Australia
  • Jin, Barrière (2005). Exploring sentence variations with bilingual corpora. Corpus Linguistics 2005 conference, Birmingham, United Kingdom, July 2005, NRC Institute for Information Technology, Canada.
  • Kohonen (1990). The self-organizing map. Proceedings of the IEEE, pp. 1464-1480, 0018-9219, September 1990, IEEE.
  • Lee, Yang (2003). A Multilingual Text Mining Approach Based on Self-Organizing Maps. Applied Intelligence, Vol. 18, No. 3, (May 2003), pp. 295-310, 0924-669X.
  • Liu, Wang, Zheng (2005). Mental tasks classification and their EEG structures analysis by using the growing hierarchical self-organizing map. Neural Interface and Control, 2005, Proceedings of the First International Conference, pp. 115-118, 0-7803-8902-6, May 2005, IEEE.
  • Michael Binyon (2008). September 11th and the Western Media and Cross-Cultural Misunderstanding Role of Dialogue between Arab And West Seminar, Kuwait, 2008.
  • Selamat A. and Omatu S. (2004). Feature Selection and Categorization of Web Pages Using Neural Networks, Int. Journal of Information Sciences, Elsevier Science Inc., Vol. 158, (January 2004), pp. 69-88.
  • Selamat A., Choon N-C, Abu Bakar A.Z., Mikami Y. (2007). Arabic Script Web Documents Language Identification Using Decision Tree-ARTMAP Model, 2007 International Conference on Convergence Information Technology (ICCIT 2007), pp. 21-23, November 2007, Gyeongju-si, Gyeongbuk, Korea
  • Tangsripairoj, Samadzadeh (2005). Organizing and visualizing software repositories using the growing hierarchical self-organizing map. Proceedings of the 2005 ACM symposium on Applied computing SAC, pp. 1539-1545, 1-58113-964-0, Santa Fe, New Mexico, 2005, ACM, New York, NY, USA
  • Xafopoulos A., Kotropoulos C., Almpanidis G. and Pitas I. (2004). Language Identification in Web Documents Using Discrete HMMs. Pattern Recognition, Vol. 37, No. 3, (March 2004), pp. 583-594, 0031-3203.
  • Zamir (1999). Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. Ph.D. Thesis. University of Washington.
  • Zhai, Shah (2005). Tracking News Stories Across Different Sources. Proceedings of the 13th annual ACM international conference on Multimedia, pp. 2-10, 1-59593-044-2, Hilton Singapore, 2005, ACM, New York, NY, USA