JWE Abstracts 

Vol.2 No.4 October, 2004

Editorial (pp213-214)
        R. Baeza-Yates
Research Articles and Reviews:
Discovering Search Engine Related Queries Using Association Rules (pp215-227)
        B.M. Fonseca, P.B. Golgher, E.S. de Moura, B. Possas and N. Ziviani
This work presents a method for online generation of query-related suggestions for a Web search engine. The method uses association rules to extract related queries from the log of queries submitted to the search engine. Experiments were performed on a real log containing more than 2.3 million queries submitted to a commercial search engine. For the top 5 related terms, our method presented correct suggestions 90.5% of the time. Using queries randomly selected from a log, we obtained 93.45% correct suggestions. A study of user behavior showed that in 92.23% of the clicks on suggestions, users found useful information. The same approach can be used to provide terms for the classic problem of query expansion. For instance, the average precision of the answers of the Google search engine was improved by 23.16% using our approach as a query expansion method.
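
The abstract does not spell out how the rules are mined, so the following is only a minimal sketch of pairwise association-rule mining over a query log. It assumes the log has already been split into user sessions (lists of query strings); the function name `related_queries` and the thresholds `min_support` and `min_confidence` are hypothetical, not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def related_queries(sessions, min_support=2, min_confidence=0.5):
    """Mine pairwise rules q -> r from sessions (assumed: lists of query strings).

    support(q, r)    = number of sessions containing both q and r
    confidence(q->r) = support(q, r) / number of sessions containing q
    """
    query_count = Counter()
    pair_count = Counter()
    for session in sessions:
        unique = sorted(set(session))  # each query counts once per session
        query_count.update(unique)
        pair_count.update(combinations(unique, 2))

    suggestions = {}
    for (a, b), support in pair_count.items():
        if support < min_support:
            continue
        # A co-occurring pair can yield a rule in either direction.
        for src, dst in ((a, b), (b, a)):
            confidence = support / query_count[src]
            if confidence >= min_confidence:
                suggestions.setdefault(src, []).append((dst, confidence))

    # Rank each query's suggestions by decreasing confidence.
    for src in suggestions:
        suggestions[src].sort(key=lambda item: (-item[1], item[0]))
    return suggestions
```

Under these assumptions, taking the first five entries of `suggestions[q]` would correspond to a "top 5" list of related queries for `q`.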

On The Evolution of Clusters of Near-Duplicate Web Pages (pp228-246)
        D. Fetterly, M. Manasse and M. Najork
This paper expands on a 1997 study of the amount and distribution of near-duplicate pages on the World Wide Web. We downloaded a set of 150 million web pages on a weekly basis over the span of 11 weeks. We then determined which of these pages are near-duplicates of one another, and tracked how clusters of near-duplicate documents evolved over time. We found that 29.2% of all web pages are very similar to other pages, and that 22.2% are virtually identical to other pages. We also found that clusters of near-duplicate documents are fairly stable: Two documents that are near-duplicates of one another are very likely to still be near-duplicates 10 weeks later. This result is of significant relevance to search engines: web crawlers can be fairly confident that two pages that have been found to be near-duplicates of one another will continue to be so for the foreseeable future, and may thus decide to recrawl only one version of that page, or at least to lower the download priority of the other versions, thereby freeing up crawling resources that can be brought to bear more productively somewhere else. Additionally, we visit issues raised in a 1999 study of the prevalence of mirrored content, that is, trees of web content accessible at multiple locations. We found that 4.9% of all web pages are mirrors.

Retrieving Similar Documents from the Web (pp247-261)
        A.R. Pereira Jr and N. Ziviani
This paper presents a mechanism for detecting and retrieving documents from the web with a similarity relation to a suspicious document. The process is composed of three stages: a) generation of a "fingerprint" of the suspicious document, b) gathering candidate documents from the web and c) comparison of each candidate document and the suspicious document. In the first stage, the fingerprint of the suspicious document is used as its identification. The fingerprint is composed of representative sentences of the document. In the second stage, the sentences composing the fingerprint are used as queries submitted to a search engine. The documents identified by the URLs returned from the search engine are collected to form a set of similarity candidate documents. In the third stage, the candidate documents are compared to the suspicious document. The process of comparing the documents uses two different methods: Shingles and Patricia tree.
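
The abstract names shingles as one comparison method without giving details. As an illustration only, the sketch below uses the common word-level w-shingling formulation, where resemblance is the Jaccard coefficient of the two shingle sets. The function names, the whitespace tokenization, and the default window size `w=4` are assumptions and may differ from what the paper does.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word shingles of a text (assumed tokenization: whitespace)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < w:
        # Texts shorter than the window form a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(candidate, suspicious, w=4):
    """Jaccard coefficient of the two shingle sets, between 0.0 and 1.0."""
    a = shingles(candidate, w)
    b = shingles(suspicious, w)
    if not a and not b:
        return 0.0  # assumption: two empty documents are treated as unrelated
    return len(a & b) / len(a | b)
```

In the third stage described above, each candidate document could be scored against the suspicious document with `resemblance`, and candidates above some chosen threshold reported as similar.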

Ontology for Software Metrics and Indicators (pp262-281)
        L. Olsina and M. Martin
Software measurement and, even more so, web measurement (as a younger discipline) are currently at a stage in which terminologies, models, and methods are still being defined and consolidated. It is necessary to start reaching a common agreement among researchers and other stakeholders about primitive concepts such as attribute, metric, measure, measurement and calculation method, scale, elementary and global indicator, and calculable concept, among others. There are various useful recently issued ISO standards related to software quality models, measurement, and evaluation processes; however, we sometimes observe a lack of sound consensus on the same terms across different documents or, sometimes, absent terms. In this manuscript, we present an ontology for software metrics and indicators, based as much as possible on the concepts of those standards, which can be useful to support different assurance processes, methods and tools, in addition to being the foundation for our cataloging web system. In order to illustrate the ontology, we focus particularly on a set of intermediate representations for the domain (such as UML diagrams and tables), which were produced during the conceptualisation step. In addition, a discussion of the decisions taken in choosing the terms is presented. Without sound and agreed-upon definitions of terms, attributes, and relationships it is difficult to assure metadata consistency and, ultimately, that data values are comparable on the same basis.

Designing Virtual Environments to Support Collaborative Work in Real Spaces (pp282-294)
        L. Guerrero, C. Collazos, J. Pino, S. Ochoa and F. Aguilera
Typical Collaborative Virtual Environments (CVEs) are a metaphor of real environments, but they are not a copy of them. It is very common in communities that members do not know each other or do not have a real space for meetings. The design of a CVE for people who know each other and interact in a real space is different from the traditional CVE design. It should consider the real location of each resource, appropriate awareness and communication strategies, and human-human and human-resource relations. Our University Department was selected as an example organizational unit for experimentation. We start with the real physical environment and design a CVE prototype to provide new collaboration features for people working in the unit and for those who will visit it. The approach has several advantages. First, people are familiar with the basic physical environment. Second, some activities requiring physical presence can be done with virtual presence, enabling employees to work in convenient ways. Third, new opportunities for collaborative work appear, as such work is easy to carry out with the proposed CVE. Finally, the approach is extensible, since new features can be added.

Structuring Information on the Web from Below: The case of Educational Organizations in Chile (pp292-304)
        E. Krsulovic-Morales and C. Gutierrez
This paper reports the framework and the experience of structuring and integrating information of educational organizations in Chile, using metadata along with Semantic Web ideas. We present an implementation for Computer Science departments and a more general framework for educational organizations.
