Vol.2 No.4 October, 2004
Editorial (pp213-214)
R Baeza-Yates
Research Articles and Reviews:
Discovering Search Engine Related Queries Using Association Rules
(pp215-227)
B.M. Fonseca, P.B. Golgher, E.S. de Moura, B. Possas and N. Ziviani
This work presents a method for the online generation of related-query
suggestions for a Web search engine. The method uses association rules
to extract related queries from the log of queries submitted to the
search engine. Experiments were performed on a real log
containing more than 2.3 million queries submitted to a commercial
search engine. For the top 5 related terms, our method presented correct
suggestions 90.5% of the time. Using queries randomly selected from
a log, we obtained 93.45% correct suggestions. A study of user
behavior showed that in 92.23% of the clicks on suggestions, users
found useful information. The same approach can be used to provide terms
for the classic problem of query expansion. For instance, the average
precision of the answers of the Google search engine was improved by
23.16% using our approach as a query expansion method.
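As a rough illustration of the idea only (not the authors' implementation), the sketch below mines pairwise association rules from user sessions, where each session is the list of queries one user submitted. The session data, the thresholds, and the function name `related_queries` are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def related_queries(sessions, min_support=2, min_confidence=0.5):
    """Map each query to its related queries, ranked by rule confidence.

    Hypothetical helper: a rule "a -> b" is kept when the pair (a, b)
    co-occurs in at least `min_support` sessions and
    support(a, b) / support(a) >= min_confidence.
    """
    query_count = Counter()
    pair_count = Counter()
    for session in sessions:
        unique = sorted(set(session))  # count each query once per session
        query_count.update(unique)
        pair_count.update(combinations(unique, 2))

    rules = {}
    for (a, b), support in pair_count.items():
        if support < min_support:
            continue
        # Evaluate the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / query_count[lhs]
            if confidence >= min_confidence:
                rules.setdefault(lhs, []).append((rhs, confidence))

    # Highest confidence first; ties broken alphabetically.
    for lhs in rules:
        rules[lhs].sort(key=lambda item: (-item[1], item[0]))
    return rules


# Made-up sessions, for illustration only.
sessions = [
    ["car rental", "cheap flights"],
    ["car rental", "cheap flights", "hotels"],
    ["car rental", "hotels"],
    ["cheap flights", "hotels"],
    ["weather"],
]
rules = related_queries(sessions)
for suggestion, confidence in rules["car rental"]:
    print(f"car rental -> {suggestion} ({confidence:.2f})")
```

Restricting rules to pairs of queries keeps the counting cheap enough to consider for a large log; a query that never co-occurs with another one, such as "weather" here, simply receives no suggestions.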
On The Evolution of Clusters of Near-Duplicate Web Pages
(pp228-246)
D. Fetterly, M. Manasse and M. Najork
This paper expands on a 1997 study of the amount and distribution of
near-duplicate pages on the World Wide Web. We downloaded a set of 150
million web pages on a weekly basis over the span of 11 weeks. We then
determined which of these pages are near-duplicates of one another, and
tracked how clusters of near-duplicate documents evolved over time. We
found that 29.2% of all web pages are very similar to other pages, and
that 22.2% are virtually identical to other pages. We also found that
clusters of near-duplicate documents are fairly stable: Two documents
that are near-duplicates of one another are very likely to still be
near-duplicates 10 weeks later. This result is of significant relevance
to search engines: web crawlers can be fairly confident that two pages
that have been found to be near-duplicates of one another will continue
to be so for the foreseeable future, and may thus decide to recrawl only
one version of that page, or at least to lower the download priority of
the other versions, thereby freeing up crawling resources that can be
brought to bear more productively somewhere else. Additionally, we visit
issues raised in a 1999 study of the prevalence of mirrored content,
that is, trees of web content accessible at multiple locations. We found
that 4.9% of all web pages are mirrors.
Retrieving Similar Documents from the Web (pp247-261)
A.R. Pereira Jr and N. Ziviani
This paper presents a mechanism for detecting and retrieving documents
from the web with a similarity relation to a suspicious document. The
process is composed of three stages: a) generation of a "fingerprint"
of the suspicious document, b) gathering candidate documents from the
web and c) comparison of each candidate document and the suspicious
document. In the first stage, the fingerprint of the suspicious document
is used as its identification. The fingerprint is composed of
representative sentences of the document. In the second stage, the
sentences composing the fingerprint are used as queries submitted to a
search engine. The documents identified by the URLs returned from the
search engine are collected to form a set of similarity candidate
documents. In the third
stage, the candidate documents are compared to the suspicious document.
The process of comparing the documents uses two different methods:
Shingles and Patricia tree.
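The third stage can be sketched for the shingle method alone, under assumptions that are not taken from the paper: word-level shingles of length 3, lower-casing as the only normalisation, and Jaccard resemblance as the score. The Patricia tree comparison is not covered, and the function names and sample sentences are hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Treat a very short text as a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(doc_a, doc_b, k=3):
    """Jaccard resemblance of the two documents' shingle sets, in [0, 1]."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    if not a and not b:
        return 1.0  # two empty documents are considered identical
    return len(a & b) / len(a | b)


# Made-up texts: the candidate differs from the suspicious one by one word.
suspicious = "the quick brown fox jumps over the lazy dog"
candidate = "the quick brown fox leaps over the lazy dog"
score = resemblance(suspicious, candidate)
print(f"resemblance = {score:.2f}")
```

A single changed word alters three of the seven shingles in each text, so the two sets share 4 of 10 distinct shingles and the score is 0.40. In practice a threshold on this score would decide which candidates are reported as similar.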
Ontology for Software Metrics and Indicators (pp262-281)
L. Olsina and M. Martin
Software measurement and, even more so, web measurement, as a younger discipline, are
currently in a stage in which terminologies, models, and methods are
still being defined and consolidated. It is a necessity to start
reaching a common agreement between researchers and other stakeholders
about primitive concepts such as attribute, metric, measure, measurement
and calculation method, scale, elementary and global indicator,
calculable concept, among others. There are various useful recently
issued ISO standards related to software quality models, measurement,
and evaluation processes; however, we sometimes observe a lack of
sound consensus on the same terms across different documents or,
sometimes, missing terms. In this manuscript, we present an ontology for
software metrics and indicators, based as much as possible on the
concepts of those standards, which can be useful to support different
assurance processes, methods and tools, in addition to being the foundation
for our cataloging web system. In order to illustrate the ontology, we
focus particularly on a set of intermediate representations for the
domain (such as UML diagrams and tables), which were yielded during the
conceptualisation step. In addition, a discussion about decisions that
have been taken in choosing the terms is presented. Without sound and
agreed-upon definitions of terms, attributes, and relationships, it is
difficult to ensure metadata consistency and, ultimately, that data values
are comparable on the same basis.
Designing Virtual Environments to Support Collaborative Work in Real
Spaces (pp282-294)
L. Guerrero, C. Collazos, J. Pino, S. Ochoa and F. Aguilera
Typical Collaborative Virtual Environments (CVEs) are metaphors of real
environments, but they are not copies of them. It is
very common in communities that members do not know each other or do not
have a real space for meetings. The design of a CVE for people who know
each other and interact in a real space is different to the traditional
CVE design. It should consider the real location of each resource,
appropriate awareness and communication strategies, and human-human and
human-resource relations. Our University Department was selected as an
example organizational unit for experimentation. We start with the real
physical environment and we design a CVE prototype to provide new
collaboration features for people working in the unit and for those who
will visit it. There are many advantages of the approach. First, people
are familiar with the basic physical environment. Second, some
activities requiring physical presence can be done with virtual
presence, enabling employees to work in convenient ways. Third, new
opportunities for collaborative work appear, as such work is easy to carry
out with the proposed CVE. Finally, the approach is extensible, since new
features can be added.
Structuring Information on the Web from Below: The case of
Educational Organizations in Chile (pp292-304)
E. Krsulovic-Morales and C. Gutierrez
This paper reports the framework and the experience of structuring and
integrating information of educational organizations in Chile, using
metadata along with Semantic Web ideas. We present an implementation for
Computer Science departments and a more general framework for
educational organizations.