Scott Fields to Terry Brooks, Autumn 2004
Results of searches:
1. Googlebot tracking
In the ACM portal, I found no citations that were right on your topic, only some related to “query log analysis,” “web usage mining,” and “query processing.”
There are 8 citations in my binder for you to review. The closest one is this—
Discovery of Web Robot Sessions Based on their Navigational Patterns
On the web, most of what I find is discussion in web designers' online forums about search engine optimization (SEO) and tracking Googlebot with software like SpyderTrax and RobotStats. There were a few related articles in Search Engine Watch:
SpiderSpotting: When A Search Engine, Robot Or Crawler Visits
http://searchenginewatch.com/webmasters/article.php/2168001
Search Engines and Web Server Issues
http://searchenginewatch.com/searchday/article.php/3109261
Measuring Search Engine Success (search engine optimization)
http://searchenginewatch.com/searchday/article.php/3323241
Some web robot tracking software vendors—
CJ GoogleBot Activity http://www.cj-design.com/index.php?id=downloads&page=13
RobotStats http://www.robotstats.com/en/
SpyderTrax (Googletrax) http://www.darrinward.com/
Googlebot FAQs http://www.google.com/bot.html
Web robots FAQs http://www.robotstxt.org/wc/faq.html
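For reference, the basic idea behind this kind of tracking can be sketched in a few lines of Python. This is only a rough illustration, not how any of the tools above actually work: it assumes the server writes the common "combined" access log format and that the crawler identifies itself with "Googlebot" in its user-agent string (which can be spoofed). The function name, the regular expression, and the sample log lines (including the IP addresses) are made up for the example.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines):
    """Count requests per path whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Lines that do not fit the assumed format are skipped silently.
        if match and "googlebot" in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits

# Hypothetical log lines, for illustration only.
sample = [
    '66.249.64.1 - - [01/Nov/2004:10:00:00 -0800] "GET /index.html HTTP/1.0" 200 1043 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '10.0.0.5 - - [01/Nov/2004:10:00:05 -0800] "GET /index.html HTTP/1.1" 200 1043 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '66.249.64.1 - - [01/Nov/2004:10:01:00 -0800] "GET /robots.txt HTTP/1.0" 404 210 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

for path, count in sorted(googlebot_hits(sample).items()):
    print(path, count)
```

Matching on the user-agent string alone is the simplest approach; the robot-session paper in the binder looks at navigational patterns instead, which this sketch does not attempt.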
Some of the index terms and keywords I used in the ACM portal—
track* Googlebot
Googlebot activity
web traffic analy*
search engine optimization
search process
query log analysis
analysis web quer*
google query log
data mining
web robot detection
web usage mining
query processing
query AND optimization
2. Link rot, broken links, web decay, etc.
Nov. 1
Terry,
There is a lot of research on link rot/web decay/link reliability, etc. I found no consensus on solving the problem, but lots of ideas.
I placed 14 articles in a binder in the ACM portal.
Some of the ACM keywords:
404 return code, link analysis, web decay, broken URLs, dead links, lexical signatures, robust hyperlinks, reliability, 404, link, link integrity
Some of these articles are below:
http://doi.acm.org/10.1145/988672.988716
The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to just individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers, seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure by presenting a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.
2--
http://doi.acm.org/10.1145/564376.564381
ABSTRACT
A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
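To make the idea concrete, here is a minimal Python sketch of a TF-IDF-style lexical signature. It is not one of the eight methods from the paper, just the generic technique: rank a page's terms by term frequency times inverse document frequency and keep the top few. The tokenizer, the add-one smoothing, the default signature length of five terms, and the tiny corpus are all assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def lexical_signature(doc, corpus, k=5):
    """Return the k terms of `doc` with the highest TF-IDF score.

    `corpus` is a list of document strings used to estimate document
    frequency (DF); `k` is an assumed signature length.
    """
    tf = Counter(tokenize(doc))
    n_docs = len(corpus)
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))  # count each term once per document

    def score(term):
        # Add-one smoothing keeps terms absent from the corpus well defined.
        return tf[term] * math.log((1 + n_docs) / (1 + df[term]))

    # Highest score first; ties broken alphabetically for stable output.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]

corpus = [
    "the web is large",
    "the web decays",
    "the lexical signature of the page",
]
print(lexical_signature("the lexical signature signature", corpus, k=2))
```

The signature terms would then be submitted to a search engine as a query to relocate the page; as the abstract notes, how well that works depends heavily on the engine's ranking.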
3--The decay and failures of web references
http://doi.acm.org/10.1145/602421.602422
4--Electronic document addressing: dealing with change
http://doi.acm.org/10.1145/367701.367702
This article analyzes the problem of the integrity of electronic documents, in particular, of Web sites. The main problem is that hyperlinks are frequently changed, producing the well-known Error 404. Based on the stunning fact that the average life of a WWW document is only 50 days, the article shows that solutions must be found. Link integrity is identified as the main problem. Eleven solutions for link integrity are then discussed. Each solution has its scope and degree of customer satisfaction. Each solution is presented either by examples of systems which implement it or references to papers where it is discussed in depth. The author has tested each solution, mentioning the advantages and drawbacks. I consider the topic of this paper very interesting, as it concerns a very critical aspect of today's computer usage. The paper gives a broad and interesting description of problems and solutions. It also has many good references. (Online Computing Reviews Service)
http://doi.acm.org/10.1145/276627.276650
6--RepWeb: replicated Web with referential integrity
We propose a system, RepWeb, comprised of an application to access and manage replicated web content and an implementation of an acyclic distributed garbage collection algorithm for wide-area replicated memory, that satisfies all these requirements. It supports replication, enforces referential integrity on the web and minimizes storage waste.
http://doi.acm.org/10.1145/952532.952766
Other non-ACM sites:
http://www.doi.org/overview/sys_overview_021601.html
Introductory Overview of the Digital Object Identifier System
http://citeseer.ist.psu.edu/citeseer.html
CiteSeer is a scientific literature digital library that aims to improve the dissemination and feedback of scientific literature, and to provide improvements in functionality, usability, availability, cost, comprehensiveness, efficiency, and timeliness. Rather than creating just another digital library, CiteSeer provides algorithms, techniques, and software that can be used in other digital libraries. CiteSeer indexes Postscript and PDF research articles on the Web, and provides the following features.
Persistence of Web References in Scientific Research, IEEE Computer, Volume 34, Number 2, pp. 26–31, 2001.
http://www.neci.nec.com/~lawrence/papers/persistence-computer01/persistence-computer01.pdf
A promising effort is the Uniform Resource Name (URN) specification [12], produced by the Internet Engineering Task Force. A URN is a persistent, location-independent identifier which can be used to uniquely identify a resource. The name stays the same even when the location of the resource moves. Implementations of URNs include the Persistent Uniform Resource Locator (PURL) [11] system, and the Handle [1] system.
D-Lib Magazine, January 2002, Volume 8 Number 1
Object Persistence and Availability in Digital Libraries
http://www.dlib.org/dlib/january02/nelson/01nelson.html
Abstract
We have studied object persistence and availability of 1,000 digital library (DL) objects. Twenty World Wide Web accessible DLs were chosen and from each DL, 50 objects were chosen at random. A script checked the availability of each object three times a week for just over 1 year for a total of 161 data samples.
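A checking script of the kind the abstract describes might look roughly like the Python sketch below. This is my own guess at the general shape, not the authors' script: it sends a HEAD request to each URL and treats any 2xx response as "available." The function names, the 10-second timeout, and the 2xx rule are assumptions, and the scheduling (three times a week) is left out.

```python
import urllib.error
import urllib.request

def check_url(url, timeout=10, opener=urllib.request.urlopen):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a broken link
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, etc.

def availability(urls, check=check_url):
    """Map each URL to True when its check returns a 2xx status."""
    results = {}
    for url in urls:
        status = check(url)
        results[url] = status is not None and 200 <= status < 300
    return results
```

Calling `availability(list_of_urls)` on a schedule and logging the results would give a series of samples like the one in the study. Some servers reject HEAD requests, so a real script might fall back to GET.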
Link Accessibility in Electronic Journal Articles
http://www.cs.cornell.edu/bergmark/LinkAccessibility/paper/
Donna Bergmark
Cornell Digital Library Research Group
March 31, 2000
Archives, PURLS, DOIs, "GURL"s, and caches
http://books.valdosta.edu/mlis/govdoc/GOVDOCS/Archives.htm
And, then, of course is