Scott Fields to Terry Brooks, Autumn 2004

Results of searches:

 

1. Googlebot tracking

 

In the ACM portal, I found no citations that were right on your topic, only some related to “query log analysis,” “web usage mining,” and “query processing.”

There are 8 citations in my binder for you to review.  The closest one is this—

Discovery of Web Robot Sessions Based on their Navigational Patterns

10.1023/A:1013228602957

 

On the web, most of what I found is discussion in web designers' online forums about search engine optimization (SEO) and tracking Googlebot with software like SpyderTrax and RobotStats.  There were also a few related articles in Search Engine Watch.

 

SpiderSpotting: When A Search Engine, Robot Or Crawler Visits

http://searchenginewatch.com/webmasters/article.php/2168001

 

Search Engines and Web Server Issues

http://searchenginewatch.com/searchday/article.php/3109261

 

Measuring Search Engine Success (search engine optimization)

http://searchenginewatch.com/searchday/article.php/3323241

 

Some web robot tracking software vendors—

CJ GoogleBot Activity  http://www.cj-design.com/index.php?id=downloads&page=13

RobotStats  http://www.robotstats.com/en/

SpyderTrax (Googletrax)  http://www.darrinward.com/

 

Googlebot FAQs    http://www.google.com/bot.html

Web robots FAQs  http://www.robotstxt.org/wc/faq.html
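
As a rough illustration of what these tracking tools do, here is a minimal sketch that counts Googlebot requests in a web server log. It assumes the server writes Apache-style combined log lines; the sample lines, the IP addresses, and the name googlebot_hits are my own inventions for illustration, not taken from any of the products above. A user-agent string can be faked, so this only shows visits that identify themselves as Googlebot.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Count requested paths on lines whose user-agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        if "googlebot" not in match.group("agent").lower():
            continue
        parts = match.group("request").split()
        path = parts[1] if len(parts) >= 2 else "?"
        hits[path] += 1
    return hits


# Made-up sample lines (documentation-range IP addresses).
SAMPLE = [
    '192.0.2.10 - - [01/Nov/2004:10:00:00 -0800] "GET /index.html HTTP/1.0" '
    '200 1043 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.10 - - [01/Nov/2004:10:00:05 -0800] "GET /robots.txt HTTP/1.0" '
    '200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '198.51.100.7 - - [01/Nov/2004:10:01:00 -0800] "GET /index.html HTTP/1.1" '
    '200 1043 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    'this line is not a log entry',
]

for path, count in sorted(googlebot_hits(SAMPLE).items()):
    print(path, count)
```

Matching on the user-agent is the simplest approach; the robot-detection paper cited above works from navigational patterns instead, which this sketch does not attempt.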

 

Some of the index terms and keywords I used in the ACM portal—

track* Googlebot

Googlebot activity

web traffic analy*

search engine optimization

search process

query log analysis

analysis web quer*

google query log

data mining

web robot detection

web usage mining

query processing

query AND optimization

 

 

2.  Link rot, broken links, web decay, etc.

 

Nov. 1

 

Terry,

There is a lot of research on link rot/web decay/link reliability, etc.  I found no consensus on solving the problem, but lots of ideas.

I placed 14 articles in a binder in the ACM portal.

http://portal.acm.org/

 

Some of the ACM keywords:
404 return code, link analysis, web decay, broken URLs, dead links, lexical signatures, robust hyperlinks, reliability, 404, link, link integrity

 

Some of these articles are listed below.

 

1--Link analysis: Sic transit gloria telae: towards an understanding of the web's decay

Ziv Bar-Yossef, Andrei Z. Broder, Ravi Kumar, Andrew Tomkins

May 2004

 

Proceedings of the 13th international conference on World Wide Web

http://doi.acm.org/10.1145/988672.988716

ABSTRACT

The rapid growth of the web has been noted and tracked extensively. Recent studies have however documented the dual phenomenon: web pages have small half lives, and thus the web exhibits rapid death as well. Consequently, page creators are faced with an increasingly burdensome task of keeping links up-to-date, and many are falling behind. In addition to just individual pages, collections of pages or even entire neighborhoods of the web exhibit significant decay, rendering them less effective as information resources. Such neighborhoods are identified only by frustrated searchers, seeking a way out of these stale neighborhoods, back to more up-to-date sections of the web; measuring the decay of a page purely on the basis of dead links on the page is too naive to reflect this frustration. In this paper we formalize a strong notion of a decay measure and present algorithms for computing it efficiently. We explore this measure by presenting a number of validations, and use it to identify interesting artifacts on today's web. We then describe a number of applications of such a measure to search engines, web page maintainers, ontologists, and individual users.

 

2--Web Information Retrieval: Analysis of lexical signatures for finding lost or related documents

Seung-Taek Park, David M. Pennock, C. Lee Giles, Robert Krovetz

August 2002

 

Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval

http://doi.acm.org/10.1145/564376.564381

ABSTRACT

A lexical signature of a web page is often sufficient for finding the page, even if its URL has changed. We conduct a large-scale empirical study of eight methods for generating lexical signatures, including Phelps and Wilensky's [14] original proposal (PW) and seven of our own variations. We examine their performance on the web and on a TREC data set, evaluating their ability both to uniquely identify the original document and to locate other relevant documents if the original is lost. Lexical signatures chosen to minimize document frequency (DF) are good at unique identification but poor at finding relevant documents. PW works well on the relatively small TREC data set, but acts almost identically to DF on the web, which contains billions of documents. Term-frequency-based lexical signatures (TF) are very easy to compute and often perform well, but are highly dependent on the ranking system of the search engine used. In general, TFIDF-based method and hybrid methods (which combine DF with TF or TFIDF) seem to be the most promising candidates for generating effective lexical signatures.
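
To make the idea concrete, here is a minimal sketch of a TF-IDF-style lexical signature: pick the few terms that are frequent in the page but rare elsewhere, then use them as a search query if the URL dies. This is only my illustration, not the authors' code. The tiny corpus, the add-one smoothing, the alphabetical tie-breaking, and the function names are all assumptions; the paper's own methods (PW, DF, TF, and the hybrids) differ in detail.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, k=5):
    """Return the k terms of doc with the highest TF-IDF score."""
    tf = Counter(tokenize(doc))

    # Document frequency: in how many corpus documents each term appears.
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))
    n_docs = len(corpus)

    def score(term):
        # Add-one smoothing (an assumption) avoids division by zero.
        return tf[term] * math.log((n_docs + 1) / (df[term] + 1))

    # Highest score first; ties broken alphabetically for repeatability.
    ranked = sorted(tf, key=lambda term: (-score(term), term))
    return ranked[:k]


# Hypothetical three-document corpus.
CORPUS = [
    "the web is large",
    "the web decays over time",
    "lexical signatures find lost pages on the web",
]

print(lexical_signature(CORPUS[2], CORPUS, k=3))
```

Common words such as "the" and "web" appear in every document here, so they score zero and drop out of the signature.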

 

3--The decay and failures of web references

http://doi.acm.org/10.1145/602421.602422

 

4--Electronic document addressing: dealing with change

http://doi.acm.org/10.1145/367701.367702

This article analyzes the problem of the integrity of electronic documents, in particular, of Web sites. The main problem is that hyperlinks are frequently changed, producing the well-known Error 404. Based on the stunning fact that the average life of a WWW document is only 50 days, the article shows that solutions must be found. Link integrity is identified as the main problem. Eleven solutions for link integrity are then discussed. Each solution has its scope and degree of customer satisfaction. Each solution is presented either by examples of systems which implement it or references to papers where it is discussed in depth. The author has tested each solution, mentioning the advantages and drawbacks. I consider very interesting the topic of this paper, about a very critical aspect of today's computer usage. The paper gives a broad and interesting description of problems and solutions. It has also many good references.
(Review from Online Computing Reviews Service)

 

 

 

5--Referential integrity of links in open hypermedia systems

Hugh C. Davis

May 1998

 

Proceedings of the ninth ACM conference on Hypertext and hypermedia: links, objects, time and space---structure in hypermedia systems

http://doi.acm.org/10.1145/276627.276650

 

 

6--RepWeb: replicated Web with referential integrity

We propose a system, RepWeb, comprised of an application to access and manage replicated web content and an implementation of an acyclic distributed garbage collection algorithm for wide-area replicated memory, that satisfies all these requirements. It supports replication, enforces referential integrity on the web and minimizes storage waste.

http://doi.acm.org/10.1145/952532.952766

 

 

Other non-ACM sites:

 

http://www.doi.org/overview/sys_overview_021601.html

Introductory Overview of the Digital Object Identifier System

 

http://citeseer.ist.psu.edu/citeseer.html

CiteSeer is a scientific literature digital library that aims to improve the dissemination and feedback of scientific literature, and to provide improvements in functionality, usability, availability, cost, comprehensiveness, efficiency, and timeliness.  Rather than creating just another digital library, CiteSeer provides algorithms, techniques, and software that can be used in other digital libraries. CiteSeer indexes Postscript and PDF research articles on the Web.

 

Persistence of Web References in Scientific Research, IEEE Computer, Volume 34, Number 2, pp. 26–31, 2001.

http://www.neci.nec.com/~lawrence/papers/persistence-computer01/persistence-computer01.pdf

A promising effort is the Uniform Resource Name (URN) specification [12], produced by the Internet Engineering Task Force. A URN is a persistent, location-independent identifier which can be used to uniquely identify a resource. The name stays the same even when the location of the resource moves. Implementations of URNs include the Persistent Uniform Resource Locator (PURL) [11] system, and the Handle [1] system.
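
The indirection behind persistent names can be sketched in a few lines: links cite a name, and a resolver maps that name to wherever the resource currently lives. This is a toy illustration under my own assumptions, not how PURL or the Handle system is actually implemented; the class NameResolver, the names, and the URLs are all hypothetical.

```python
class NameResolver:
    """Toy resolver: persistent name -> current location (hypothetical)."""

    def __init__(self):
        self._current = {}

    def register(self, name, url):
        """Bind a name to a location, or update it when the resource moves."""
        self._current[name] = url

    def resolve(self, name):
        """Return the current location, or None if the name is unknown."""
        return self._current.get(name)


resolver = NameResolver()
resolver.register("urn:example:report-42",
                  "http://old.example.org/reports/42.html")

# The resource moves; only the resolver entry changes, not the citing pages.
resolver.register("urn:example:report-42",
                  "http://new.example.org/archive/42.html")

print(resolver.resolve("urn:example:report-42"))
```

The catch, of course, is that someone still has to keep the resolver table up to date when a resource moves.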

 

D-Lib Magazine, January 2002, Volume 8 Number 1
Object Persistence and Availability in Digital Libraries

http://www.dlib.org/dlib/january02/nelson/01nelson.html

Abstract

We have studied object persistence and availability of 1,000 digital library (DL) objects. Twenty World Wide Web accessible DLs were chosen and from each DL, 50 objects were chosen at random. A script checked the availability of each object three times a week for just over 1 year for a total of 161 data samples.
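
A checking script of this kind might look roughly like the sketch below. It is my own guess at the general shape, not the authors' script: the status categories, the function names, and the use of HEAD requests are assumptions, and the scheduling (three times a week) would be left to something like cron.

```python
import urllib.error
import urllib.request


def classify(status):
    """Map an HTTP status code (or None for no response) to a coarse label."""
    if status is None:
        return "unreachable"
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if status in (404, 410):
        return "gone"
    return "error"


def fetch_status(url, timeout=10):
    """Return the HTTP status for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urlopen follows redirects itself, so a 3xx is rarely seen here.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, bad URL


def check_links(urls, fetch=fetch_status):
    """Return a dict of url -> label; fetch can be swapped out for testing."""
    return {url: classify(fetch(url)) for url in urls}
```

For example, check_links(["http://www.archive.org/"]) would return a one-entry dict for that URL. Some servers reject HEAD requests, so a real study would probably need to fall back to GET.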

 

Link Accessibility in Electronic Journal Articles

http://www.cs.cornell.edu/bergmark/LinkAccessibility/paper/

Donna Bergmark
Cornell Digital Library Research Group

March 31, 2000

 

Archives, PURLS, DOIs, "GURL"s, and caches

http://books.valdosta.edu/mlis/govdoc/GOVDOCS/Archives.htm

 

 

And then, of course, there is the Internet Archive:

http://www.archive.org/