TEL-8 Query Interfaces

http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8/
May 2003


1. Overview

This dataset collects the original and manually extracted query interfaces of 447 deep Web sources from 8 representative domains. The domains form 3 groups whose initials, "TEL", give the dataset its name: the Travel group (Airfares, Hotels, and Car Rentals), the Entertainment group (Books, Movies, and Music Records), and the Living group (Jobs and Automobiles). For each source, the dataset archives cached copies of both the root homepage and the query interface pages, together with links to their original URLs. It also contains the manually extracted query capabilities presented in the interfaces of the sources. Note that the dataset provides the cached pages as a static "snapshot" for experimental study. The content at an original URL may have changed since archiving; the URL is provided only for reference and to indicate the original source of the archived data.

2. Creation

The dataset was originally collected in December 2002 and was inspected and revised several times until May 2003. It was collected manually from Web directories (e.g., www.invisibleweb.com, now www.profusion.com; www.completeplanet.com; dir.yahoo.com) and general search engines (e.g., www.google.com). A source is considered a deep Web source if it provides structured information by accepting queries over the attributes on its query interface. The sources were manually inspected and classified into the 8 domains. For each source, its homepage (normally the root page of the Web site) and the pages containing query interfaces were cached. Some sources have multiple query interfaces. For each interface, the query capabilities it presents were manually extracted.

Original Owners

Kevin Chen-Chuan Chang, Bin He, Chengkai Li, and Zhen Zhang
Computer Science Department
University of Illinois at Urbana-Champaign
binhe[at]uiuc.edu

Date Created: May 2003

3. The Dataset

3.1. Summary

The domains, number of sources in each domain, and number of query interfaces in each domain are summarized below.

Domain          # of sources   # of interfaces
Airfares        47             49
Automobiles     84             97
Books           65             67
Car Rentals     25             25
Hotels          39             39
Jobs            49             52
Movies          73             78
Music Records   65             70
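
As a quick consistency check on the summary table, the per-domain counts can be totalled. The sketch below is only an illustration: the dictionary and function names are made up for this example and are not part of the dataset. The source total should match the 447 sources stated in the Overview.

```python
# Per-domain counts copied from the summary table: (sources, interfaces).
COUNTS = {
    "Airfares": (47, 49),
    "Automobiles": (84, 97),
    "Books": (65, 67),
    "Car Rentals": (25, 25),
    "Hotels": (39, 39),
    "Jobs": (49, 52),
    "Movies": (73, 78),
    "Music Records": (65, 70),
}


def totals(counts):
    """Return (total sources, total interfaces) over all domains."""
    sources = sum(s for s, _ in counts.values())
    interfaces = sum(i for _, i in counts.values())
    return sources, interfaces


def extra_interfaces(counts):
    """Per domain, the number of interfaces beyond one per source."""
    return {domain: i - s for domain, (s, i) in counts.items()}


if __name__ == "__main__":
    print(totals(COUNTS))
    print(extra_interfaces(COUNTS))
```

The difference between the two columns reflects the sources that have more than one query interface.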

3.2. Browsable Dataset

The dataset is browsable through http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8/browsable.html.

The 8 domains are listed on the above page, with links to the respective domain pages, where the sources in each domain are listed. For each source, this dataset archives both its root homepage and its query-interface pages. A source has one root homepage and possibly multiple interface pages. Each archived page is linked to both its archived (i.e., "cached") copy and its original URL. So, a source has several links as follows:

Homepage URL: link to the original URL of the root homepage of the source;
Homepage Cached: the archived root homepage of the source;
Interface URL: link to the original URL of a query-interface page of the source;
Interface Cached: the archived copy of the query-interface page.

3.3. Downloadable Dataset

The dataset can be downloaded as two separate tar balls. The Interface Archive contains all the pages presented in the Browsable Dataset. The Query Capability Archive contains all the manually extracted query capabilities.

1). Interface Archive

The browsable dataset can also be downloaded as a gzipped tar ball from http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8/interface_downloadable.tar.gz. After decompressing and untarring, the directory structure for each domain is as follows.

domain_name.xml: the XML file describing the name, homepage and interface pages of the corresponding domain;
domain_name.htm: the HTML file of the web page for the corresponding domain. This is generated from domain_name.xml;
homecached: the directory containing the cached homepages (and related files) of every source;
interfacecached: the directory containing the cached interface pages (and related files) of every source.
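
The layout above can be explored with a short script. This is a minimal sketch, assuming the tar ball has already been downloaded to a local path and that each domain's files sit together in one directory; the function names are hypothetical and the exact nesting inside the archive is not confirmed here.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a gzipped tar ball into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest


def find_domains(root):
    """Yield (xml_path, homecached_dir, interfacecached_dir) per domain.

    Assumption: a domain directory is any directory holding an XML file
    next to both a homecached and an interfacecached subdirectory.
    """
    for xml_path in sorted(Path(root).rglob("*.xml")):
        home = xml_path.parent / "homecached"
        interface = xml_path.parent / "interfacecached"
        if home.is_dir() and interface.is_dir():
            yield xml_path, home, interface
```

A call such as extract_archive("interface_downloadable.tar.gz", "tel8") followed by find_domains("tel8") would then list one entry per domain directory found.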

The DTD of domain_name.xml is as below:

<!ELEMENT domain (domainname, compiler, contributors, source+)>
<!ELEMENT source (srcname, srcurl, srcinterface+, browsable, browsedepth)>
<!ELEMENT domainname (#PCDATA)>
<!ELEMENT compiler (#PCDATA)>
<!ELEMENT contributors (#PCDATA)>
<!ELEMENT srcname (#PCDATA)>
<!ELEMENT srcurl (#PCDATA)>
<!ELEMENT srcinterface (#PCDATA)>
<!ELEMENT browsable ("yes"|"no")>
<!ELEMENT browsedepth (#PCDATA)>
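
A file that follows this DTD can be read with a standard XML parser. The sketch below is illustrative only: the sample document, its URLs, and the function name are invented for this example, and it assumes that srcurl holds the homepage URL and each srcinterface holds one interface-page URL. The browsedepth value is kept as a raw string because its meaning is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped after the DTD above; not taken from the dataset.
SAMPLE_DOMAIN_XML = """<domain>
  <domainname>Books</domainname>
  <compiler>example compiler</compiler>
  <contributors>example contributors</contributors>
  <source>
    <srcname>Example Bookstore</srcname>
    <srcurl>http://www.example.com/</srcurl>
    <srcinterface>http://www.example.com/search</srcinterface>
    <srcinterface>http://www.example.com/advanced</srcinterface>
    <browsable>yes</browsable>
    <browsedepth>2</browsedepth>
  </source>
</domain>"""


def read_sources(xml_text):
    """Return (domain name, list of per-source dictionaries)."""
    root = ET.fromstring(xml_text)
    domain = root.findtext("domainname", default="").strip()
    sources = []
    for src in root.findall("source"):
        sources.append({
            "name": src.findtext("srcname", default="").strip(),
            "homepage": src.findtext("srcurl", default="").strip(),
            # A source may list several interface pages.
            "interfaces": [e.text.strip()
                           for e in src.findall("srcinterface") if e.text],
            "browsable": src.findtext("browsable", default="").strip() == "yes",
            "browsedepth": src.findtext("browsedepth", default="").strip(),
        })
    return domain, sources


if __name__ == "__main__":
    print(read_sources(SAMPLE_DOMAIN_XML))
```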

The elements are explained below:

2). Query Capability Archive

The dataset contains manually extracted query capabilities for all interfaces in the Interface Archive, downloadable as a gzipped tar ball from http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8/capability_downloadable.tar.gz. After decompressing and untarring, the directory contains 8 files, one per domain, each named after the corresponding domain. Each file contains the manually extracted query capabilities for the interfaces in that domain, as classified in the Interface Archive. For each interface, we record the set of query forms presented in the interface page. A form (represented as a form element) consists of a set of attribute groups (represented as attrgroup elements). An attribute group represents a set of exclusively queried attributes, of which the interface supports only one at a time. For each attribute (represented as an attr element), we record

In particular, the DTD of these files is as below:

<!ELEMENT query_vocabulary (domainname, compiler, contributors, source+)>
<!ELEMENT source (srcname, form+)>
<!ATTLIST source   url   CDATA   #REQUIRED>
<!ATTLIST source   interurl   CDATA   #REQUIRED>
<!ELEMENT form (attrgroup+)>
<!ELEMENT attrgroup (attr+)>
<!ELEMENT attr (input+)>
<!ATTLIST attr   name   CDATA   #REQUIRED>
<!ATTLIST attr   ename   CDATA   #IMPLIED>
<!ELEMENT input (operator, domain)>
<!ELEMENT operator (option*)>
<!ATTLIST operator   format   CDATA   #IMPLIED >
<!ATTLIST operator   name CDATA #IMPLIED >
<!ATTLIST operator   ename  CDATA   #IMPLIED >
<!ELEMENT domain (value+)>
<!ATTLIST domain   format   CDATA   #REQUIRED >
<!ELEMENT domainname (#PCDATA)>
<!ELEMENT compiler (#PCDATA)>
<!ELEMENT contributors (#PCDATA)>
<!ELEMENT srcname (#PCDATA)>
<!ELEMENT option (#PCDATA)>
<!ELEMENT value (#PCDATA)>
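
The nesting of source, form, attrgroup, and attr in this DTD can be walked with a standard XML parser. The following is a minimal sketch under stated assumptions: the sample document, its attribute names, and the format values are placeholders invented for illustration, and the function name is hypothetical. It lists, for each source, the attribute names in each attribute group of each form.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped after the DTD above; the format values and
# attribute names are placeholders, not values taken from the dataset.
SAMPLE_CAPABILITY_XML = """<query_vocabulary>
  <domainname>Books</domainname>
  <compiler>example compiler</compiler>
  <contributors>example contributors</contributors>
  <source url="http://www.example.com/"
          interurl="http://www.example.com/search">
    <srcname>Example Bookstore</srcname>
    <form>
      <attrgroup>
        <attr name="Title">
          <input>
            <operator><option>contains</option></operator>
            <domain format="text"><value></value></domain>
          </input>
        </attr>
        <attr name="Author">
          <input>
            <operator></operator>
            <domain format="text"><value></value></domain>
          </input>
        </attr>
      </attrgroup>
      <attrgroup>
        <attr name="ISBN">
          <input>
            <operator></operator>
            <domain format="text"><value></value></domain>
          </input>
        </attr>
      </attrgroup>
    </form>
  </source>
</query_vocabulary>"""


def attribute_groups(xml_text):
    """Map each source name to its forms.

    Each form is a list of attribute groups, and each group is a list of
    attribute names (the name attribute of each attr element).
    """
    root = ET.fromstring(xml_text)
    result = {}
    for src in root.findall("source"):
        name = src.findtext("srcname", default="").strip()
        forms = []
        for form in src.findall("form"):
            groups = [[attr.get("name") for attr in group.findall("attr")]
                      for group in form.findall("attrgroup")]
            forms.append(groups)
        result[name] = forms
    return result


if __name__ == "__main__":
    print(attribute_groups(SAMPLE_CAPABILITY_XML))
```

In this sample, Title and Author share one attribute group, so by the definition above the interface would accept only one of them at a time, while ISBN forms a group of its own.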

The elements are explained below:

4. Past Usage

5. Acknowledgement

The creation of this dataset was partially supported by grants from the National Science Foundation and NCSA.