http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8/
May 2003
Kevin Chen-Chuan Chang, Bin He, Chengkai Li, and Zhen Zhang
Computer Science Department
University of Illinois at Urbana-Champaign
binhe[at]uiuc.edu
The domains, number of sources in each domain, and number of query interfaces in each domain are summarized below.
| Domain | # of sources | # of interfaces |
| Airfares | 47 | 49 |
| Automobiles | 84 | 97 |
| Books | 65 | 67 |
| Car Rentals | 25 | 25 |
| Hotels | 39 | 39 |
| Jobs | 49 | 52 |
| Movies | 73 | 78 |
| Music Records | 65 | 70 |
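As a quick sanity check on the table above, the per-domain counts can be summed. This is only a sketch: the dictionary is copied by hand from the table, and the names `TEL8_COUNTS`, `totals`, and `multi_interface_domains` are illustrative, not part of the dataset.

```python
# Per-domain counts copied from the table above: (sources, interfaces).
TEL8_COUNTS = {
    "Airfares": (47, 49),
    "Automobiles": (84, 97),
    "Books": (65, 67),
    "Car Rentals": (25, 25),
    "Hotels": (39, 39),
    "Jobs": (49, 52),
    "Movies": (73, 78),
    "Music Records": (65, 70),
}


def totals(counts):
    """Return (total sources, total interfaces) over all domains."""
    total_sources = sum(sources for sources, _ in counts.values())
    total_interfaces = sum(interfaces for _, interfaces in counts.values())
    return total_sources, total_interfaces


def multi_interface_domains(counts):
    """Domains with more interfaces than sources, i.e. where some
    source must contribute more than one interface page."""
    return [name for name, (s, i) in counts.items() if i > s]


print(totals(TEL8_COUNTS))  # (447, 477)
print(multi_interface_domains(TEL8_COUNTS))
```

Under these counts, the 8 domains together cover 447 sources and 477 query interfaces; only Car Rentals and Hotels have exactly one interface per source.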
The dataset is browsable through http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8/browsable.html.
The 8 domains are listed on the above page, with links to their respective domain pages, where the sources in each domain are listed. For each source, this dataset archives both its root homepage and its query-interface pages. A source has one root homepage and possibly multiple interface pages. Each archived page is linked both as an archived (i.e., "cached") copy and by its original URL, so a source has several links, as follows:
Homepage URL: link to the original URL of the root homepage of the source;
Homepage Cached: the archived root homepage of the source;
Interface URL: link to the original URL of a query-interface page of the source;
Interface Cached: the archived copy of the query-interface page.
The browsable dataset can also be downloaded as a gzipped tarball from http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8/interface_downloadable.tar.gz. After decompressing and untarring it, the directory structure for each domain is as follows.
domain_name.xml: the XML file describing the name, homepage, and interface pages of each source in the corresponding domain;
domain_name.htm: the HTML file of the web page for the corresponding domain, generated from domain_name.xml;
homecached: the directory containing the cached homepages (and related files) of every source;
interfacecached: the directory containing the cached interface pages (and related files) of every source.
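The unpacking step and the layout above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the archive is already downloaded to a local path, each domain lives in its own directory, and the XML and HTML files are named after the domain. The helper names `extract_archive` and `missing_entries` are hypothetical, and the real directory and file names inside the tarball may differ.

```python
import tarfile
from pathlib import Path

# Entries the text lists for each domain; the file stem is a placeholder.
EXPECTED_SUFFIXES = (".xml", ".htm")
EXPECTED_DIRS = ("homecached", "interfacecached")


def extract_archive(archive_path, dest_dir):
    """Unpack a gzipped tarball into dest_dir and return dest_dir as a Path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Use the safer "data" extraction filter where this Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
    return dest


def missing_entries(domain_dir, domain_name):
    """Return the expected entries that are absent from one domain directory."""
    root = Path(domain_dir)
    missing = []
    for suffix in EXPECTED_SUFFIXES:
        if not (root / (domain_name + suffix)).is_file():
            missing.append(domain_name + suffix)
    for name in EXPECTED_DIRS:
        if not (root / name).is_dir():
            missing.append(name)
    return missing
```

Calling `missing_entries` on each extracted domain directory gives a quick way to confirm that a download is complete before processing it further.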
The DTD of domain_name.xml is as below:
<!ELEMENT domain (domainname, compiler, contributors, source+)>
<!ELEMENT source (srcname, srcurl, srcinterface+, browsable, browsedepth)>
<!ELEMENT domainname (#PCDATA)>
<!ELEMENT compiler (#PCDATA)>
<!ELEMENT contributors (#PCDATA)>
<!ELEMENT srcname (#PCDATA)>
<!ELEMENT srcurl (#PCDATA)>
<!ELEMENT srcinterface (#PCDATA)>
<!ELEMENT browsable (#PCDATA)> <!-- value is "yes" or "no" -->
<!ELEMENT browsedepth (#PCDATA)>
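A file following the DTD above could be read with Python's standard `xml.etree.ElementTree`, roughly as sketched here. The sample document is hypothetical (its names and URLs are made up for illustration), `parse_domain` is an illustrative helper, and `browsedepth` is kept as a string because the DTD only declares it as character data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped after the DTD above; names and URLs are made up.
SAMPLE_DOMAIN_XML = """
<domain>
  <domainname>Books</domainname>
  <compiler>example compiler</compiler>
  <contributors>example contributors</contributors>
  <source>
    <srcname>Example Bookstore</srcname>
    <srcurl>http://www.example.com/</srcurl>
    <srcinterface>http://www.example.com/search</srcinterface>
    <srcinterface>http://www.example.com/advanced</srcinterface>
    <browsable>yes</browsable>
    <browsedepth>2</browsedepth>
  </source>
  <source>
    <srcname>Example Library</srcname>
    <srcurl>http://library.example.org/</srcurl>
    <srcinterface>http://library.example.org/find</srcinterface>
    <browsable>no</browsable>
    <browsedepth>0</browsedepth>
  </source>
</domain>
"""


def parse_domain(xml_text):
    """Turn one domain document into a plain dictionary."""
    root = ET.fromstring(xml_text)
    sources = []
    for src in root.findall("source"):
        sources.append({
            "name": src.findtext("srcname", default="").strip(),
            "url": src.findtext("srcurl", default="").strip(),
            # srcinterface+ : a source has one or more interface pages.
            "interfaces": [(e.text or "").strip()
                           for e in src.findall("srcinterface")],
            "browsable": src.findtext("browsable", default="").strip().lower() == "yes",
            "browsedepth": src.findtext("browsedepth", default="").strip(),
        })
    return {
        "domain": root.findtext("domainname", default="").strip(),
        "sources": sources,
    }


info = parse_domain(SAMPLE_DOMAIN_XML)
for source in info["sources"]:
    print(source["name"], len(source["interfaces"]))
```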
The elements are explained below:
The dataset contains manually extracted query capabilities for all interfaces in the Interface Archive, downloadable as a gzipped tarball from http://metaquerier.cs.uiuc.edu/repository/datasets/tel-8/capability_downloadable.tar.gz. After decompressing and untarring it, the directory contains 8 files, one per domain, each named after its domain. Each file contains the manually extracted query capabilities for the interfaces in the corresponding domain, as classified in the Interface Archive. For each interface, we record the set of query forms presented on the interface page. A form (represented as a form element) consists of a set of attribute groups (represented as attrgroup elements); an attribute group represents a set of exclusively queried attributes, of which the interface supports only one at a time. For each attribute (represented as an attr element), we record its name and one or more inputs, each consisting of an operator and a domain, as specified in the DTD below:
<!ELEMENT query_vocabulary (domainname, compiler, contributors, source+)>
<!ELEMENT source (srcname, form+)>
<!ATTLIST source
url CDATA #REQUIRED>
<!ATTLIST source
interurl CDATA #REQUIRED>
<!ELEMENT form (attrgroup+)>
<!ELEMENT attrgroup (attr+)>
<!ELEMENT attr (input+)>
<!ATTLIST attr
name CDATA #REQUIRED>
<!ATTLIST attr
ename CDATA #IMPLIED>
<!ELEMENT input (operator, domain)>
<!ELEMENT operator (option*)>
<!ATTLIST operator
format CDATA #IMPLIED >
<!ATTLIST operator
name CDATA #IMPLIED >
<!ATTLIST operator
ename CDATA #IMPLIED >
<!ELEMENT domain (value+)>
<!ATTLIST domain
format CDATA #REQUIRED >
<!ELEMENT domainname (#PCDATA)>
<!ELEMENT compiler (#PCDATA)>
<!ELEMENT contributors (#PCDATA)>
<!ELEMENT srcname (#PCDATA)>
<!ELEMENT option (#PCDATA)>
<!ELEMENT value (#PCDATA)>
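The nesting in the DTD above (source, form, attrgroup, attr, input) can be flattened into one row per attribute, as in the sketch below. The sample document is hypothetical: the source, attribute names, option texts, and `format` values are invented for illustration and are not taken from the dataset. `list_attributes` is likewise an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped after the DTD above; all content is made up.
SAMPLE_CAPABILITY_XML = """
<query_vocabulary>
  <domainname>Books</domainname>
  <compiler>example compiler</compiler>
  <contributors>example contributors</contributors>
  <source url="http://www.example.com/" interurl="http://www.example.com/search">
    <srcname>Example Bookstore</srcname>
    <form>
      <attrgroup>
        <attr name="Title">
          <input>
            <operator format="select">
              <option>contains</option>
              <option>starts with</option>
            </operator>
            <domain format="text"><value></value></domain>
          </input>
        </attr>
        <attr name="Author">
          <input>
            <operator></operator>
            <domain format="text"><value></value></domain>
          </input>
        </attr>
      </attrgroup>
      <attrgroup>
        <attr name="Format">
          <input>
            <operator></operator>
            <domain format="select">
              <value>hardcover</value>
              <value>paperback</value>
            </domain>
          </input>
        </attr>
      </attrgroup>
    </form>
  </source>
</query_vocabulary>
"""


def list_attributes(xml_text):
    """Return one row per attr element, with its form and group position."""
    root = ET.fromstring(xml_text)
    rows = []
    for source in root.findall("source"):
        src_name = source.findtext("srcname", default="").strip()
        for form_no, form in enumerate(source.findall("form"), start=1):
            for group_no, group in enumerate(form.findall("attrgroup"), start=1):
                for attr in group.findall("attr"):
                    # Collect option and value texts across all inputs,
                    # skipping empty placeholders.
                    options = [(o.text or "").strip()
                               for o in attr.findall("input/operator/option")]
                    values = [(v.text or "").strip()
                              for v in attr.findall("input/domain/value")]
                    rows.append({
                        "source": src_name,
                        "interurl": source.get("interurl"),
                        "form": form_no,
                        "group": group_no,
                        "attr": attr.get("name"),
                        "operators": [o for o in options if o],
                        "values": [v for v in values if v],
                    })
    return rows


for row in list_attributes(SAMPLE_CAPABILITY_XML):
    print(row["form"], row["group"], row["attr"])
```

Attributes that share a `group` number belong to the same attrgroup, so under the description above they are alternatives that the interface accepts one at a time.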
The elements are explained below: