BAMM Extracted Query Schemas
This dataset is a collection of deep Web sources in 4 representative
domains: Books, Automobiles, Movies, and Music Records (and thus the name BAMM). Each domain contains about 50 sources. For each source, only
the attribute names in the Web query interfaces are extracted to form its "query
schema." BAMM does not
contain other attribute information such as attribute types or values.
The dataset was originally constructed in
November 2002 by manual collection from Web directories
such as www.invisibleweb.com (now www.profusion.com) and dir.yahoo.com. A source is
considered to be a deep Web source if it provides structured information by
accepting queries over the attributes on its query interfaces. For each query
interface, only attribute names are extracted. For instance, for the advanced
book query of amazon.com, we extract the attribute names author, title, subject, ISBN, publisher, ...
Bin He and Kevin Chen-Chuan Chang
Computer Science Department
University of Illinois at Urbana-Champaign
Date Created: November 2002
3. The Dataset
The domains and the number of sources in each domain are summarized below:
Domain | # of sources
3.2. Browsable Dataset
The dataset is browsable through
The four domains are listed on the above page. Following the link for a
domain leads to the extracted attribute names of each source in that domain.
3.3. Downloadable Dataset
The whole dataset can be downloaded as a gzipped tarball from
This tarball contains four domain files, one per domain.
Each domain file contains a set of sources. Each source is introduced by a
<SCHEMA> marker, followed by its set of attribute names. For instance, below is
a sample domain file with two sources.
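As an illustration of the layout described above, the following is a minimal Python sketch for reading a domain file into a list of query schemas. It assumes one attribute name per line after each <SCHEMA> marker, which this description does not state explicitly. The function name parse_domain_file and the SAMPLE text are hypothetical: the attribute names are borrowed from the amazon.com example earlier in this document and are not copied from the actual dataset files.

```python
from typing import List, Optional


def parse_domain_file(text: str) -> List[List[str]]:
    """Split the text of a domain file into sources.

    Each source is returned as a list of attribute names.
    Assumption: a source starts at a line containing only <SCHEMA>,
    and each following non-empty line holds one attribute name.
    """
    sources: List[List[str]] = []
    current: Optional[List[str]] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.upper() == "<SCHEMA>":
            current = []  # a new source begins here
            sources.append(current)
        elif current is not None:
            current.append(line)  # attribute name of the current source
        # lines before the first <SCHEMA> marker are ignored
    return sources


# Hypothetical sample with two sources (not taken from the real dataset).
SAMPLE = """\
<SCHEMA>
author
title
ISBN
<SCHEMA>
title
publisher
"""

schemas = parse_domain_file(SAMPLE)
for number, attributes in enumerate(schemas, start=1):
    print(f"source {number}: {', '.join(attributes)}")
```

If the real files separate attribute names differently (for example, several names on one line), only the branch that appends to the current source would need to change.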
4. Past Usage
The creation of this dataset was partially supported by grants from the National Science
Foundation and NCSA.