BAMM Extracted Query Schemas
http://metaquerier.cs.uiuc.edu/repository/datasets/bamm
November 2002
1. Overview
This dataset is a collection of deep Web sources in 4 representative
domains: Books, Automobiles, Movies, and Music Records (hence the name BAMM). Each domain contains about 50 sources. For each source, only
the attribute names in its Web query interfaces are extracted to form its "query
schema." BAMM does not
contain other attribute information, such as attribute types or values.
2. Creation
The dataset was originally constructed in
November 2002 by manual collection from Web directories
(e.g., www.invisibleweb.com, now www.profusion.com, and dir.yahoo.com). A source is
considered to be a deep Web source if it provides structured information by
accepting queries over the attributes on its query interfaces. For each query
interface, only attribute names are extracted. For instance, for the advanced
book query of amazon.com, we extract its attribute names as author, title, subject, ISBN, publisher, ....
Original Owners
Bin He and Kevin Chen-Chuan Chang
Computer Science Department
University of Illinois at Urbana-Champaign
binhe[at]uiuc.edu
Date Created: November 2002
3. The Dataset
3.1. Summary
The domains and the number of sources in each domain are summarized below:
| Domain        | # of sources |
| Books         | 55           |
| Automobiles   | 55           |
| Movies        | 52           |
| Music Records | 49           |
3.2. Browsable Dataset
The dataset is browsable through
http://metaquerier.cs.uiuc.edu/repository/datasets/bamm/browsable.html.
The 4 domains are listed on that page. Each domain link leads to a page
listing the extracted attribute names for each source in that domain.
3.3. Downloadable Dataset
The whole dataset can be downloaded as a gzipped tarball from
http://metaquerier.cs.uiuc.edu/repository/dataset/bamm/downloadable.tar.gz.
This tarball contains four domain files, one per domain.
Each domain file contains a set of sources. Each source begins with a
<SCHEMA> line, followed by its attribute names, one per line. For instance, below is
a sample domain file with two sources.
<SCHEMA>
Title
Author
ISBN
Publisher
Price
<SCHEMA>
Title
Author
Price
Binding
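The format above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not part of the dataset itself. The function name parse_domain_file is hypothetical, and the handling of blank lines and of any lines appearing before the first <SCHEMA> marker is an assumption, since the description above does not cover those cases.

```python
from typing import Iterable, List

SCHEMA_MARKER = "<SCHEMA>"


def parse_domain_file(lines: Iterable[str]) -> List[List[str]]:
    """Split the lines of one domain file into per-source attribute lists."""
    schemas: List[List[str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        if line == SCHEMA_MARKER:
            schemas.append([])  # each marker starts a new source
        elif schemas:
            schemas[-1].append(line)  # attribute name of the current source
        # assumption: lines before the first marker are ignored
    return schemas


# The sample domain file shown above, with two sources.
sample = """<SCHEMA>
Title
Author
ISBN
Publisher
Price
<SCHEMA>
Title
Author
Price
Binding
"""

schemas = parse_domain_file(sample.splitlines())
print(len(schemas))   # 2
print(schemas[1])     # ['Title', 'Author', 'Price', 'Binding']
```

To read an actual domain file, the same function can be given an open file object, since iterating over a text file yields its lines.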
4. Past Usage
5. Acknowledgement
The creation of this dataset was partially supported by grants from the National Science
Foundation and NCSA.