The UIUC Web Integration Repository
Overview
Over the past few years, the Web has deepened dramatically: a
significant and increasing amount of structured information is hidden on the
"deep Web," behind the query interfaces of searchable databases. The goal of
this repository is to provide both datasets and tasks to support research on
exploring and integrating structured information and searchable
databases on the Web. The repository was originally constructed in the MetaQuerier project at the University
of Illinois at Urbana-Champaign and has since been extended with datasets contributed
by the research community. We are continuing to build up the repository and
welcome donations of additional data.
This repository contains both datasets and their related tasks.
On one hand, each dataset in this repository archives a set of data
pertinent to Web integration research on structured information and
searchable databases on the Web. On the other hand, each task documents a
function pertinent to such integration (e.g., schema matching) that
has been studied and performed over some of the datasets. The datasets and their
associated tasks are interlinked for cross-reference.
Acknowledgement: The creation of this repository is partially
supported by grants from the National Science
Foundation and NCSA.
Datasets
- TEL-8 Query Interfaces:
This dataset collects the original query interfaces of 447 deep Web sources from 8 representative domains, organized into 3 groups whose initials, "TEL", give the dataset its name: the Travel group (Airfares, Hotels, and Car Rentals), the Entertainment group (Books, Movies, and Music Records), and the Living group (Jobs and Automobiles).
Created in May 2003.
- BAMM Extracted Query Schemas:
This dataset contains manually extracted attributes
from Web query interfaces in 4 domains: Books, Automobiles,
Movies, and Music Records. Created in November 2002.
- ICQ Query Interfaces:
This dataset contains manually extracted query interfaces from five domains:
Airfare, Automobile, Book, Job, and Real Estate. The extracted interfaces
are represented by ordered trees.
- IWRandom:
This dataset contains 33 original query interfaces randomly sampled from 16 top-level domains from Invisible-Web.net. Created in November 2003.
- OntoBuilder:
This dataset contains more than 100 ontologies in 14 domains. The ontologies are given in XML format conforming to a DTD that is also available in the dataset. The dataset was contributed by Avigdor Gal et al. from the Technion, Israel Institute of Technology.
Tasks
- Schema Matching:
This task matches query schemas across different Web query interfaces for sources in the
same domain.
- Query Capabilities Extraction:
This task extracts the query capabilities of
searchable databases from their Web query interfaces.
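To make the schema matching task more concrete, here is a minimal sketch in Python. It is only an illustration and not the method used by any task in this repository: the source names, attribute labels, and the helper functions `normalize` and `group_attributes` are all hypothetical, and the matching rule (identical labels after simple normalization) is a deliberately naive baseline.

```python
import re
from collections import defaultdict


def normalize(label):
    """Lowercase a label and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def group_attributes(interfaces):
    """Group attributes from different sources whose normalized labels are identical.

    `interfaces` maps a (hypothetical) source name to the list of attribute
    labels found on its query interface.
    """
    groups = defaultdict(list)
    for source, labels in interfaces.items():
        for label in labels:
            groups[normalize(label)].append((source, label))
    # Keep only groups that span more than one source.
    return {
        key: members
        for key, members in groups.items()
        if len({source for source, _ in members}) > 1
    }


# Hypothetical query interfaces from two sources in the Books domain.
interfaces = {
    "bookstore_a": ["Title", "Author", "ISBN"],
    "bookstore_b": ["title:", "Author(s)", "Keyword"],
}

matches = group_attributes(interfaces)
for key, members in matches.items():
    print(key, members)
```

In this toy input only the two "title" attributes are grouped; "Author" and "Author(s)" are not, because exact label comparison cannot handle label variants or synonyms. Variation of this kind is part of what makes matching query schemas across sources a research task.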
Using This Repository
This repository is publicly available to
facilitate research in areas related to Web integration. If you publish
material based on datasets or tasks in this repository, please cite the
source as follows, so that others can obtain the same datasets and reproduce your
experiments.
The UIUC web integration
repository. Computer Science Department, University of Illinois at
Urbana-Champaign. http://metaquerier.cs.uiuc.edu/repository, 2003.
BibTeX entry:
@MISC{UIUCRepository,
  title = "The {UIUC} Web Integration Repository",
  year = "2003",
  howpublished = "Computer Science Department, University of Illinois at Urbana-Champaign. http://metaquerier.cs.uiuc.edu/repository"
}
Donation
We welcome donations of additional datasets. Please contact
binhe[at]uiuc.edu.
For questions and suggestions, please contact
binhe[at]uiuc.edu.