Large-Scale Deep Web Integration:
Incomplete Bibliography
prepared for a tutorial in SIGMOD 2006

This bibliography collects references from the research literature on integrating structured data on the deep Web at a large scale, that is, "search-flavored" integration. It was first prepared for a tutorial at SIGMOD 2006. Although we focus on large-scale structured data integration, research in this area is active and the bibliography cannot be complete. We are continuing to build it; if you see any reference missing, please contact me (kcchang at cs uiuc edu). Thanks.
  Deep Web Survey:
  1. Structured Databases on the Web: Observations and Implications. K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. SIGMOD Record, 33(3):61-70, 2004.

  2. The Deep Web: Surfacing Hidden Value. M. K. Bergman. Technical report, BrightPlanet LLC, December 2000.

  Source Modeling and Selection:
  1. Searching for Hidden-Web Databases. L. Barbosa and J. Freire. In WebDB, pages 1-6, 2005.

  2. Query Routing: Finding Ways in the Maze of the DeepWeb. G. Kabra, C. Li, and K. C.-C. Chang. In WIRI, pages 64-73, 2005.

  3. Organizing structured web sources by query schemas: a clustering approach. B. He, T. Tao, and K. C.-C. Chang. In CIKM, pages 22-31, 2004.

  4. Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax. Z. Zhang, B. He, and K. C.-C. Chang. In SIGMOD Conference, pages 107-118, 2004.

  5. Automatic extraction of web search interfaces for interface schema integration. H. He, W. Meng, C. T. Yu, and Z. Wu. In WWW (Alternate Track Papers & Posters), pages 414-415, 2004.

  6. Clustering e-commerce search engines. Q. Peng, W. Meng, H. He, and C. T. Yu. In WWW (Alternate Track Papers & Posters), pages 416-417, 2004.

  7. Learning to Invoke Web Forms. N. Kushmerick. In CoopIS/DOA/ODBASE, pages 997-1013, 2003.

  8. Learning to Attach Semantic Metadata to Web Services. A. Heß and N. Kushmerick. In International Semantic Web Conference, pages 258-273, 2003.

  9. A Layered Architecture for Querying Dynamic Web Content. H. Davulcu, J. Freire, M. Kifer, and I. V. Ramakrishnan. In SIGMOD Conference, pages 491-502, 1999.

  10. Modeling Web Sources for Information Integration. C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, P. J. Modi, I. Muslea, A. Philpot, and S. Tejada. In AAAI/IAAI, pages 211-218, 1998.

  Schema Matching:
  1. WebIQ: Learning from the Web to Match Deep-Web Query Interfaces. W. Wu, A. Doan, and C. T. Yu. In ICDE, page 44, 2006.

  2. Making holistic schema matching robust: an ensemble approach. B. He and K. C.-C. Chang. In KDD, pages 429-438, 2005.

  3. Merging Interface Schemas on the Deep Web via Clustering Aggregation. W. Wu, A. Doan, and C. T. Yu. In ICDM, pages 801-804, 2005.

  4. Corpus-based Schema Matching. J. Madhavan, P. A. Bernstein, A. Doan, and A. Y. Halevy. In ICDE, pages 57-68, 2005.

  5. An Interactive Clustering-based Approach to Integrating Source Query interfaces on the Deep Web. W. Wu, C. T. Yu, A. Doan, and W. Meng. In SIGMOD Conference, pages 95-106, 2004.

  6. Instance-based Schema Matching for Web Databases by Domain-specific Query Probing. J. Wang, J.-R. Wen, F. H. Lochovsky, and W.-Y. Ma. In VLDB, pages 408-419, 2004.

  7. Discovering complex matchings across web query interfaces: a correlation mining approach. B. He, K. C.-C. Chang, and J. Han. In KDD, pages 148-157, 2004.

  8. A Holistic Paradigm for Large Scale Schema Matching. B. He and K. C.-C. Chang. SIGMOD Record, 33(4):20-25, 2004.

  9. WISE-Integrator: An Automatic Integrator of Web Search Interfaces for E-Commerce. H. He, W. Meng, C. T. Yu, and Z. Wu. In VLDB, pages 357-368, 2003.

  10. Statistical Schema Matching across Web Query Interfaces. B. He and K. C.-C. Chang. In SIGMOD Conference, pages 217-228, 2003.

  11. A survey of approaches to automatic schema matching. E. Rahm and P. A. Bernstein. VLDB J., 10(4):334-350, 2001.

  Source Querying, Crawling & Search:
  1. Query Selection Techniques for Efficient Crawling of Structured Web Sources. P. Wu, J.-R. Wen, H. Liu, and W.-Y. Ma. In ICDE, page 47, 2006.

  2. Object-level ranking: bringing order to Web objects. Z. Nie, Y. Zhang, J.-R. Wen, and W.-Y. Ma. In WWW, pages 567-574, 2005.

  3. Light-weight Domain-based Form Assistant: Querying Web Databases On the Fly. Z. Zhang, B. He, and K. C.-C. Chang. In VLDB, pages 97-108, 2005.

  4. Siphoning Hidden-Web Data through Keyword-Based Interfaces. L. Barbosa and J. Freire. In SBBD, pages 309-321, 2004.

  5. Exploiting a Search Engine to Develop More Flexible Web Agents. S. de Lin and C. A. Knoblock. In Web Intelligence, pages 54-60, 2003.

  6. Crawling the Hidden Web. S. Raghavan and H. Garcia-Molina. In VLDB, pages 129-138, 2001.

  Data Extraction:
  1. Interactive Wrapper Generation with Minimal User Effort. U. Irmak and T. Suel. In WWW, 2006.

  2. Extracting Objects from the Web. Z. Nie, F. Wu, J.-R. Wen, and W.-Y. Ma. In ICDE, page 123, 2006.

  3. 2D Conditional Random Fields for Web information extraction. J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma. In ICML, pages 1044-1051, 2005.

  4. Semantic annotation of unstructured and ungrammatical text. M. Michelson and C. A. Knoblock. In IJCAI, pages 1091-1098, 2005.

  5. Fully automatic wrapper generation for search engines. H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. T. Yu. In WWW, pages 66-75, 2005.

  6. Web data extraction based on partial tree alignment. Y. Zhai and B. Liu. In WWW, pages 76-85, 2005.

  7. Using the Structure of Web Sites for Automatic Segmentation of Tables. K. Lerman, L. Getoor, S. Minton, and C. A. Knoblock. In SIGMOD Conference, pages 119-130, 2004.

  8. Automatic information extraction from large websites. V. Crescenzi and G. Mecca. J. ACM, 51(5):731-779, 2004.

  9. Mining data records in Web pages. B. Liu, R. L. Grossman, and Y. Zhai. In KDD, pages 601-606, 2003.

  10. Data extraction and label assignment for web databases. J. Wang and F. H. Lochovsky. In WWW, pages 187-196, 2003.

  11. Extracting Structured Data from Web Pages. A. Arasu and H. Garcia-Molina. In SIGMOD Conference, pages 337-348, 2003.

  12. Wrapper Maintenance: A Machine Learning Approach. K. Lerman, S. Minton, and C. A. Knoblock. J. Artif. Intell. Res. (JAIR), 18:149-181, 2003.

  13. Automatic annotation of data extracted from large Web sites. L. Arlotta, V. Crescenzi, G. Mecca, and P. Merialdo. In WebDB, pages 7-12, 2003.

  14. Finite-State Approaches to Web Information Extraction. N. Kushmerick. In SCIE, pages 77-91, 2002.

  15. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. V. Crescenzi, G. Mecca, and P. Merialdo. In VLDB, pages 109-118, 2001.

  16. Boosted Wrapper Induction. D. Freitag and N. Kushmerick. In AAAI/IAAI, pages 577-583, 2000.

  17. Wrapper induction: Efficiency and expressiveness. N. Kushmerick. Artif. Intell., 118(1-2):15-68, 2000.

  18. Selective Sampling with Redundant Views. I. Muslea, S. Minton, and C. A. Knoblock. In AAAI/IAAI, pages 621-626, 2000.

  19. Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. C. A. Knoblock, K. Lerman, S. Minton, and I. Muslea. IEEE Data Eng. Bull., 23(4):33-41, 2000.

  20. Learning the Common Structure of Data. K. Lerman and S. Minton. In AAAI/IAAI, pages 609-614, 2000.

  21. Regression testing for wrapper maintenance. N. Kushmerick. In AAAI/IAAI, pages 74-79, 1999.

  22. A Hierarchical Approach to Wrapper Induction. I. Muslea, S. Minton, and C. A. Knoblock. In Agents, pages 190-197, 1999.

  23. Active Learning for Hierarchical Wrapper Induction. I. Muslea, S. Minton, and C. A. Knoblock. In AAAI/IAAI, page 975, 1999.

  24. Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. C.-N. Hsu and M.-T. Dung. Inf. Syst., 23(8):521-538, 1998.

  25. Semi-Automatic Wrapper Generation for Internet Information Sources. N. Ashish and C. A. Knoblock. In CoopIS, pages 160-169, 1997.

  26. Wrapper Induction for Information Extraction. N. Kushmerick, D. S. Weld, and R. B. Doorenbos. In IJCAI (1), pages 729-737, 1997.

  27. Mediators in the Architecture of Future Information Systems. G. Wiederhold. IEEE Computer, 25(3):38-49, 1992.

  System Integration:
  1. Toward Large Scale Integration: Building a MetaQuerier over Databases on the Web. K. C.-C. Chang, B. He, and Z. Zhang. In CIDR, pages 44-55, 2005.

  2. Mining semantics for large scale integration on the web: evidences, insights, and challenges. K. C.-C. Chang, B. He, and Z. Zhang. SIGKDD Explorations, 6(2):67-76, 2004.

  3. Retrieving and Semantically Integrating Heterogeneous Data from the Web. M. Michalowski, J. L. Ambite, C. A. Knoblock, S. Minton, S. Thakkar, and R. Tuchinda. IEEE Intelligent Systems, 19(3):72-79, 2004.

  4. Deploying Information Agents on the Web. C. A. Knoblock. In IJCAI, pages 1580-1586, 2003.

  5. The Ariadne Approach to Web-Based Information Integration. C. A. Knoblock, S. Minton, J. L. Ambite, N. Ashish, I. Muslea, A. Philpot, and S. Tejada. Int. J. Cooperative Inf. Syst., 10(1-2):145-169, 2001.

  6. Virtual Database technology. A. Gupta, V. Harinarayan, and A. Rajaraman. SIGMOD Record, 26(4):57-61, 1997.

  Other References:
  1. Compiling Source Descriptions for Efficient and Flexible Information Integration. J. L. Ambite, C. A. Knoblock, I. Muslea, and A. Philpot. J. Intell. Inf. Syst., 16(2):149-187, 2001.

  2. Computing Capabilities of Mediators. R. Yerneni, C. Li, H. Garcia-Molina, and J. D. Ullman. In SIGMOD Conference, pages 443-454, 1999.

  3. Mixed-initiative, multi-source information assistants. C. A. Knoblock, S. Minton, J. L. Ambite, M. Muslea, J. Oh, and M. Frank. In WWW, pages 697-707, 2001.

  4. Adapting to Source Properties in Processing Data Integration Queries. Z. G. Ives, A. Y. Halevy, and D. S. Weld. In SIGMOD Conference, pages 395-406, 2004.

  5. Learning Value Predictors for the Speculative Execution of Information Gathering Plans. G. Barish and C. A. Knoblock. In IJCAI, pages 3-9, 2003.

  6. Getting from Here to There: Interactive Planning and Agent Execution for Optimizing Travel. J. L. Ambite, G. Barish, C. A. Knoblock, M. Muslea, J. Oh, and S. Minton. In AAAI/IAAI, pages 862-869, 2002.

  7. Speculative Execution for Information Gathering Plans. G. Barish and C. A. Knoblock. In AIPS, pages 184-193, 2002.

  8. Selectively Materializing Data in Mediators by Analyzing User Queries. N. Ashish, C. A. Knoblock, and C. Shahabi. Int. J. Cooperative Inf. Syst., 11(1-2):119-144, 2002.

  9. Form-Based Proxy Caching for Database-Backed Web Sites. Q. Luo and J. F. Naughton. In VLDB, pages 191-200, 2001.

  10. The Niagara Internet Query System. J. F. Naughton, D. J. DeWitt, D. Maier, A. Aboulnaga, J. Chen, L. Galanis, J. Kang, R. Krishnamurthy, Q. Luo, N. Prakash, R. Ramamurthy, J. Shanmugasundaram, F. Tian, K. Tufte, S. Viglas, Y. Wang, C. Zhang, B. Jackson, A. Gupta, and R. Chen. IEEE Data Eng. Bull., 24(2):27-33, 2001.

  11. Answering queries using views: A survey. A. Y. Halevy. VLDB J., 10(4):270-294, 2001.

  12. Flexible and scalable cost-based query planning in mediators: A transformational approach. J. L. Ambite and C. A. Knoblock. Artif. Intell., 118(1-2):115-161, 2000.

  13. Planning by Rewriting: Efficiently Generating High-Quality Plans. J. L. Ambite and C. A. Knoblock. In AAAI/IAAI, pages 706-713, 1997.

  14. Querying Heterogeneous Information Sources Using Source Descriptions. A. Y. Levy, A. Rajaraman, and J. J. Ordille. In VLDB, pages 251-262, 1996.

Kevin C. Chang,