This online demo shows the work discussed in the following paper.
Understanding Web Query Interfaces: Best-effort Parsing with Hidden Syntax. Z. Zhang, B. He and K. C.-C. Chang. In Proceedings of the 2004 ACM SIGMOD Conference (SIGMOD 2004), Paris, France, June 2004. [PDF]
In particular, the demo shows the query capabilities extracted from query interfaces in various domains using a single, carefully crafted global grammar. The demo contains about 100 extracted query interfaces from the TEL-8 Query Interface Dataset.
Content of Demo
In the demo, we show
- A complete example of parsing the aa.com interface, which includes:
- The fragment of the interface that contains only the query form.
Our form extractor works on the separated form instead of the original interface because the separated form is much cleaner and contains only the information relevant to the form extractor. We implemented a tool, the form separator, to isolate the form from the rest of a query interface.
- A set of tokens that the parser works on.
Tokenization is a preprocessing step that prepares a set of tokens for the parser. To make this concrete, we show the tokenization result for aa.com. The result is in XML format. It records, for each token, a set of <attribute, value> pairs. The attributes shared by all tokens are: type, indicating the type of the token; position (left, right, top, bottom), recording its coordinates in the original HTML page; and path, recording its path in the HTML parse tree. Other attributes include, for example, value for text tokens, which records their string value, and Enumeration for selectionlist tokens, which records their options.
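As a rough illustration of how such a token file might be consumed, the following Python sketch loads tokens into simple records. The XML layout used here (a `tokens` root, `token` elements, position stored as four XML attributes, `option` children under `Enumeration`) is an assumption made for the example and may differ from the actual file in the demo. Only the attribute names type, position, path, value and Enumeration come from the description above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical token file: element names and layout are assumptions.
SAMPLE = """
<tokens>
  <token type="text" left="10" right="80" top="20" bottom="35"
         path="/html/body/form/table/tr[1]/td[1]">
    <value>From</value>
  </token>
  <token type="selectionlist" left="90" right="200" top="20" bottom="38"
         path="/html/body/form/table/tr[1]/td[2]">
    <Enumeration>
      <option>Morning</option>
      <option>Evening</option>
    </Enumeration>
  </token>
</tokens>
"""

@dataclass
class Token:
    type: str
    position: tuple  # (left, right, top, bottom) in page coordinates
    path: str        # path of the token in the HTML parse tree
    extra: dict = field(default_factory=dict)  # type-specific attributes

def load_tokens(xml_text):
    """Parse the (assumed) token XML into a list of Token records."""
    root = ET.fromstring(xml_text.strip())
    tokens = []
    for el in root.iter("token"):
        position = tuple(int(el.get(k)) for k in ("left", "right", "top", "bottom"))
        extra = {}
        value = el.find("value")
        if value is not None:  # text tokens carry their string value
            extra["value"] = (value.text or "").strip()
        enum = el.find("Enumeration")
        if enum is not None:   # selectionlist tokens carry their options
            extra["Enumeration"] = [(o.text or "").strip() for o in enum.findall("option")]
        tokens.append(Token(el.get("type"), position, el.get("path"), extra))
    return tokens

tokens = load_tokens(SAMPLE)
```

Keeping the shared attributes as fixed fields and the type-specific ones in a dictionary mirrors the split between common and other attributes described above.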
- The extracted query capability output by the parser.
We render the result into a standard HTML format for better visualization. The result contains three parts:
Constraints: report all the conditions in the interface in the format of <Attribute, Operator, Value>.
Conflicts: report the conflicts among extracted conditions.
Missing Elements: report the elements that are not interpreted by the form extractor.
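The three parts of the result can be pictured with a small Python sketch. It is not the form extractor's code. It assumes that each condition remembers the tokens it was built from, that two conditions conflict when they claim the same token, and that a missing element is a token no condition covers. The `Condition` class, the helper functions and the sample values are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Condition:
    attribute: str
    operator: str
    value: str
    tokens: frozenset  # ids of the tokens this condition was built from (assumed)

def find_conflicts(conditions):
    """Return pairs of conditions that claim at least one common token."""
    return [(a, b) for a, b in combinations(conditions, 2) if a.tokens & b.tokens]

def find_missing(all_token_ids, conditions):
    """Return ids of tokens that no condition interprets."""
    covered = set().union(*(c.tokens for c in conditions))
    return sorted(set(all_token_ids) - covered)

# Illustrative data only; not taken from the aa.com result.
conditions = [
    Condition("From", "equals", "text", frozenset({1, 2})),
    Condition("Depart", "equals", "enum", frozenset({3, 4})),
    Condition("Return", "equals", "enum", frozenset({4, 5})),
]
conflicts = find_conflicts(conditions)
missing = find_missing(range(1, 8), conditions)
```

In this example the last two conditions share token 4 and are reported as a conflict, while tokens 6 and 7 are reported as missing.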
- More experimental details in the global study, which include:
- The global grammar that the best-effort parser uses for understanding the query interfaces.
The file is in XML format. Each rule element in the file specifies a production in the grammar. The rule element defines:
head: the head symbol of the production
body: the component symbols (by node element) and the constraints to be satisfied (by edge element) in the production. The predicates (e.g., LeftAdjacent) defining the constraints are Boolean functions separately implemented in Python.
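To make the rule structure concrete, here is a minimal Python sketch of one production and one predicate. The XML attribute names (`id`, `symbol`, `predicate`, `from`, `to`), the symbols `TextVal`, `Attr` and `Val`, and the `max_gap` threshold are assumptions for illustration. The geometric definition of `left_adjacent` is also a guess at what a predicate such as LeftAdjacent might check, not the implementation used in the demo.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: the head derives from two nodes linked by one constraint.
RULE_XML = """
<rule>
  <head>TextVal</head>
  <body>
    <node id="a" symbol="Attr"/>
    <node id="b" symbol="Val"/>
    <edge predicate="LeftAdjacent" from="a" to="b"/>
  </body>
</rule>
"""

def left_adjacent(a, b, max_gap=20):
    """Assumed semantics: box a ends just left of box b on the same row.

    Boxes are (left, right, top, bottom) tuples.
    """
    a_left, a_right, a_top, a_bottom = a
    b_left, b_right, b_top, b_bottom = b
    horizontally_close = 0 <= b_left - a_right <= max_gap
    vertically_overlapping = a_top < b_bottom and b_top < a_bottom
    return horizontally_close and vertically_overlapping

# Predicates are plain Boolean functions looked up by name.
PREDICATES = {"LeftAdjacent": left_adjacent}

def parse_rule(xml_text):
    """Read the head, the nodes and the edges of one rule element."""
    rule = ET.fromstring(xml_text.strip())
    body = rule.find("body")
    return {
        "head": rule.findtext("head").strip(),
        "nodes": {n.get("id"): n.get("symbol") for n in body.findall("node")},
        "edges": [(e.get("predicate"), e.get("from"), e.get("to"))
                  for e in body.findall("edge")],
    }

def rule_matches(rule, boxes):
    """Check every edge constraint against the boxes bound to the nodes."""
    return all(PREDICATES[p](boxes[s], boxes[t]) for p, s, t in rule["edges"])

rule = parse_rule(RULE_XML)
same_row = rule_matches(rule, {"a": (10, 80, 20, 35), "b": (90, 200, 20, 38)})
other_row = rule_matches(rule, {"a": (10, 80, 20, 35), "b": (90, 200, 100, 118)})
```

Looking predicates up by name in a dictionary keeps the grammar file declarative while the constraint logic stays in separately implemented functions, as described above.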
- Query capabilities that the form extractor extracts for some of the experimental data in the global study.
Here we show some concrete results of extracted query capabilities. In particular, for each query interface, we show:
Interface Page: the original query interface.
Original Form: the fragment of the interface that contains only the query form.
Extracted Query Capability: the result output by the form extractor rendered in the standard HTML format.