This seminar is hosted by the Institute for the Future of Computing. It will be held in the Conference Room, Oxford eResearch Unit, 7 Keble Road, OX1 3QG, and will start with a buffet lunch at 1 pm.
This seminar will be of interest to anybody whose research or business decisions could be improved through the availability of web data, or to anybody genuinely interested in web data extraction.
Abstract: We will discuss the latest technology for automatically locating and extracting data from disparate websites and making these data usable for better search and decision-making.
In particular, we focus on the ongoing DIADEM project (funded by the European Research Council), which aims to fully automate data extraction in large application domains. Using prior knowledge of specific domains (e.g., real estate, used cars, immunology), DIADEM automatically explores the target website and locates the entities of the domain and their attributes. DIADEM learns how relevant data are structured on the website and generates a program that rapidly extracts them.
The DIADEM prototype is now ready to be used in other domains as well. We will describe our first experiments with the used-car domain. Moreover, DIADEM can extract useful data from web documents other than HTML pages, such as PDF documents.
As we will demonstrate, our novel approach is successful. Our prototype system can, for example, correctly analyse over 99% of all UK real-estate websites and extract the relevant data from these sites.
The web contains very large amounts of data that can be used for decision making in research and business. However, in order to make these data available to analysts, we need to collect them from the web, structure them, and feed them into a user application or a database. Search engines such as Google provide us with thousands of documents corresponding to user keywords, but fail to extract structured data objects from them. Modern data extraction tools, also known as wrapper generators, may help. They can be trained to extract specific data items, such as article-price pairs from online catalogues, or descriptions of houses for sale from real-estate websites. These tools, however, need to be manually trained for each individual target website from which they are to extract data, and are thus not suited to covering large application domains such as the UK real-estate domain, which consists of more than 17,000 differently structured websites.
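To illustrate why per-site training is costly, the following minimal sketch shows what a hand-written wrapper for a single catalogue site might look like. The URL, CSS selectors, and page structure are assumptions for illustration only; every differently structured website would need its own version, which is exactly why manual wrapper creation does not scale to thousands of sites in a domain.

import requests
from bs4 import BeautifulSoup

def extract_article_price_pairs(url):
    """Extract (article, price) pairs from one specific catalogue layout."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # These selectors encode prior knowledge of THIS hypothetical site's markup only.
    for item in soup.select("div.product"):        # hypothetical product container
        name = item.select_one("h2.title")         # hypothetical article name element
        price = item.select_one("span.price")      # hypothetical price element
        if name and price:
            pairs.append((name.get_text(strip=True),
                          price.get_text(strip=True)))
    return pairs

# Usage against a hypothetical catalogue page:
# for article, price in extract_article_price_pairs("https://example.com/catalogue"):
#     print(article, price)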
Please register your interest with Kate Pitts (kate.pitts@oerc.ox.ac.uk)