venerdì 16 settembre 2011 08:10
I am working on a c# web extractor application (winform), the users of the application are not familiar with creating regex, xpath to configure the application.
I need to find a way to make the application learn to recognize parts of web page.
Like a typical web page will have class, ids ( asuming all have ) which can be identified (there are exceptions but I am ignoring them for now).
the header will have <div id/class="top-header"> or <div id/class="header">
the content part will have <div id/class="main-content"> or <div id/class="content">
Yes, of-course all will not have class, id or they will be named like "hd-100".... some will have tables, and I am aware that there are a lot of different methods to identify the parts of a webpage, but I want to ignore all these for now.
I want to use Infer.Net to classify different parts of a web page, based on previous supervised training.
Supervised training, would be like:
Label: Header | ClassOrId: header | Position: 10% of the page | Tag: div
Label: Header | ClassOrId: top-header | Position: 15% of the page | Tag: div
Label: Footer | ClassOrId: footer | Position: 95% of the page | Tag: div
here the tag can be enum, position is int, and classorid is string.
the Infer.Net will have to identify the header part from the dom (using the training data), from a given never before seen webpage, yes there are a more flaws than solution, but still please stick to the requirement, train/ classify text using Infer.Net.
1) Is this possible? using Infer.Net
2) If not how do I solve the problem,
All possible, good, bad, useful, useless suggestions are welcome. :))
- Modificato milan.s venerdì 16 settembre 2011 08:17
Tutte le risposte
venerdì 16 settembre 2011 13:49Proprietario
The first type of model to consider would be a Multi-class Bayes Point Machine. You should use the sparse version, as you will encode text by having a feature for every word in your dictionary, and maintain a weight for each word, but only a few words (having feature value 1) may be present for a given training example. Other attributes such as position on page can be encoded as a single feature. The BPM example code is part of the Infer.NET installation. You will need to write a bit of code that maps from your raw data to the observed values of the indices and values.