Text Classification using Infer.Net
Friday, 16 September 2011 08:10
Hi,
I am working on a C# web extractor application (WinForms). The users of the application are not familiar with writing regex or XPath expressions to configure it.
I need to find a way to make the application learn to recognize the parts of a web page.
A typical web page will have classes and IDs (assuming all pages have them) that can be used to identify its parts. There are exceptions, but I am ignoring them for now.
Example:
the header will have <div id/class="top-header"> or <div id/class="header">
the content part will have <div id/class="main-content"> or <div id/class="content">
Yes, of course, not every page will have classes or IDs, and some will be named something like "hd-100"; some pages will use tables. I am aware that there are many different methods for identifying the parts of a web page, but I want to ignore all of that for now.
I want to use Infer.NET to classify the different parts of a web page, based on previous supervised training.
The supervised training data would look like this:
Label: Header | ClassOrId: header | Position: 10% of the page | Tag: div
Label: Header | ClassOrId: top-header | Position: 15% of the page | Tag: div
Label: Footer | ClassOrId: footer | Position: 95% of the page | Tag: div
Here the tag can be an enum, the position is an int, and ClassOrId is a string.
Infer.NET will have to identify the header part of the DOM of a never-before-seen web page, using the training data. Yes, this approach has more flaws than solutions, but please stick to the requirement: train and classify text using Infer.NET.
1) Is this possible using Infer.NET?
2) If not, how do I solve the problem?
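For illustration, the training rows above could be parsed into typed records along the following lines. This is only a minimal sketch, written in Python for brevity rather than in the C# of the application; the names `Tag`, `Example`, and `parse_example` are hypothetical, and the set of tags in the enum is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    # Assumed tag set; extend with whatever elements the extractor handles.
    DIV = "div"
    TABLE = "table"
    SPAN = "span"


@dataclass(frozen=True)
class Example:
    label: str        # e.g. "Header", "Footer"
    class_or_id: str  # raw class or id attribute value
    position: int     # vertical position as a percentage of the page (0-100)
    tag: Tag


def parse_example(line: str) -> Example:
    """Parse one 'Key: value | Key: value | ...' training row."""
    fields = {}
    for part in line.split("|"):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    # "10% of the page" -> 10
    position = int(fields["Position"].split("%")[0])
    return Example(fields["Label"], fields["ClassOrId"], position, Tag(fields["Tag"]))


rows = [
    "Label: Header | ClassOrId: header | Position: 10% of the page | Tag: div",
    "Label: Header | ClassOrId: top-header | Position: 15% of the page | Tag: div",
    "Label: Footer | ClassOrId: footer | Position: 95% of the page | Tag: div",
]
examples = [parse_example(row) for row in rows]
```

Keeping the label as a plain string leaves the set of page parts open; it could equally be an enum if the parts are fixed in advance.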
All possible, good, bad, useful, useless suggestions are welcome. :))
mr.milan.solanki@gmail.com
- Edited by milan.s Friday, 16 September 2011 08:17
All replies
Friday, 16 September 2011 13:49 Owner
The first type of model to consider is a multi-class Bayes Point Machine (BPM). Use the sparse version: you encode text by having one feature for every word in your dictionary and maintaining a weight for each word, but only a few words (those with feature value 1) will be present in any given training example. Other attributes, such as position on the page, can each be encoded as a single feature. The BPM example code is part of the Infer.NET installation. You will need to write a bit of code that maps your raw data to the observed values of the indices and values.
John
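The mapping from raw data to sparse indices and values described in the reply could look roughly like the sketch below. It is written in Python for brevity and does not call Infer.NET; the helper names (`tokenize`, `build_vocabulary`, `encode`), the rule for splitting a class or id into words, and the scaling of position to the 0-1 range are all assumptions made for illustration. The resulting index and value lists would then be passed as observed values to the sparse BPM model from the Infer.NET examples.

```python
import re


def tokenize(class_or_id):
    """Split 'top-header' or 'main_content' into lowercase word tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", class_or_id.lower()) if t]


def build_vocabulary(names):
    """Assign each distinct word a feature index, in order of first appearance."""
    vocab = {}
    for name in names:
        for token in tokenize(name):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(class_or_id, position_percent, vocab):
    """Return (indices, values) for one example in sparse form.

    Word features present in the example get value 1.0; words not in the
    vocabulary are skipped. Position is one extra feature placed after the
    word block, scaled to 0-1 (the scaling is an assumption of this sketch).
    """
    indices = sorted({vocab[t] for t in tokenize(class_or_id) if t in vocab})
    values = [1.0] * len(indices)
    indices.append(len(vocab))
    values.append(position_percent / 100.0)
    return indices, values


vocab = build_vocabulary(["header", "top-header", "footer"])
indices, values = encode("top-header", 15, vocab)
```

With this vocabulary, "top-header" activates only the features for "header" and "top", plus the single position feature, so each example carries a handful of non-zero entries regardless of dictionary size. The tag could be added in the same way, as one indicator feature per tag value.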