Hi,
I am working on a C# web extractor application (WinForms). The users of the application are not familiar with writing regex or XPath expressions to configure it.
I need to find a way to make the application learn to recognize the parts of a web page.
A typical web page has elements with classes and ids (assuming all of them do) that can be used for identification. There are exceptions, but I am ignoring them for now.
Example:
The header will have <div id/class="top-header"> or <div id/class="header">
The content part will have <div id/class="main-content"> or <div id/class="content">
Yes, of course not every element will have a class or id, and some will be named something like "hd-100". Some pages will use tables, and I am aware that there are many different ways to identify the parts of a web page, but I want to ignore all of that for now.
I want to use Infer.NET to classify the different parts of a web page, based on previous supervised training.
The supervised training data would look like this:
Label: Header | ClassOrId: header | Position: 10% of the page | Tag: div
Label: Header | ClassOrId: top-header | Position: 15% of the page | Tag: div
Label: Footer | ClassOrId: footer | Position: 95% of the page | Tag: div
Here Tag can be an enum, Position is an int (percentage from the top of the page), and ClassOrId is a string.
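To make the training format concrete, here is a minimal sketch of how I imagine the rows above could be represented and turned into numeric feature vectors before being handed to a classifier. All type and method names (TrainingRow, FeatureEncoder, and so on) are hypothetical, the extra enum members beyond the ones in my examples are assumptions, and nothing here is Infer.NET API; it only shows the encoding step (bias, scaled position, one-hot tag, bag of words over the class/id tokens).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical names throughout; none of this is part of Infer.NET.
public enum Label { Header, Content, Footer }
public enum Tag { Div, Table, Section, Other } // members beyond Div are assumptions

public sealed class TrainingRow
{
    public Label Label { get; set; }
    public string ClassOrId { get; set; }
    public int PositionPercent { get; set; } // 0..100, measured from the top of the page
    public Tag Tag { get; set; }
}

public static class FeatureEncoder
{
    // Split "top-header" into { "top", "header" } so that unseen
    // combinations of known words still share features.
    public static string[] Tokenize(string classOrId)
    {
        return classOrId
            .ToLowerInvariant()
            .Split(new[] { '-', '_', ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Map every token seen in the training rows to a column index.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<TrainingRow> rows)
    {
        var vocab = new Dictionary<string, int>();
        foreach (var row in rows)
        {
            foreach (var token in Tokenize(row.ClassOrId))
            {
                if (!vocab.ContainsKey(token))
                {
                    int index = vocab.Count;
                    vocab.Add(token, index);
                }
            }
        }
        return vocab;
    }

    // Layout: [bias, position in 0..1, one-hot tag..., bag-of-words...].
    // Tokens that are not in the vocabulary are silently ignored.
    public static double[] Encode(TrainingRow row, Dictionary<string, int> vocab)
    {
        int tagCount = Enum.GetValues(typeof(Tag)).Length;
        var x = new double[2 + tagCount + vocab.Count];
        x[0] = 1.0;
        x[1] = row.PositionPercent / 100.0;
        x[2 + (int)row.Tag] = 1.0;
        foreach (var token in Tokenize(row.ClassOrId))
        {
            int i;
            if (vocab.TryGetValue(token, out i))
            {
                x[2 + tagCount + i] = 1.0;
            }
        }
        return x;
    }
}

public static class Program
{
    public static void Main()
    {
        // The three example rows from the training data above.
        var rows = new List<TrainingRow>
        {
            new TrainingRow { Label = Label.Header, ClassOrId = "header", PositionPercent = 10, Tag = Tag.Div },
            new TrainingRow { Label = Label.Header, ClassOrId = "top-header", PositionPercent = 15, Tag = Tag.Div },
            new TrainingRow { Label = Label.Footer, ClassOrId = "footer", PositionPercent = 95, Tag = Tag.Div },
        };
        var vocab = FeatureEncoder.BuildVocabulary(rows);

        // An element from a never-before-seen page; its label is what the model should predict.
        var unseen = new TrainingRow { ClassOrId = "main-header", PositionPercent = 12, Tag = Tag.Div };
        double[] features = FeatureEncoder.Encode(unseen, vocab);
        Console.WriteLine(string.Join(", ", features));
    }
}
```

The idea behind splitting ClassOrId into tokens is that "main-header" was never seen in training, but its "header" token was, so the vector still carries useful evidence. A vector like this is the kind of input I would expect a multi-class classifier to take; whether that should be one of Infer.NET's learners or a custom model is exactly what I am unsure about.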
Infer.NET will then have to identify the header part of the DOM of a given, never-before-seen web page, using the training data.
Yes, there are more flaws than solutions in this approach, but please stick to the requirement: train and classify text using Infer.NET.
1) Is this possible using Infer.NET?
2) If not, how do I solve the problem?
All possible, good, bad, useful, useless suggestions are welcome. :))
mr.milan.solanki@gmail.com