Extracting Data from HTTPS request

  • Question

  • I'm looking to create a page count extractor by scraping the printer's informational page for the current page count and serial number.

    The issue I'm finding is that while I can pull the page's data into a string, I'm unable to easily extract the content I'm looking for.

    Here's the code I've scraped together so far.  I had to tell the script to ignore SSL errors, which worked (in case you are wondering about the validation callback).

    Unfortunately, my attempt at isolating the values with Select-String isn't working, because the downloaded data isn't broken into lines by line feeds.

    # Ignore SSL certificate errors; this only needs to be set once, not per printer
    [System.Net.ServicePointManager]::ServerCertificateValidationCallback = {$true}
    $web = New-Object Net.WebClient
    $printdata = Get-Content -Path "C:\printerlist.txt"
    foreach ($printer in $printdata) {
        $webstring = "https://" + $printer + "/cgi-bin/dynamic/printer/config/reports/deviceinfo.html"
        $pagedat = $web.DownloadString($webstring)
        # $pagedat is passed as a single string, so .Line comes back as the whole page
        $printerserial = (Select-String -InputObject $pagedat -Pattern "Serial").Line
        $printercount = (Select-String -InputObject $pagedat -Pattern "Page Count").Line
        Write-Host $printer, $printerserial, $printercount
        #| Out-File -FilePath "C:\Users\badler\Desktop\pagecounts\pagecount_output.log" -Force -Append
        # Clear Variables
    }

    Below is a sample of a page.  The data I really need (the Page Count and Serial Number values) is hard to isolate because the formatting is never 100% consistent.  Can someone point me in a good direction for separating out those values?  The data pulled into the $pagedat string also contains inconsistent line breaks.

    <head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
    <link REL="stylesheet" HREF="/configStyle.css" TYPE="text/css"><title>Print Menus</title>
    <body bgcolor="FFFFFF">
    <center><FONT STYLE="font-family: sans-serif; font-size: 20pt; font-weight: bold; color: #0000EE">Lexmark T650</FONT></center><TABLE><br><TR><td><p align="left" style="margin-left: 0;"><b>Device Information</b></p></td><td><p>  </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Page Count</p></td><td><p> =  96874  </p></td>
    <TR><td><p align="left" style="margin-">Installed Memory</p></td><td><p> =  128 MB </p></td>
    <TR><td><p align="left" style="margin-">Processor Speed</p></td><td><p> =  500 MHz </p></td>
    <TR><td><p align="left" style="margin-">Serial Number</p></td><td><p> =  793TMF3 </p></td>
    <TR><td><p align="left" style="margin-left: 10;">TLI</p></td><td><p> =  30G0100 </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Machine Type</p></td><td><p> =  4062-01A </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Asset Tag</p></td><td><p> =   </p></td>
    <TR><td><p align="left" style="margin-left: 10;">CalStat</p></td><td><p> =  100e2cff </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Engine ID</p></td><td><p> =  45 </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Fuser Type</p></td><td><p> =  1 </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Loader</p></td><td><p> =  LR.JP.P311e2-0C </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Kernel</p></td><td><p> =  FPR.APS.F267c2-0C </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Base</p></td><td><p> =  LR.JP.P311e2-0C </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Network</p></td><td><p> =  NR.APS.N447b2-0C </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Network Drvr</p></td><td><p> =  LR.JP.P311e2-0C </p></td>
    <TR><td><p align="left" style="margin-left: 10;">P-Scribe</p></td><td><p> =  P41f </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Engine</p></td><td><p> =  AR.JP.E205-0 </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Panel</p></td><td><p> =  9.9 </p></td>
    <TR><td><p align="left" style="margin-left: 10;">Font</p></td><td><p> =  8.11H01-U5.0 </p></td>

    • Moved by Bill_Stewart Monday, April 30, 2018 8:47 PM Abandoned
    Thursday, August 3, 2017 6:29 PM

All replies

  • Hi BAWrites,

    You will need to look at Regular Expressions to extract the information.

    I found these two articles useful:

    https://technet.microsoft.com/en-us/library/2007.11.powershell.aspx?f=255&MSPPError=-2147217396

    https://mcpmag.com/articles/2015/09/30/regex-groups-with-powershell.aspx

    I will try to find time to write a regular expression for what you are looking for. I'm still learning them, so it takes me a bit.
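
    For orientation, here is a minimal sketch of what a capture-group match looks like in PowerShell, run against one row from the sample page in the question. The variable name $row and the group name "pages" are made up for this illustration and do not come from the original script.

    ```powershell
    # One table row copied from the sample page in the question
    $row = '<TR><td><p align="left" style="margin-left: 10;">Page Count</p></td><td><p> =  96874  </p></td>'

    # (?<pages>\d+) is a named capture group; \s* absorbs whatever spacing surrounds the "="
    if ($row -match 'Page Count</p></td><td><p>\s*=\s*(?<pages>\d+)') {
        $Matches['pages']   # the digits captured by the named group
    }
    ```

    Against that sample row this emits 96874. The group is read with $Matches['pages'] rather than dot notation to avoid any clash with the hashtable's own properties.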

    Thanks, Tim.

    Saturday, August 5, 2017 7:15 AM
  • Hi again BAWrites,

    I assigned $htmlstring to your HTML output above.

    Please see Regular Expression below:

    [regex]::Matches( $htmlstring, '(?<=Page\sCount</p></td><td><p>\s=\s\s)\w+').value
    [regex]::Matches( $htmlstring, '(?<=Serial\sNumber</p></td><td><p>\s=\s\s)\w+').value

    I also used RegEx101 to help.
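
    Since the question mentions inconsistent spacing and line breaks, a more tolerant variant may be worth trying. The sketch below is only one possible approach: Get-DeviceInfoValue is a hypothetical helper name, and the pattern assumes every row keeps the "label</p></td><td><p> = value </p>" shape shown in the sample page, with optional whitespace between the tags.

    ```powershell
    function Get-DeviceInfoValue {
        # Hypothetical helper (not from the thread): returns the text after "=" in the
        # row whose label cell ends with $Label, or $null when no such row is found.
        param(
            [string]$Html,
            [string]$Label
        )
        # \s* between the tags tolerates stray spaces and line breaks in the markup
        $pattern = [regex]::Escape($Label) +
            '\s*</p>\s*</td>\s*<td>\s*<p>\s*=\s*(?<value>[^<]*?)\s*</p>'
        $options = [System.Text.RegularExpressions.RegexOptions]::IgnoreCase
        $m = [regex]::Match($Html, $pattern, $options)
        if ($m.Success) { $m.Groups['value'].Value } else { $null }
    }

    # Two rows from the sample page; the second is split across lines on purpose
    $sample = @(
        '<TR><td><p align="left" style="margin-left: 10;">Page Count</p></td><td><p> =  96874  </p></td>',
        '<TR><td><p align="left" style="margin-">Serial Number</p></td>',
        '<td><p> =  793TMF3 </p></td>'
    ) -join "`n"

    [pscustomobject]@{
        PageCount = Get-DeviceInfoValue -Html $sample -Label 'Page Count'
        Serial    = Get-DeviceInfoValue -Html $sample -Label 'Serial Number'
    }
    ```

    Inside the loop from the question, $pagedat would take the place of $sample. Capturing everything up to the closing </p> instead of \w+ also keeps values that contain dots or hyphens, such as the firmware strings further down the page.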

    Thanks, Tim.

    Saturday, August 5, 2017 9:31 PM
  • Thanks!  For some reason the responses to this post hadn't shown up in my Gmail account.

    I'll check out both of the regular expression links you provided and try out the regex you posted above!

    Tuesday, August 15, 2017 2:00 PM