I downloaded PowerShell 4 (for Windows 7) yesterday and spent a few hours using Google to try to learn from what other people have done. I got a script working to list a set of URLs that had a certain word on the page, so I thought it wouldn't be too hard to get a script working to gather URLs from a set of search results.
Unfortunately, I'm totally confused. I've seen a few different functions used for similar purposes, and can't make sense of where to start.
Here's what I think I understand, based on the scripts I've seen...
- I probably need to use [ $browserAgent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36' ] and pass it with [ -UserAgent ].
My understanding is that doing so makes the site serve PowerShell the same pages it serves a normal desktop browser. If I just use [ Invoke-WebRequest -Uri $Site ] with PowerShell's default user agent, the number of links is smaller than what shows up if I browse the site in Firefox.
- I need a loop that adds a page number to the end of a base URL ( e.g. [ for ($num = 1; $num -le 748; $num++) ] for a search that has 748 pages of results; see the sketch after this list ).
I haven't seen a way to get all results from a search as if the results weren't split into pages.
- I need to add the URLs individually to an $outarray so that I can use [ $outarray += $i + "`r`n" ] in my script with [ $outarray | Out-File D:\powershellurlsresults.txt -Width 220 ] to save the URL list.
$i would be the URLs collected from each page, gathered one by one in a loop that runs once per page.
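Based on other scripts I've seen, I'm guessing the paging part would look something like this (the 748 is just the page count from my example, and I haven't verified the user-agent part myself):

$urlstart = "http://search.thesite.com/search.php?type=long&content=submissions&page="
$browserAgent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36'

for ($num = 1; $num -le 748; $num++) {
    $url = "$urlstart$num"    # builds ...&page=1, ...&page=2, and so on
    $page = Invoke-WebRequest -Uri $url -UserAgent $browserAgent
    # scrape $page.Links here, one results page per pass
}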
Ideally, there would be a way to grab links whose <a> title attribute includes the word "submission(s)". I know that [ if ($output -like "*submission(s)*") ] can be used to operate on pages that contain a certain word, so I'm hoping there's a way to grab a link by its HTML title.
For the purpose of this script, maybe [ { $_.href -like '*submissions*' } ] would do well enough, since each link to be collected ends in "page=submissions" ( e.g. http://www.thesite.com/content/grouppage.php?uid=49281203&page=submissions ).
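If filtering by title is possible, I'm guessing it would have to go through the parsed DOM rather than $page.Links, something like this (this assumes the default IE-based parser, i.e. no -UseBasicParsing, and I haven't tested whether the title property actually comes through):

$page = Invoke-WebRequest -Uri $url -UserAgent $browserAgent

# walk every <a> element in the parsed document and keep the ones
# whose title attribute mentions "submission"
$page.ParsedHtml.getElementsByTagName('a') |
    Where-Object { $_.title -like '*submission*' } |
    ForEach-Object { $_.href }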
If I start with the base...
# PowerShell Invoke-WebRequest Search Example
$Site = "http://search.thesite.com/search.php?type=long&content=submissions&page=1"
$Test = Invoke-WebRequest -URI $Site
$Test.Links | ForEach-Object { $_.href }
I can get a list of URLs, but the ampersands come back HTML-encoded, so & turns into [ &amp; ]. I need a string replace to fix that.
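If I understand right, that's just HTML entity encoding in the href attribute, so either of these should clean it up (reusing the example link from above; System.Net.WebUtility should be available on the .NET 4.5 that PowerShell 4 requires):

$href = 'http://www.thesite.com/content/grouppage.php?uid=49281203&amp;page=submissions'

# plain string replace of the encoded ampersand
$href -replace '&amp;', '&'

# or a full HTML-entity decode, which would handle any other entities too
[System.Net.WebUtility]::HtmlDecode($href)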
Here's my current scratchpad:
$url = "$urlstart$num"
$urlstart = "http://search.thesite.com/search.php?type=long&content=submissions&page="
$num ="1 -le 750; $num++"
$browserAgent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36'
$page = Invoke-WebRequest -Uri $url -UserAgent $browserAgent
$page.Links
| Where-Object { $_.href -like '*submissions*' }
| ForEach-Object { $_.href }
$outarray += $page.links + "`r`n"
$outarray | Out-File D:\powershellurlsresults.txt -width 220
For all I know, I'm completely off-base with the script so far. Can anyone explain what's wrong or what I need to add, and why the changes need to be made?
I could probably modify my other working script to accomplish what I'm trying to do, but I'd like to get this one working. ( The other script takes a list of URLs in a text file, scans each URL for a word, and returns all URLs that contain that word. I could probably modify it so that it extracts URLs from each page when the URL includes "submission"; the sketch after this paragraph is roughly what I mean. However, it seems silly to generate several hundred URLs in a text file if there's a way to automate grabbing URLs from every page of a search. )
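That modification would be something like this (D:\urllist.txt is a placeholder for the text file of URLs; I haven't run this version):

foreach ($site in Get-Content D:\urllist.txt) {
    # pull the links from each listed page and keep the submission ones
    (Invoke-WebRequest -Uri $site).Links |
        Where-Object { $_.href -like '*submissions*' } |
        ForEach-Object { $_.href }
}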