HTTP Parser guide Tuesday 25 May, 2004

}-=Loki=-{
lokiwashere@yahoo.co.nz

The following steps walk you through grabbing information from a website for use in your Samurize scripts.

First, browse to the website you want to get the information from.




Then view the page's HTML source (press ALT to reach the menu bar, then View -> Source).




Now this is the tricky part. You need to find some unique way of locating the text you want, e.g. a heading just before the line you want, or a sequence of HTML that doesn't appear anywhere else in the page.

For this example, I searched the source and found that the text “width="70%"><font class="normal"> <a href="showthread.php?s=” appeared just before each thread name (which is what I was after).
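If you'd rather not eyeball the source, a few lines of Python can tell you how often a candidate string appears. This is only a rough sketch and not part of the parser: SAMPLE_HTML is a made-up stand-in for the saved page source, and marker_positions is a hypothetical helper name.

```python
# Made-up stand-in for the saved page source; the real forum HTML will differ.
SAMPLE_HTML = (
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=abc&threadid=1">Manhunt</a></font></td>\n'
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=abc&threadid=2">Second thread</a></font></td>\n'
)

# The string found in the page source above.
MARKER = 'width="70%"><font class="normal"> <a href="showthread.php?s='


def marker_positions(html, marker):
    """Return the start offset of every occurrence of marker in html."""
    positions = []
    start = html.find(marker)
    while start != -1:
        positions.append(start)
        # Continue searching after the occurrence we just found.
        start = html.find(marker, start + len(marker))
    return positions


if __name__ == "__main__":
    hits = marker_positions(SAMPLE_HTML, MARKER)
    print(len(hits), "occurrence(s) at offsets", hits)
```

One hit per thread row is what you want here. If the count is higher than the number of threads, the string also appears somewhere else and you need a longer one.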




Make an INI file for this page. For this example, I just copied the HTTPBase.ini file and renamed it to NZGames_Forums.ini




Now we want to edit the INI file. Put the full URL in the URL= line of the [Settings] section.
We also want to give it a temp filename where the parser saves the HTML file while it works on it.




You'll notice above that I have made a section called [Thread 1 name] (a section is denoted by [ ] around its name).
Inside it, you must have a Default=** line. Put asterisks (*) around the text so that you can use spaces. For a default of a single space you'd simply have Default=* *
Find before= tells the parser to search for the text between the * * and stop just after its last character.
I am telling the parser to search for the unique string I found earlier (width="70%"><font class="normal"> <a href="showthread.php?s=) with * * around it.
I then tell the parser to find the next >, because that > appears just before the thread name I want (see the page source from earlier).
Copy Until= means we want the text between the > and the </a> (in this case it turns out to be “Manhunt”).
When the parser runs, it downloads the HTML, does this search, and puts the result in the Result= value.
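To make the search order easier to picture, here is a rough Python imitation of what the section describes. It is not the parser's own code: unwrap and extract are hypothetical names, PAGE is a made-up fragment, and returning the default when a search fails is my assumption about how the Default= value gets used.

```python
def unwrap(value):
    """Strip the surrounding asterisks from a value such as '*some text*'."""
    if len(value) >= 2 and value.startswith("*") and value.endswith("*"):
        return value[1:-1]
    return value


def extract(html, find_before, copy_until, default=""):
    """Skip past each find_before string in turn, then copy up to copy_until.

    Assumption: if any search fails, the default is returned.
    """
    pos = 0  # every section starts from the beginning of the file
    for needle in find_before:
        hit = html.find(needle, pos)
        if hit == -1:
            return default
        pos = hit + len(needle)  # stop just after the last character
    end = html.find(copy_until, pos)
    if end == -1:
        return default
    return html[pos:end]


# Made-up fragment of page source for illustration only.
PAGE = (
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=abc&threadid=1">Manhunt</a></font></td>'
)

if __name__ == "__main__":
    marker = unwrap('*width="70%"><font class="normal"> <a href="showthread.php?s=*')
    print(extract(PAGE, [marker, ">"], "</a>", default=unwrap("* *")))
```

The list passed as find_before plays the role of the Find before= lines, in the same order they would be searched.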

To grab the next field I wanted (the name of the person that started the thread) I made another section called
[Thread 1 Starter]
Because each search starts from the beginning of the file, I tell it to search for the width="70%"><font class="normal"> <a href="showthread.php?s= string again, then find the part about the thread starter (which was <a href="member.php?s=), and then get the text between > and </a>

I do the same thing for [Thread 1 Replies]



Now I want to get the 2nd thread's information. I search for the unique string that precedes each thread twice, so that I end up at the start of the 2nd thread's information. The rest is similar to above.
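The "search twice" trick can be sketched in Python like this. Again this only imitates the idea; thread_field is a hypothetical helper and PAGE is invented sample HTML, so the real forum source will look different.

```python
MARKER = 'width="70%"><font class="normal"> <a href="showthread.php?s='
STARTER = '<a href="member.php?s='

# Invented two-thread sample; not the real forum page.
PAGE = (
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=abc&threadid=1">Manhunt</a></font></td>\n'
    '<td><a href="member.php?s=abc&userid=7">SomeUser</a></td>\n'
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=abc&threadid=2">Second thread</a></font></td>\n'
    '<td><a href="member.php?s=abc&userid=9">OtherUser</a></td>\n'
)


def thread_field(html, thread_number, then_find=()):
    """Return the link text of a field belonging to the Nth thread, or None.

    The marker is searched thread_number times from the top of the file,
    then any extra strings, then the next '>', and the text up to '</a>'
    is copied.
    """
    steps = [MARKER] * thread_number + list(then_find) + [">"]
    pos = 0
    for needle in steps:
        hit = html.find(needle, pos)
        if hit == -1:
            return None
        pos = hit + len(needle)
    end = html.find("</a>", pos)
    return None if end == -1 else html[pos:end]


if __name__ == "__main__":
    print(thread_field(PAGE, 1))             # name of thread 1
    print(thread_field(PAGE, 2))             # name of thread 2
    print(thread_field(PAGE, 2, [STARTER]))  # starter of thread 2
```

Repeating the marker N times is what moves the search to the Nth thread; everything after that is the same as for the first thread.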




Now we want to make a simple config to test this. Add a Plugin meter.




Choose HTTPParser and DownloadAndParseByMinutes, which lets us select the refresh interval (default 15 mins). The parser will never refresh sooner than this setting (unless a reload is done).
Enter the full path and name of the ini file we made as the “Schema name”




Set the frequency of downloading/parsing (every 3 hours, i.e. 180 mins, should be fine for most things).
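The "never sooner than this setting" rule amounts to a simple time check. Here is a small Python sketch of that idea. It is my own illustration, not the plugin's code, and should_download and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta


def should_download(last_download, now, interval_minutes=15, reload_requested=False):
    """Decide whether the page should be downloaded and parsed again.

    Assumptions: a reload always forces a refresh, and a page that has
    never been downloaded (last_download is None) is fetched straight away.
    """
    if reload_requested or last_download is None:
        return True
    return now - last_download >= timedelta(minutes=interval_minutes)


if __name__ == "__main__":
    last = datetime(2004, 5, 25, 12, 0)
    # With a 180 minute interval, nothing happens until 15:00.
    print(should_download(last, datetime(2004, 5, 25, 14, 59), 180))
    print(should_download(last, datetime(2004, 5, 25, 15, 0), 180))
```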




Now let's test the plugin. You should get “Parsed OK” the first time it runs. (Every other time you “check”, it should show the time it last downloaded and parsed.)




I use an activescript to read the results from the INI file. I found this better than a plugin poll, which occurs way too frequently for my liking. (I like my PC to be almost completely idle when it should be.)










Enter the section name (e.g. Thread 1 name)




Now enter the result key name. You can have more than 1 result, which you'll learn to understand by looking at the other ini schemas. For now, we just use Result




Then enter any default (this will usually be overridden by the default you chose in the INI file).




Now enter the full path and filename of the INI file we made.
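The four values entered above (section name, result key, default, INI file) map onto a plain INI lookup. The Python sketch below shows the same lookup outside Samurize. It is not the activescript itself: read_result is a hypothetical helper, and SAMPLE_INI is a guess at what a parsed section might look like, so check your own file for the real layout.

```python
import configparser

# Guess at a parsed section; the real file written by the parser may differ.
SAMPLE_INI = """
[Thread 1 name]
Default=*unknown*
Result=Manhunt
"""


def read_result(ini_text, section, key="Result", default=""):
    """Look up key in section of the INI text, falling back to default."""
    # interpolation=None keeps '%' characters (e.g. width="70%") literal;
    # strict=False tolerates a key that is repeated within one section.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(ini_text)
    return parser.get(section, key, fallback=default)


if __name__ == "__main__":
    print(read_result(SAMPLE_INI, "Thread 1 name"))
    print(read_result(SAMPLE_INI, "No such section", default="n/a"))
```

To read a real file instead of a string, parser.read(path) can be used in place of parser.read_string.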


Now let's test it. In this example, it should return “Manhunt”.






A more complete NZGames_Forums.ini is included in the scripts directory for you to look at. The other scripts demonstrate the use of the Format and Translate routines.