HTTP Parser guide Tuesday 25 May, 2004
}-=Loki=-{
lokiwashere@yahoo.co.nz
The following steps will walk you through how to easily grab information from a website for use in your Samurize scripts.
First, browse to the website you want to get the information from.
Then, go ALT->View Source to see the page's HTML.
Now this is the tricky part. You need to find some unique way of locating the text you want, e.g. a heading before the line you want, or a sequence of HTML strings that doesn't appear anywhere else.
For this example, I searched the source and found that the text “width="70%"><font class="normal"> <a href="showthread.php?s=” appeared just before each thread name (which is what I was after).
Make an INI file for this page. For this example, I just copied the HTTPBase.ini file and renamed it to NZGames_Forums.ini
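Before committing to a marker, it can help to check how often it really occurs in the saved page source. The following is a minimal Python sketch of that check, not part of Samurize or the HTTP parser; the helper name count_marker and the sample HTML are made up for illustration.

```python
# Hypothetical helper (not part of Samurize): count how many times a
# candidate marker string occurs in a page's HTML source.
def count_marker(html: str, marker: str) -> int:
    return html.count(marker)


# The marker string from this guide.
MARKER = 'width="70%"><font class="normal"> <a href="showthread.php?s='

# Made-up sample source imitating a forum listing with two threads.
SAMPLE_HTML = (
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=abc&t=1">Manhunt</a></font></td>\n'
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=abc&t=2">Second thread</a></font></td>\n'
)

if __name__ == "__main__":
    # One occurrence per thread row means the marker is a usable anchor.
    print(count_marker(SAMPLE_HTML, MARKER))
```

If the count matches the number of items you expect (here, one per thread), the marker is a reasonable anchor to search for.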
Now we want to edit the ini file. Put the full URL address in the [Settings] URL= part.
We also want to give it a temp filename to save the HTML file to while it works on it.
You'll notice above that I have made a section (a section is denoted by [ ] around its name) called [Thread 1 name]
Inside it, you must have a Default=** line. Put asterisks (*) around the text so that you can use spaces. To have a default of a single space, you'd simply have Default=* *
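The asterisk wrapping can be pictured as follows. This is a small Python sketch of the idea only; it assumes the parser simply drops one leading and one trailing asterisk, which this guide does not confirm, and the function name unwrap is hypothetical.

```python
# Hypothetical illustration of the * * wrapping described above.
# Assumption: exactly one leading and one trailing asterisk are removed,
# and everything in between (including spaces) is kept as-is.
def unwrap(value: str) -> str:
    if len(value) >= 2 and value.startswith("*") and value.endswith("*"):
        return value[1:-1]
    # Values that are not wrapped are returned unchanged.
    return value


if __name__ == "__main__":
    print(repr(unwrap("* *")))    # a default of a single space
    print(repr(unwrap("**")))     # an empty default
    print(repr(unwrap("*N/A*")))  # a text default
```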
Find before= This tells the parser to search for this text (between the * *) and then wait just after its last character.
I am telling the parser to search for the unique string I found earlier (width="70%"><font class="normal"> <a href="showthread.php?s=) with * * around it.
I then tell the parser to find the next > because the > appears just before the thread name I want (see the source file from step 02).
Copy Until= means we want the text between the > and the </a> (in this case it turns out to be “Manhunt”).
When the parser runs, it will download the HTML, do this search, and put the result in the Result= value.
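The search described in the last few steps (find the unique string, find the next >, then copy until </a>) can be sketched in Python as below. This is only an illustration of the logic under stated assumptions, not the plugin's actual code; the function name extract, its parameters, and the sample HTML are hypothetical.

```python
from typing import List

# The unique string from this guide.
MARKER = 'width="70%"><font class="normal"> <a href="showthread.php?s='


def extract(html: str, find_before: List[str], copy_until: str,
            default: str = "") -> str:
    """Run each 'find before' search in order, then copy text up to
    'copy_until'. Returns 'default' if any search fails."""
    pos = 0  # every extraction starts from the beginning of the file
    for marker in find_before:
        idx = html.find(marker, pos)
        if idx == -1:
            return default
        pos = idx + len(marker)  # wait just after the marker's last character
    end = html.find(copy_until, pos)
    if end == -1:
        return default
    return html[pos:end]


# Made-up sample source containing one thread row.
SAMPLE_HTML = (
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=abc&t=1">Manhunt</a></font></td>'
)

if __name__ == "__main__":
    print(extract(SAMPLE_HTML, [MARKER, ">"], "</a>"))
```

Falling back to the default when a search fails mirrors the role of the Default= line: the meter still has something to show if the page layout changes.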
To grab the next field I wanted (the name of the person that started the thread), I made another section called [Thread 1 Starter]
Because each search starts from the beginning of the file, I tell it to search for the width="70%"><font class="normal"> <a href="showthread.php?s= string first, then find the part about the thread starter (which was <a href="member.php?s=), and then get the text between > and </a>
I do the same thing for [Thread 1 Replies]
Now I want to get the 2nd thread's information. I search for the unique string before each Thread twice so that I am at the start of the 2nd thread's information. The rest is similar to above.
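The "search for the unique string twice" trick generalises to the Nth thread. Here is a rough Python sketch of that idea; it is an illustration only, and the function names (extract_after, thread_name, thread_starter) and the sample HTML with its member names are invented for the example.

```python
from typing import List

MARKER = 'width="70%"><font class="normal"> <a href="showthread.php?s='
STARTER_MARKER = '<a href="member.php?s='


def extract_after(html: str, markers: List[str], copy_until: str,
                  default: str = "") -> str:
    """Find each marker in turn from the start of the file, then copy
    everything up to 'copy_until'."""
    pos = 0
    for marker in markers:
        idx = html.find(marker, pos)
        if idx == -1:
            return default
        pos = idx + len(marker)
    end = html.find(copy_until, pos)
    return html[pos:end] if end != -1 else default


def thread_name(html: str, n: int) -> str:
    # Searching for the marker n times lands at the start of thread n.
    return extract_after(html, [MARKER] * n + [">"], "</a>")


def thread_starter(html: str, n: int) -> str:
    # Same positioning, then skip ahead to the member link of that thread.
    return extract_after(html, [MARKER] * n + [STARTER_MARKER, ">"], "</a>")


# Made-up sample source with two thread rows.
SAMPLE_HTML = (
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=x&t=1">Manhunt</a></font></td>'
    '<td><a href="member.php?s=x&u=7">Alice</a></td>\n'
    '<td width="70%"><font class="normal"> '
    '<a href="showthread.php?s=x&t=2">Second thread</a></font></td>'
    '<td><a href="member.php?s=x&u=9">Bob</a></td>\n'
)

if __name__ == "__main__":
    print(thread_name(SAMPLE_HTML, 2), "/", thread_starter(SAMPLE_HTML, 2))
```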
Now we want to make a simple config to test this. Add a Plugin meter.
Choose HTTPParser and DownloadAndParseByMinutes, which allows us to select the refresh interval (default 15 mins). The parser will never refresh sooner than this setting (unless a reload is done).
Enter the full path and name of the ini file we made as the “Schema name”.
Set the frequency of downloading/parsing. Every 3 hours (i.e. 180 mins) should be fine for most things.
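The "never refresh sooner than this setting" rule can be pictured with a small Python sketch. This is a guess at the behaviour as described here, not the plugin's implementation; the function name should_refresh and its parameters are hypothetical, and times are plain seconds.

```python
from typing import Optional


def should_refresh(last_download: Optional[float], now: float,
                   interval_minutes: int = 15,
                   reload_requested: bool = False) -> bool:
    """Return True when a new download/parse is due.

    Assumptions: times are in seconds, a reload always forces a refresh,
    and a page that was never downloaded is always due.
    """
    if reload_requested or last_download is None:
        return True
    return (now - last_download) >= interval_minutes * 60


if __name__ == "__main__":
    # With a 180 minute interval, a poll after 179 minutes does nothing.
    print(should_refresh(0.0, 179 * 60, interval_minutes=180))
    print(should_refresh(0.0, 180 * 60, interval_minutes=180))
```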
Now let's test the plugin. You should get “Parsed OK” the first time it runs. (Every other time you “check”, it should say the time it downloaded and parsed.)
I use an activescript to read the results from the ini file. I found this better than a plugin poll, which occurs way too frequently for my liking. (I like my PC to be almost completely idle when it should be.)
Enter the section name (e.g. Thread 1 name)
Now enter the result key name. You can have more than 1 result, which you'll learn to understand by looking at the other ini schemas. For now, we just use Result
Then enter any default (this will usually be overridden by the default you chose in the ini file)
Now enter the full path and filename of the INI file we made.
Now let's test it. In this example, it should return “Manhunt”
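For readers who want to check the ini file outside Samurize, the lookup (section name, result key, default, ini path) can be imitated in Python. This is a sketch using Python's standard configparser rather than the activescript used in this guide; the function name read_result is hypothetical, and it assumes the parser stores the plain text after Result=.

```python
import configparser


def read_result(ini_path: str, section: str, key: str = "Result",
                default: str = "") -> str:
    """Read one result value from the schema ini file, falling back to
    'default' when the file, section, or key is missing."""
    # interpolation=None: values such as width="70%" contain % signs.
    # strict=False: tolerate a key that appears more than once in a section.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read(ini_path)  # a missing file is silently ignored
    return parser.get(section, key, fallback=default)
```

A call such as read_result(r"C:\path\to\NZGames_Forums.ini", "Thread 1 name") (the path here is a placeholder) should give back the thread name once the plugin has parsed the page at least once.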
A more complete NZGames_Forums.ini is included in the scripts directory for you to look at. The other scripts demonstrate the use of the Format and Translate routines.