
Tutorial: Converting a phpBB forum into an RSS feed using Python.

RSS/Atom feeds are taking over the web: people like to keep up to date with the latest news, events, and ramblings, but they want to do so from a familiar, quick-loading environment. A feed is just a page of dynamically generated content, in a markup format that a feed reader can interpret. To generate your own RSS feed, you need either a feed-generating program or knowledge of a programming language with parsing or regular expression capabilities. Python is one such language, and this tutorial will show you how to make an RSS feed from any phpBB forum page using it.

First things first: we need to import the modules required to download a phpBB page, search it for certain patterns, and spit the result back out in an XML format. These are urllib (a high-level wrapper around socket and other modules, specifically for downloading over the net) and re (regular expressions).

#!/usr/bin/python
import urllib, re
# phpbb2rss.py - convert a phpBB board into an RSS feed

Then we need to send a header to the web browser or RSS reader, telling it what type of content we're going to send over the net.

print "Content-Type: application/rss+xml"
print # blank line

Now we need to set up a few variables which we will use later on: the page we want to convert to a feed, the page title, and the page description.

phpbb_page = "http://programmers-corner.com/forums/viewforum.php?f=11"
phpbb_pagetitle = "Python"
phpbb_pagedescr = "Python - Programmer's Corner Python Support"

We need to start setting up the rss page next. We'll store the following required information in a variable named rss:

rss = """<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
<channel>
<title>""" + phpbb_pagetitle + """</title>
<link>""" + phpbb_page + """</link>
<description>""" + phpbb_pagedescr + """</description>
"""
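One thing plain string concatenation does not handle is XML escaping: a title or description containing &, < or > would produce a malformed feed. Below is a minimal sketch of one way to guard against that, written in Python 3 syntax (the tutorial's own code is Python 2). The helper name build_rss_header and the sample description are hypothetical, chosen only for illustration.

```python
from xml.sax.saxutils import escape

# Same header layout as the tutorial, with placeholders for the three values.
RSS_HEADER = """<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
<channel>
<title>{title}</title>
<link>{link}</link>
<description>{description}</description>
"""

def build_rss_header(title, link, description):
    """Return the opening part of an RSS 2.0 document with escaped text."""
    return RSS_HEADER.format(
        title=escape(title),
        link=escape(link),
        description=escape(description),
    )

# Hypothetical sample values; the "&" in the description must be escaped.
header = build_rss_header(
    "Python",
    "http://programmers-corner.com/forums/viewforum.php?f=11",
    "Python - Q&A and support",
)
print(header)
```

The escaping matters most for forum URLs that carry several query parameters, since a bare & inside a link element is not valid XML.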

Now it starts to get trickier. We need to set up a regular expression to match the links on the forum page that we know will point to topics. View the source of a phpBB forum page, and you'll see that whenever there is a topic, it's given to the browser as a link with the class "topictitle". So we're going to look for this:

reg = re.compile("<a href=\".+?\" class=\"topictitle\">.+?</a>")

This basically says: compile a new regular expression object that will match any piece of text containing the characters <a href=", followed by one or more characters of anything (a . represents any character, a + means one or more of whatever precedes it, and a trailing ? makes the match non-greedy), followed by the text " class="topictitle">, then one or more characters of anything, followed by the characters </a>.

We then need to download the page at the URL that was specified. We're going to use urllib to do this.

page = urllib.urlopen(phpbb_page)
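Before running the pattern against a whole page, it can help to try it on a single line. The sketch below uses Python 3 syntax and a made-up line of markup (real phpBB output varies between versions and themes) to show that only the link carrying the topictitle class is matched.

```python
import re

# Same pattern as the tutorial, written as a raw string to avoid the backslashes.
reg = re.compile(r'<a href=".+?" class="topictitle">.+?</a>')

# Hypothetical sample line: one topic link followed by an unrelated profile link.
sample = ('<span class="topictitle"><a href="viewtopic.php?t=42" '
          'class="topictitle">Hello world</a></span> '
          '<a href="profile.php?u=7">someone</a>')

matches = reg.findall(sample)
print(matches)
```

Note that if an unrelated link appears before the topic link on the same line, the non-greedy .+? can still run from the first href up to the topictitle class, so the match may start earlier than expected.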

Now we want the base URL of the forum, meaning the address of the directory the forum lives in, so that the relative topic links can be turned into full URLs. To get it, we use a little regular expression trickery on the forum URL we were given.

phpbb_url = re.sub("viewforum\.php.+?$", "", phpbb_page)

This translates as: set the variable phpbb_url to the value of phpbb_page with viewforum.php, and the one or more characters that follow it up to the end of the string (signified by the dollar symbol), replaced with nothing.
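A short sketch of what the substitution produces, again in Python 3 syntax. It also shows urljoin from the standard library as an alternative way to resolve a relative link; that alternative is not part of the original tutorial, and the topic link is a hypothetical example.

```python
import re
from urllib.parse import urljoin

phpbb_page = "http://programmers-corner.com/forums/viewforum.php?f=11"

# The tutorial's substitution: drop "viewforum.php" and everything after it.
phpbb_url = re.sub(r"viewforum\.php.+?$", "", phpbb_page)
print(phpbb_url)

# Alternative: resolve a relative topic link directly against the page URL.
topic = urljoin(phpbb_page, "viewtopic.php?t=42")  # hypothetical topic link
print(topic)
```

Because the pattern requires at least one character after viewforum.php, a URL that ends exactly at viewforum.php would be left unchanged by the substitution.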

We have pretty much all the data we need, so we just need to loop through all the lines in the webpage we downloaded, look for our regular expression 'reg', and then extract the href part of the anchor tag and append it to our rss code, along with what we find in between the a tags.

for line in page.readlines():
    # print "Finding regexp..."
    if(reg.findall(line)):
        url = reg.findall(line)[0]
        title = url
        url = re.sub("<a href=\"", "", url)
        url = re.sub("\".+$", "", url)

        title = re.sub("<a.+?>", "", title)
        title = re.sub("</a>$", "", title)
        rss += " <item>\n"
        rss += " <title>" + title + "</title>\n"
        rss += " <link>" + phpbb_url + url + "</link>\n"
        rss += " </item>\n"

Whenever a line matches our regular expression 'reg', we extract the values we need from it and append them to the rss variable, wrapped in the proper tags (item, title and link).
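The same extraction can also be done in one step with capture groups instead of a chain of re.sub calls. The following is a sketch of that variant in Python 3 syntax, not the tutorial's method: the pattern is tightened with [^"]+ so the href cannot run past its closing quote, and the function name extract_topics and the sample line are hypothetical.

```python
import re

# Variant of the tutorial's pattern with two capture groups:
# group 1 is the href value, group 2 is the link text.
topic_re = re.compile(r'<a href="([^"]+)" class="topictitle">(.+?)</a>')

def extract_topics(line, base_url):
    """Return a list of (absolute_url, title) pairs found in one line of HTML."""
    return [(base_url + href, title) for href, title in topic_re.findall(line)]

# Hypothetical sample line in the shape the tutorial describes.
line = '<a href="viewtopic.php?t=42" class="topictitle">Hello world</a>'
topics = extract_topics(line, "http://programmers-corner.com/forums/")
print(topics)
```

Unlike taking only the first result of findall, this returns every topic link on the line, which may matter if a forum theme puts several topics on one line of markup.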

Finally, we just close up some tags in our rss variable, and send it to stdout.

rss += " </channel>\n"
rss += "</rss>"
print rss
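A quick way to check that output of this shape is well-formed is to parse it back with the standard library. This is a sketch in Python 3 syntax; the feed text is a hand-written stand-in with one hypothetical item, not output captured from the real forum.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed in the same shape as the one the script prints.
rss = """<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0">
<channel>
<title>Python</title>
<link>http://programmers-corner.com/forums/viewforum.php?f=11</link>
<description>Python - Programmer's Corner Python Support</description>
 <item>
 <title>Hello world</title>
 <link>http://programmers-corner.com/forums/viewtopic.php?t=42</link>
 </item>
 </channel>
</rss>"""

# Parse as bytes so the declared iso-8859-1 encoding is honoured.
root = ET.fromstring(rss.encode("iso-8859-1"))

# Collect the title of every item in the feed.
titles = [item.findtext("title") for item in root.iter("item")]
print(root.tag, titles)
```

If a topic title or link contained an unescaped & or <, ET.fromstring would raise a ParseError, which makes this a cheap sanity check before publishing a feed.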

If you don't really care about making this program and are just happy that someone else has, you can use a slightly modified version utilizing mod_python, which takes the page name and title (and optionally a description) through the URL: http://ceruleanwave.dotmod.net/phpbb2rss.py?p=<PAGENAME>&t=<TOPIC>

For example, http://ceruleanwave.dotmod.net/phpbb2rss.py?p=http://www.programmers-corner.com/forums/viewforum.php?f=11&t=Python
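Since the forum address passed in p contains its own ? and = characters, it is safer to percent-encode the parameters when building such a URL in code. The sketch below uses Python 3's urlencode; that the hosted script accepts percent-encoded values is an assumption, although query-string parsers normally decode them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

service = "http://ceruleanwave.dotmod.net/phpbb2rss.py"
params = {
    "p": "http://www.programmers-corner.com/forums/viewforum.php?f=11",
    "t": "Python",
}

# urlencode percent-encodes each value so the inner "?" and "=" stay inside p.
feed_url = service + "?" + urlencode(params)
print(feed_url)

# Decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(feed_url).query)
print(decoded["p"][0])
```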

I would put up a nice form for generating the feed, but as my host is moving datacenters _again_, I can't at the moment.

Hope you enjoyed the tutorial. Full source below.

#!/usr/bin/python
import urllib, re, string

"""
Convert a phpBB forum page into an RSS feed.
"""

print "Content-Type: application/rss+xml"
print

#
phpbb_page = "http://programmers-corner.com/forums/viewforum.php?f=1"
phpbb_pagetitle = "The Lounge"
phpbb_pagedescr = "The Lounge - discuss coding or anything else you want"
#

rss = """<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Generator:http://cerulean.pyresoft.com/blosxom.cgi/programs/phpbb2rss -->
<rss version="2.0">
<channel>
<title>""" + phpbb_pagetitle + """</title>
<link>""" + phpbb_page + """</link>
<description>""" + phpbb_pagedescr + """</description>

"""
reg = re.compile("<a href=\".+?\" class=\"topictitle\">.+?</a>")
page = urllib.urlopen(phpbb_page)
phpbb_url = re.sub("viewforum\.php.+?$", "", phpbb_page)
for line in page.readlines():
    # print "Finding regexp..."
    if(reg.findall(line)):
        url = reg.findall(line)[0]
        title = url
        url = re.sub("<a href=\"", "", url)
        url = re.sub("\".+$", "", url)

        title = re.sub("<a.+?>", "", title)
        title = re.sub("</a>$", "", title)
        rss += " <item>\n"
        rss += " <title>" + title + "</title>\n"
        rss += " <link>" + phpbb_url + url + "</link>\n"
        rss += " </item>\n"
rss += " </channel>\n"
rss += "</rss>"
print rss

Author Information:

Paul Giannaros

http://www.pyresoft.com

pyre@pyresoft.com

© 2008-2009 dotnet4all.com