RSS Feed Processing in Python

A while back, I wrote a utility called SyndiGram that downloaded a series of RSS feeds, formatted the last “n” post descriptions into an HTML file, and then mailed that file to me. The feeds were specified in a YAML file.

The library I was using didn’t always process the data very well, so I stopped using the utility and deleted the associated blog post.

Tonight, I wanted to resurrect the utility. This time, I tried to keep things even simpler. I do not use an external configuration file. Rather, the latter part of the code contains a series of calls to a function named “process_feed” that accepts a feed URL, a number of entries to process, and a boolean that indicates whether or not to process the “summary” text. In some blogs, the “summary” text is the entirety of the post.

I used Python 2.7 as the host language along with the “feedparser” library, which made things extremely easy, although I did have to do some Unicode-to-ASCII conversion and deal with the possibility of missing elements in each “item.”
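
For example, a defensive pattern along these lines (a sketch of the general idea, not the exact code below) avoids a KeyError when a field is absent and strips characters that can’t be represented as ASCII:

# Sketch (Python 2.7): guard against missing elements and
# non-ASCII text in a feedparser entry.
import feedparser

def safe_field(entry, key, default=""):
    # dict-style get() returns a default instead of raising KeyError
    value = entry.get(key, default)
    # drop anything that won't encode as ASCII
    return value.encode("ascii", "ignore")

d = feedparser.parse("http://www.kalzumeus.com/feed/")
for entry in d["items"][:3]:
    print safe_field(entry, "title"), safe_field(entry, "link")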

I should note that I have no idea whether all of the feeds in the sample below use the RSS format or some other syndication format. The feedparser library handles those differences for me.
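
If you are curious which format a particular feed uses, the parsed result carries a “version” attribute (values like “rss20” or “atom10”, or an empty string when feedparser can’t tell). A quick check might look like this:

# Sketch (Python 2.7): report the syndication format of each feed.
import feedparser

for url in ["http://www.kalzumeus.com/feed/",
            "http://modernjava.blogspot.com/feeds/posts/default"]:
    d = feedparser.parse(url)
    print url, "->", d.version or "unknown"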

Here’s the code … feeds.py:

# (python 2.7)
#
# Download my favorite site/blog feeds into an
# HTML file.
#
# Copyright (c) 2014 by Jim Lawless
# jimbo@radiks.net
# See MIT/X11 license at
# http://www.mailsend-online.com/license2014.php
#
# Uses the feedparser library. More info on this library
# can be found at https://wiki.python.org/moin/RssLibraries
#

import feedparser

def prologue():
    global fout
    fout=open("feeds.htm","w")
    fout.write("<html><head><title>Blog Feeds</title></head>\n")
    fout.write('<body style="font-size: 20;">\n')

def epilogue():
    global fout
    fout.write("</body></html>\n")
    fout.close()

def print_entry(single):
    global fout
    fout.write(single + "\n")

def print_item(item):
    # Write only the elements that are actually present in this entry.
    if item.has_key("date"):
        print_entry("Date : " + item["date"] + "<br>")
    if item.has_key("title"):
        print_entry("<a href=\"" + item["link"] + "\">" + 
            item["title"].encode("ascii","ignore") + "</a><br>")
    if item.has_key("summary"):
        print_entry("Summary:" + 
            item["summary"].encode("ascii","ignore")+"<br>")
    print_entry("<p>")


# First parameter is the feed url
# The second parameter is the maximum number of
#   entries to process.
# The third parameter is a boolean indicating whether
#   or not to show the summary information.
def process_feed(url,max,show_summary):
    feed = feedparser.parse(url)

    count = 0
    print "Processing " , feed["channel"]["title"]
    print_entry("<h2>" + feed["channel"]["title"] +"</h2>")
    for item in feed["items"]:
        count = count + 1
        if count <= max:
            if not show_summary:
                item["summary"]=""
            print_item(item)

prologue()

process_feed("http://feeds.feedburner.com/brandontreb", 5 , False)
process_feed("http://feeds.feedburner.com/HighScalability" , 5 , False)
process_feed("http://successfulsoftware.wordpress.com/feed" , 5 , True)
process_feed("http://feeds.feedburner.com/blogspot/hsDu" , 5 , False)
process_feed("http://www.kalzumeus.com/feed/" , 5 , True)
process_feed("http://www.dadhacker.com/blog/?feed=rss2" , 5 , True)
process_feed("http://modernjava.blogspot.com/feeds/posts/default" , 5 , False)

epilogue()

Under Windows, I run the script like this:

feeds.py

Processing  brandontreb.com
Processing  High Scalability
Processing  Successful Software
Processing  Android Developers Blog
Processing  Kalzumeus Software
Processing  Dadhacker
Processing  Modern Java

The above script produces a file named “feeds.htm”, which I then open in a web browser.
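
If I wanted the browser to open automatically after each run, the standard-library webbrowser module could handle that; a possible follow-up step (not part of the script above) would be:

# Sketch: open the generated file in the default browser.
# Assumes the script was run from the directory containing feeds.htm.
import os
import webbrowser

webbrowser.open("file://" + os.path.abspath("feeds.htm"))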

My intent is to run this script once a day or so, possibly reducing the maximum number of entries for some sites in the configurable section of the script, which is here (a data-driven variation is sketched after the listing):

prologue()

process_feed("http://feeds.feedburner.com/brandontreb", 5 , False)
process_feed("http://feeds.feedburner.com/HighScalability" , 5 , False)
process_feed("http://successfulsoftware.wordpress.com/feed" , 5 , True)
process_feed("http://feeds.feedburner.com/blogspot/hsDu" , 5 , False)
process_feed("http://www.kalzumeus.com/feed/" , 5 , True)
process_feed("http://www.dadhacker.com/blog/?feed=rss2" , 5 , True)
process_feed("http://modernjava.blogspot.com/feeds/posts/default" , 5 , False)

epilogue()
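
One possible tweak (not in the script above) would be to pull the per-site settings into a list of tuples so that the entry counts can be adjusted in one place:

# Sketch: a data-driven variation of the configurable section.
# Each tuple is (feed URL, max entries, show summary); the counts
# shown here are just illustrative.
FEEDS = [
    ("http://feeds.feedburner.com/HighScalability", 3, False),
    ("http://www.kalzumeus.com/feed/", 5, True),
]

prologue()
for url, max_entries, show_summary in FEEDS:
    process_feed(url, max_entries, show_summary)
epilogue()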

I can then add follow-up steps to send the HTML file via email should I desire to do so. For now, I think I will simply browse the HTML files. If I need to keep archival copies of the processed blog feeds, I will either have to use email as an archive or devise a separate archival process.
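
Should I go the email route, a minimal sketch using only the standard library might look like the following. The SMTP server name and the addresses are placeholders, and a real mail server would likely require authentication:

# Sketch (Python 2.7): mail feeds.htm as an HTML message.
# "smtp.example.com" and the addresses below are placeholders.
import smtplib
from email.mime.text import MIMEText

with open("feeds.htm") as f:
    msg = MIMEText(f.read(), "html")

msg["Subject"] = "Blog Feeds"
msg["From"] = "me@example.com"
msg["To"] = "me@example.com"

server = smtplib.SMTP("smtp.example.com")
server.sendmail(msg["From"], [msg["To"]], msg.as_string())
server.quit()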
