RS3 - "RSS Made Digestible"

RSS Summarizer / Text-to-Speech Synthesizer / Podcaster

Licensed under the GNU GPL

What's RS3 About?

The biggest time-waster with traditional RSS aggregators
(at least for me) is having to follow the link in every
feed item I am interested in just to read the full story.
With RS3, a concise and pretty darn accurate
key-sentence summary of each article is gathered for me automatically.
Instead of reading 30 headlines and 20 full articles, for example, I just have to read
(up to) 20 summaries. The time spent reading
is cut dramatically, while I still retain
the knowledge and ideas that each article is trying to convey.
RS3 can even generate ogg files of spoken text that summarize each of the articles
in your RSS feeds. The generated audio files and playlist make it even easier, and less
intrusive to your daily activities, to get the news and updated RSS articles you want.
These ogg files and playlists can even be copied to your favorite media player (a la podcasting)
for you to listen to during your drive to/from work, or just at your leisure!
Sound cool? Read on...

Check out this Almost-Live Example to see what RS3 looks like.
Please excuse the scary-sounding voice. I'm working on it! :)

What is RS3?

RS3 is short for "RSS Summary"; writing it "RS-3" is fine too.
RS3 in its current incarnation is a bash script
written for Linux that takes a set of RSS feeds of your choice
and downloads the full page that each of the RSS items links to.
That page is then converted to flat text to weed out
the ad banners, images, popups, and other junk that surrounds
the "meat" of an article.
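
As a rough sketch of this step (not the actual RS3 code), the download and
flattening could look like the function below. The name fetch_flat_text and
the temp-file handling are my own assumptions for illustration; only wget and
lynx come from the requirements list.

```shell
#!/usr/bin/env bash
# Sketch only: download one article and flatten it to plain text.
# fetch_flat_text is a hypothetical helper name, not taken from the RS3 script.
fetch_flat_text() {
    local url=$1 out=$2 page status
    page=$(mktemp) || return 1
    # Download the page that the RSS item links to.
    if ! wget -q -O "$page" "$url"; then
        rm -f "$page"
        return 1
    fi
    # -dump prints the rendered page as plain text on stdout,
    # -nolist omits the numbered list of links lynx appends by default,
    # -force_html makes lynx treat the temp file as HTML.
    lynx -dump -nolist -force_html "$page" > "$out"
    status=$?
    rm -f "$page"
    return "$status"
}
```

Example call (hypothetical URL and path): fetch_flat_text "http://example.com/story.html" /tmp/story.txt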

Next, the text of the article is run through the GNU OTS utility
(GNU Open Text Summarizer) which produces a summary that only
contains the most important sentences from each article. The full
article is boiled down to a bite-sized, digestible paragraph, so you end up
with short summaries of all of the new items and blog entries in your RSS feeds.
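
A minimal sketch of the summarizing step, assuming the ots command-line tool
accepts a --ratio option giving the percentage of the text to keep (check
ots --help on your system). The function name summarize_text and the default
of 20 are assumptions of mine, not values taken from RS3.

```shell
#!/usr/bin/env bash
# Sketch only: summarize a flat-text article with ots.
# summarize_text is a hypothetical helper name.
summarize_text() {
    local infile=$1 ratio=${2:-20}
    # --ratio is assumed to be the percentage of sentences ots keeps.
    ots --ratio="$ratio" "$infile"
}
```

Example call (hypothetical paths): summarize_text /tmp/story.txt 15 > /tmp/story.summary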

NEW!! I've just added the ability
to convert the text summary to SPEECH using the festival project. The
summaries are converted to speech in the ogg format and then a playlist
is generated from those files which can be copied to an iPod, or other
media player.
This is the start of automated news/articles/blogs/forums podcasting!!
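
Here is a hedged sketch of how the speech and playlist steps might be wired
together. It assumes festival's text2wave script for synthesis and oggenc
(from vorbis-tools, which is not in the requirements list) for encoding, and
it writes a plain M3U-style playlist. The function names, the encoder, and the
playlist format are all my assumptions rather than what RS3 necessarily does.

```shell
#!/usr/bin/env bash
# Sketch only: turn a text summary into an ogg file, then build a playlist.
# speak_summary and write_playlist are hypothetical helper names.
speak_summary() {
    local txt=$1 voice=$2 ogg=$3 wav
    wav="${ogg%.ogg}.wav"
    # text2wave ships with festival; -eval selects the voice before synthesis.
    text2wave -o "$wav" -eval "(voice_$voice)" "$txt" || return 1
    # oggenc is an assumption; -Q silences progress output, -o names the output.
    oggenc -Q -o "$ogg" "$wav" && rm -f "$wav"
}

write_playlist() {
    local dir=$1 playlist=$2 f
    : > "$playlist"                  # start with an empty playlist
    for f in "$dir"/*.ogg; do
        [ -e "$f" ] || continue      # skip the literal pattern if no files match
        basename "$f" >> "$playlist" # one file name per line, relative to $dir
    done
}
```

Because the playlist holds bare file names, it should be saved in the same
directory as the ogg files so a player can resolve them.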

Finally, a single HTML page is presented in your favorite browser so
you can peruse the summaries of all of the new
articles in your feeds. From the browser, you can even listen to the spoken
audio files summarizing the articles if you'd like!
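
To illustrate the idea, a page like this can be assembled with a few lines of
bash. This is only a sketch: the function names, the tab-separated input
format, and the markup are assumptions of mine and not RS3's real output.

```shell
#!/usr/bin/env bash
# Sketch only: build one HTML page from article summaries.
# Input on stdin: one article per line, fields separated by tabs:
#   title <TAB> url <TAB> summary-file [<TAB> ogg-file]
html_escape() {
    # Escape the characters that would otherwise be read as markup.
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

write_summary_page() {
    local title url summary_file ogg
    printf '<html><head><title>RS3 Summaries</title></head><body>\n'
    while IFS=$'\t' read -r title url summary_file ogg; do
        printf '<h3><a href="%s">%s</a></h3>\n' \
            "$(printf '%s' "$url" | html_escape)" \
            "$(printf '%s' "$title" | html_escape)"
        printf '<p>%s</p>\n' "$(html_escape < "$summary_file")"
        # Link the spoken version only when an ogg file was given.
        if [ -n "$ogg" ]; then
            printf '<p><a href="%s">listen</a></p>\n' "$ogg"
        fi
    done
    printf '</body></html>\n'
}
```

Example call (hypothetical file names): write_summary_page < articles.tsv > index.html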

How Standard RSS Aggregators Work

1. Set up a list of your favorite RSS feeds.
2. "Synchronize/Update" - Get new articles from each feed.
3. Read each article's headline and maybe the description.
4. If an article sounds interesting, click on the link to the original article and read the full article.
5. Repeat.

How You Read RSS feeds with RS3

1. Set up a list of your favorite RSS feeds.
2. Run the RS3 script.
3a. Read a single page of headlines and summaries avoiding all those pesky
ad banners, popups, images and other junk, or
3b. Get a podcasted playlist of the article summaries converted to speech.
4. Repeat.

Requirements

Currently, RS3 requires Linux, bash, perl with XML DOM support, festival, ots, sed, grep, lynx, and wget.
I will note that all of these came with my SuSE distro and are pretty common.
You'll also need some additional voices for festival if you'd like to get
voice variations in your playlists. Right now, the voices I use are
cmu_us_awb_arctic_hts, cmu_us_bdl_arctic_hts, cmu_us_jmk_arctic_hts, cmu_us_slt_arctic_hts
which are all available from here. Look for the section
entitled "HTS voices for Festival" and follow the README instructions there.
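
If you want to check up front that everything is installed, a small sketch
like this works. The helper name missing_tools is hypothetical and is not
part of RS3; it only checks that each command is on your PATH, not that perl
has the XML DOM module or that the festival voices are present.

```shell
#!/usr/bin/env bash
# Sketch only: print the name of every required command that is not on PATH.
# missing_tools is a hypothetical helper name.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage:
#   missing=$(missing_tools bash perl festival ots sed grep lynx wget)
#   [ -z "$missing" ] || { echo "Missing tools: $missing" >&2; exit 1; }
```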

Current Feature Set

- Generates accurate article summaries for about a dozen sites
- Customizable scraping and summarizing based on regular expression matching against a URL
- Customizable and expandable set of acceptable feeds to read
- 2-way "spam" filtering for unwanted RSS items based on URL and article content
- Flexible text-to-speech system (festival) used to convert summaries to speech
- Variations of voices allowed on a per-URL basis so articles from different sources sound different
- Command-line driven. Can be run from cron.
- Command line parameters allow verbosity control as well as speech synthesis control (on/off).
- For more frequent updates while you are sitting in front of a PC, speech synthesis can be turned off so you can just read the summaries.
- Complete control of summarization process based on url regex matching
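
To give a feel for the URL-driven features in the list above, here is a
sketch of per-URL voice selection and two-way filtering using bash regex
matching. The rule table, the patterns, the default voice, and the function
names are illustrative assumptions; RS3's real configuration may look quite
different. Only the voice names come from the requirements section.

```shell
#!/usr/bin/env bash
# Sketch only: choose a festival voice and filter items by URL regex.
# Each rule is "regex voice"; the first matching regex wins (hypothetical table).
VOICE_RULES=(
    'groklaw\.net/ cmu_us_awb_arctic_hts'
    'eweek\.com/ cmu_us_bdl_arctic_hts'
)
DEFAULT_VOICE=cmu_us_slt_arctic_hts

voice_for_url() {
    local url=$1 rule regex voice
    for rule in "${VOICE_RULES[@]}"; do
        regex=${rule% *}    # everything before the last space
        voice=${rule##* }   # everything after the last space
        if [[ $url =~ $regex ]]; then
            printf '%s\n' "$voice"
            return 0
        fi
    done
    printf '%s\n' "$DEFAULT_VOICE"
}

# Two-way filter: reject on the URL first, then on the article text.
# Both patterns are made-up examples.
BLOCKED_URL_REGEX='/(ads|sponsored)/'
BLOCKED_TEXT_REGEX='sponsored link|press release'

is_unwanted() {
    local url=$1 textfile=$2
    [[ $url =~ $BLOCKED_URL_REGEX ]] && return 0
    # -E extended regex, -i ignore case, -q quiet (exit status only)
    grep -Eiq "$BLOCKED_TEXT_REGEX" "$textfile"
}
```

is_unwanted returns success (0) when an item should be skipped, so it can be
used directly in an if statement before the summarizing step.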

The Future, and my To-Do List

- My biggest peeve with this project is the voices used in festival. I'm looking for a
really good text-to-speech system that's open source and that sounds natural. This task
is definitely a full-time job and I just don't have time for it. Any volunteers??? :)
- A native C/C++ version or perl/php/python/ruby version with a nice GUI and X/Gnome/KDE integration.
- I would love to get this to automatically copy the playlist and ogg files to a media player.
- Support for MP3. Some Linux distros have issues with MP3 codecs, so I went with ogg for now.
- Customizable 'profiles' so I can run the script in the morning and evening and get voice files,
and then run throughout the day without voice files so I can manually read articles I want. This can be done now through command line parameters and cron jobs.
- Better scraping of large articles like those from groklaw and distro security advisories on eweek.
- Be able to handle more sites and layouts. I have written about 10 so far, but I need a lot of help with this!

Does this sound interesting? Do you think it has potential?
Give us a hand then! Check out rs3 on SourceForge.net
and download the latest tar.gz and have at it!
Or email me at alan8373@gmail.com

RS3 on SourceForge.net
Check out this Semi-Live Example to see what RS3 looks like.


