Monday
18Jan2010

Delicious Curation

In the first decade of the 2000s, effort in information retrieval focused on discovering good sources. As people found sources they enjoyed and respected, data formats such as RSS were adopted and tools such as Delicious emerged. Towards the end of the decade, sources were plentiful, and content was widely available from traditional news media companies as well as blogs. Twitter and Facebook also appeared, providing a way to exchange information within trusted circles: your friends.

The challenge I face today consists of 200 RSS feeds, 50-odd Delicious subscriptions, Twitter links, Facebook links, and, on rare occasions, links via email. Any post that I think I will read gets sent to Instapaper, which has become my list of good reads. Inundated with information, I need a way to curate these data sources so that I am only looking at a handful of hot items in a given topic.

The largest time sink is the Delicious feeds. Within any given tag there can be redundant posts, and the same post can reappear every day. Delicious does not track what you have seen; it focuses on what's newly bookmarked and what's popular. For these reasons, consuming Delicious feeds in Google Reader or NetNewsWire becomes an endless battle.

As an initial step in improving my information gathering process, I've created a very basic Python script to grab Delicious feeds, track posts that have been downloaded before, and remove duplicates. My hope is that this will greatly reduce the time spent scanning through Delicious feeds. The script tracks the tags I'm interested in, persists posts to a local SQLite3 database, and exports an RSS file that NetNewsWire consumes.
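The "seen it before" idea is easiest to see in isolation. The snippet below is only a minimal sketch, not the script itself: it uses an in-memory SQLite database, a helper name (`record_link`) invented for illustration, and made-up example.com bookmarks. It is written so it should also run on a current Python 3.

```python
import sqlite3

def record_link(conn, url, title):
    """Store a link once; return True only the first time it is seen."""
    cur = conn.execute('SELECT HITS FROM LINKS WHERE URL = ?', (url,))
    if cur.fetchone():
        # Seen before: count the repeat, but do not surface it again.
        conn.execute('UPDATE LINKS SET HITS = HITS + 1 WHERE URL = ?', (url,))
        return False
    conn.execute('INSERT INTO LINKS(URL, TITLE, HITS) VALUES (?, ?, 1)',
                 (url, title))
    return True

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE LINKS (URL TEXT NOT NULL PRIMARY KEY, '
             'TITLE TEXT, HITS INT NOT NULL)')

# Hypothetical bookmarks: the first URL shows up twice,
# as it might when two tags carry the same post.
bookmarks = [
    ('http://example.com/a', 'Post A'),
    ('http://example.com/b', 'Post B'),
    ('http://example.com/a', 'Post A'),
]

# Only links that were not already stored make it into the new feed.
fresh = [url for url, title in bookmarks if record_link(conn, url, title)]
hits = dict(conn.execute('SELECT URL, HITS FROM LINKS'))
```

Here `fresh` ends up holding each URL once, while `hits` records that the first URL was bookmarked twice. That hit count is the sort of signal a later "what's hot" layer could build on.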

The next step is to do the same for my existing RSS feeds, turn the curation process into a Django application, and look into having Shaun Inman's Fever application add a layer that highlights hot posts.

I've pasted the Python code below; feel free to take it.

#!/usr/bin/env python
import urllib2
import datetime
import sqlite3
import PyRSS2Gen
import time

feed = 'http://feeds.delicious.com/v2/json/tag/%s?plain&count=25'

# Tags to follow on Delicious
tags = ['python']

class CurateDB:
	""" Database wrapper of feeds """
	
	def __init__(self):
		""" Establish connection to db and setup tables if need be """
		self.conn = sqlite3.connect('curate.db',isolation_level=None,
			detect_types=sqlite3.PARSE_DECLTYPES|sqlite3.PARSE_COLNAMES)
		self.cur = self.conn.cursor()
		
		""" Create a table that maps tags to date last updated """
		self.cur.execute('CREATE TABLE IF NOT EXISTS TAGS (TAG VARCHAR(200) NOT NULL PRIMARY KEY, LAST_UPDATED TIMESTAMP)')
		self.cur.execute('CREATE TABLE IF NOT EXISTS LINKS (URL TEXT NOT NULL PRIMARY KEY, TITLE TEXT, HITS INT NOT NULL)')
	
	def get_last_updated(self, tag):
		""" Retrieve the last time a tag was updated """
		self.cur.execute('SELECT LAST_UPDATED [timestamp] FROM TAGS WHERE TAG = ?', (tag,))
		row = self.cur.fetchone()
		if row:
			return row[0]
		else:
			return None
			
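	def set_last_updated(self, tag, updated=None):
		""" Update last updated timestamp """
		if updated is None:
			# A default of now() in the signature is evaluated only once, at
			# definition time. Compute it per call instead, in UTC to match
			# the UTC ('Z') dates on Delicious posts.
			updated = datetime.datetime.utcnow()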
		if self.get_last_updated(tag):
			self.cur.execute('UPDATE TAGS SET LAST_UPDATED = ? WHERE TAG = ?', (updated, tag))
		else:
			self.cur.execute('INSERT INTO TAGS(TAG,LAST_UPDATED) VALUES (?,?)', (tag, updated))
	
	def add_post(self, post):
		""" Add a post entry if one does not already exist """
		self.cur.execute('SELECT URL FROM LINKS WHERE URL = ?', (post.link,))
		row = self.cur.fetchone()
		if row:
			print 'Already saved link %s' % post.link
			self.cur.execute('UPDATE LINKS SET HITS = HITS + 1 WHERE URL = ?', (post.link,))
			return False
		else:
			self.cur.execute('INSERT INTO LINKS(URL,TITLE,HITS) VALUES (?,?,1)', (post.link, post.title))
			print 'Saved link %s' % post.link
			return True
	
	def shutdown(self):
		""" Clean up resources of sqlite3 db """
		self.cur.close()
		self.conn.close()

class Post:
	""" Represents a Delicious post 
		
		link - The URL to the post
		title - Title of the post
		date - Date the post was bookmarked
	"""
	def __init__(self, item):
		""" Unmarshall a JSON encoded item """
		self.link = item['u'].replace('\\','')
		self.title = item['d'].replace('\\','')
		self.date = datetime.datetime.strptime(item['dt'], "%Y-%m-%dT%H:%M:%SZ")
		
	def is_newer(self, adate):
		""" True if this post was bookmarked after adate """
		return self.date > adate
	
	def __repr__(self):
		return self.link
		
def get_posts(tag):
	""" Retrieve recent posts from Delicious """
	url = feed % tag
	req = urllib2.Request(url)
	res = urllib2.urlopen(req)
	import json  # parse the response as data instead of eval()-ing remote text as code
	return map(Post, json.loads(res.read()))

db = CurateDB()
items = []

# Delicious dates are UTC (trailing 'Z'), so build the fallback cutoff in UTC too
min_date = datetime.datetime.utcnow() - datetime.timedelta(days=2)
try:	
	for tag in tags:
		print 'Finding recent posts for %s' % tag
		last_updated = db.get_last_updated(tag)
		if not last_updated:
			last_updated = min_date
		posts = get_posts(tag)
		for post in posts:
			if post.is_newer(last_updated):
				""" Only review posts after last updated """
				if db.add_post(post):
					items.append(PyRSS2Gen.RSSItem(title=post.title,link=post.link,guid=PyRSS2Gen.Guid(post.link),pubDate=post.date))
		""" Update last updated tag """
		db.set_last_updated(tag)
		""" Play nice with Delicious feeds, don't hammer the service """
		time.sleep(1)
finally:
	db.shutdown()

if len(items) > 0:
	print 'Found %d new items' % len(items)
	rss = PyRSS2Gen.RSS2(title='Delicious Feed',link='http://localhost/',
		description='Recent posts from Delicious',lastBuildDate=datetime.datetime.now(),items=items)
	# Close the file explicitly so the feed is flushed to disk
	with open('delicious.xml','w') as out:
		rss.write_xml(out)
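
One step worth looking at on its own is turning the feed body into posts and keeping only those newer than the last run. The snippet below is a sketch under assumptions: the payload is made up, and `parse_items` and `newer_than` are names invented for illustration. The field names `u`, `d` and `dt` and the date format are the ones the script relies on. It parses the body with the standard `json` module rather than evaluating it as Python, and should run on a current Python 3.

```python
import datetime
import json

DATE_FORMAT = '%Y-%m-%dT%H:%M:%SZ'

def parse_items(payload):
    """Turn a JSON feed body into (link, title, date) tuples."""
    posts = []
    for item in json.loads(payload):
        date = datetime.datetime.strptime(item['dt'], DATE_FORMAT)
        posts.append((item['u'], item['d'], date))
    return posts

def newer_than(posts, cutoff):
    """Keep only posts bookmarked after the cutoff."""
    return [post for post in posts if post[2] > cutoff]

# Hypothetical payload; note the escaped slashes, which JSON permits.
sample = r'''[
  {"u": "http:\/\/example.com\/old", "d": "Old post", "dt": "2010-01-10T08:00:00Z"},
  {"u": "http:\/\/example.com\/new", "d": "New post", "dt": "2010-01-17T21:30:00Z"}
]'''

# Stand-in for the stored "last updated" timestamp of a tag.
cutoff = datetime.datetime(2010, 1, 16)
recent = newer_than(parse_items(sample), cutoff)
```

With this sample only the second bookmark survives the cutoff. The JSON parser also undoes the escaped slashes itself, so the link comes back as a plain URL without any manual backslash stripping.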