reddit announced last week that they’re bringing back the reddit.com beta testing program, and one of the interesting new features is the improved subreddit search. As the admins themselves admit, searching for subreddits has always been a major pain point, and the new search vastly improves the quality of results. I had been working on a subreddit search feature for SnoopSnoo as well, and right now seems like a good time to release it!

Let’s compare reddit’s old and new search algorithms — searching for “robots” using the old search gives us results with /r/DaftPunk and /r/plotholes pretty high up in the list, presumably because both these subreddits include the word “robots” in their descriptions. The new search for “robots” returns results that are a lot more relevant — /r/DaftPunk and /r/plotholes still appear, but they are preceded by subreddits that are actually about robots. Great!

Now, how can SnoopSnoo improve search results? One advantage it has is that it knows /r/DaftPunk is about music and /r/plotholes is about movies, thanks to the categorization of subreddits that I had written about earlier. This comes in handy when searching for subreddits.

Let’s search for “robots” on SnoopSnoo. Thanks to stemming, it also includes results for “robotics”. Yay! Of course, not-entirely-relevant subreddits such as /r/DaftPunk, /r/plotholes and /r/robotchicken are included (and they should be, because they do have the word “robots” in them somewhere), but the search lets you restrict your results to particular topics: search for “robots topic:technology”, and only subreddits that are classified under Technology are returned.
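To illustrate why stemming makes “robots” match “robotics”, here is a minimal sketch. It uses a toy suffix-stripping stemmer as a stand-in for a real one (such as the Porter stemmer); the suffix list and the function names are assumptions for illustration, not the code the site runs.

```python
# Toy suffix-stripping stemmer: a crude stand-in for a real stemmer.
# The suffix list below is an assumption chosen for this example.
SUFFIXES = ("ics", "ic", "es", "s")


def stem(word):
    """Lowercase a word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are not mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches(query, description):
    """Return True if any word in the description shares a stem with the query."""
    target = stem(query)
    return any(stem(w.strip(".,!?")) == target for w in description.split())
```

With this sketch, both “robots” and “robotics” reduce to the stem “robot”, so a query for one finds descriptions that mention the other.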

Let’s look at a few more interesting examples:

The search also supports a small number of filters and operators that I hope you find useful:

  • “cats subscribers<5000” returns subreddits about cats that have fewer than 5000 subscribers, for when you are purposely looking for smaller subreddits.
  • “music created>2013-05-10” returns subreddits about music that were created within the past two years.
  • “hardcore over18:false” excludes 18+ subreddits from the results. Use “over18:true” if you want only 18+ subreddits returned — the search does not judge.
  • Common search operators:

It’s exciting to release this new feature, but it does have its limitations — it only searches subreddit metadata, not content in posts. The index is also currently limited to the 30K subreddits that I have data for (UPDATE 05/13: The index has now been updated to include over 800K subreddits, thanks /u/GoldenSights!) but I’m working hard on adding more and more subreddits.
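As a rough illustration of the filter syntax listed above, a query could be split into free-text terms and filters and then checked against each subreddit's metadata. This is only a sketch under assumptions: the record fields (`subscribers`, `created`, `over18`, `topic`) and the helper names are hypothetical, and the real search may work quite differently.

```python
import re
from datetime import date

# Matches tokens such as "subscribers<5000", "created>2013-05-10", "topic:technology".
FILTER_RE = re.compile(r"^(\w+)([<>:])(.+)$")


def parse_query(query):
    """Split a query into free-text terms and (field, operator, value) filters."""
    terms, filters = [], []
    for token in query.split():
        m = FILTER_RE.match(token)
        if m:
            filters.append(m.groups())
        else:
            terms.append(token)
    return terms, filters


def coerce(field, raw):
    """Convert the raw filter value to a type comparable with the record field."""
    if field == "subscribers":
        return int(raw)
    if field == "created":
        return date.fromisoformat(raw)
    if field == "over18":
        return raw.lower() == "true"
    return raw.lower()


def passes(sub, filters):
    """Return True if a subreddit record (hypothetical dict) satisfies every filter."""
    for field, op, raw in filters:
        want = coerce(field, raw)
        have = sub[field]
        if isinstance(have, str):
            have = have.lower()
        if op == "<" and not have < want:
            return False
        if op == ">" and not have > want:
            return False
        if op == ":" and have != want:
            return False
    return True
```

For example, `parse_query("cats subscribers<5000")` yields the term `cats` plus one filter, and `passes` can then be applied to each candidate record before ranking.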

Thanks for reading, and I hope you enjoy the new search feature. Feedback and bug reports are welcome!

As I’d written in my previous blog post, SnoopSnoo had recently analyzed over 100,000 users. Since then, more than 10,000 new users have analyzed their profiles, and hundreds of users are discovering the site every day. Thanks to this amazing growth, it is now possible to see how you compare to the average user.

SnoopSnoo now shows how much better (or worse) your stats are than the average user’s. The average values will only get more accurate as the user base grows. The comparison is currently limited to average karma and unique word usage; if you’d like to see other kinds of comparisons, let me know!

SnoopSnoo stats comparison

SnoopSnoo went online on December 30, 2014 and since then, it’s been posted on several subreddits and even made it to the frontpage a couple of times. Over 100,000 users were analyzed, and they gave feedback more than 200,000 times! Here’s a brief look at the data from the past 80 days.

Users

Users analyzed: 100,136
Users who gave feedback: 21,337 (21% of users analyzed)

Feedback

Total feedback entries: 213,954 (Thank you, everyone!)
Overall accuracy: 83% (Surprisingly higher than I expected!)
Most accurate: Gaming, 91%
Least accurate: Relationship partner, 58%

It’s been really exciting to watch SnoopSnoo grow. Keep an eye out for more features coming soon!


Bonus Picture Time: I was in Austin last weekend hanging out with the Google Cloud Platform team, and a small crowd formed around the table to check out SnoopSnoo. It was a lot of fun to see users’ reactions in person!

Show off your snoovatar

01 March 2015

In the latest edition of “Features Nobody Requested”, I am happy to announce that SnoopSnoo now displays your snoovatar (if you have gold and have set one up)! It looks like this:

SnoopSnoo now shows snoovatars!

As far as I know, reddit doesn’t give you access to snoovatars via the API and it’s not a matter of simply scraping a static image from your snoovatar page either, because it’s dynamically constructed using JavaScript and Canvas. So I wrote some simple code to generate it myself using the Python Imaging Library. If you come across any issues, please let me know.
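For readers curious what generating such an image might involve, here is a minimal sketch of stacking transparent layers with Pillow (the maintained fork of the Python Imaging Library). The layers here are solid-colour stand-ins and `compose_layers` is a hypothetical helper; the actual snoovatar assets and the real code are not shown in this post.

```python
from PIL import Image


def compose_layers(layers, size=(64, 64)):
    """Stack RGBA layers bottom-to-top onto a transparent canvas."""
    canvas = Image.new("RGBA", size, (0, 0, 0, 0))
    for layer in layers:
        # alpha_composite requires both images to be RGBA and the same size.
        canvas = Image.alpha_composite(canvas, layer.convert("RGBA").resize(size))
    return canvas


# Stand-in layers: an opaque "body" and a mostly transparent "hat".
# A real avatar would load one image per body part or accessory instead.
body = Image.new("RGBA", (64, 64), (255, 69, 0, 255))
hat = Image.new("RGBA", (64, 64), (0, 0, 0, 0))
hat.paste((0, 0, 255, 255), (16, 0, 48, 16))

avatar = compose_layers([body, hat])
```

The result can be written out with `avatar.save("snoovatar.png")`, where the file name is just an example.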

The code is available as a Blockspring function if you’re interested — just enter your username and run the function to generate your snoovatar in PNG format. To use it in your own app, sign up for Blockspring and use your API key to call the function — they make it ridiculously simple to integrate APIs like these in your apps, check them out!

There are over 9,000 active subreddits on reddit, and according to reddit metrics, over 585,000 in total! There’s probably a subreddit for any topic you can think of, no matter how obscure or specific. (If not, you can create one yourself.) So there’s indeed a subreddit for that, for pretty much any value of that. That’s great, but how do you actually find new subreddits?

I think there are two interesting areas where there’s a lot of room for improvement — subreddit discovery and recommendations.

Subreddit Discovery

There are already a few resources to help you find subreddits:

  1. reddit’s own subreddit listing page.
  2. Meta subreddits such as /r/SubredditOfTheDay and /r/FindAReddit.
  3. Third-party websites such as reddit metrics and redditlist.

These are great if you already have a specific keyword in mind, or if you are simply interested in finding new and trending subreddits regardless of topic. But what if you wanted to browse subreddits by topic, like a dmoz-style directory? You could visit subreddits.org or metareddit.com, but they seem outdated. So I built a directory of subreddits.

A brief background: while building SnoopSnoo, I decided to manually group the top 2,500 subreddits by topic so I would be able to tell, at least at a high level, what subject areas a user was interested in. For example, it was straightforward to presume that a user’s activity in /r/python and /r/java meant that they were interested in “Technology > Programming”. The tedious (and painful) prerequisite was, of course, having to manually file /r/python and /r/java under “Programming” (and do this for each of the 2,500 subreddits). After several frustrating attempts, I managed to categorize most of the top 2,500 subreddits under several topics and continued to build the rest of the site. I then realized that I had already created a mini-directory of subreddits grouped by topic!
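The idea of inferring interests from a hand-built subreddit-to-topic mapping can be sketched as follows. Only the /r/python and /r/java entries come from the text above; the other entry, the data shapes and the function name are assumptions made for illustration.

```python
from collections import Counter

# Hand-built mapping from subreddit to topic.
# "python" and "java" follow the example in the post; "Cooking" is an assumed entry.
SUBREDDIT_TOPICS = {
    "python": "Technology > Programming",
    "java": "Technology > Programming",
    "Cooking": "Food",
}


def user_interests(activity):
    """Sum a user's activity per topic.

    activity maps subreddit name -> number of posts and comments (hypothetical shape).
    Subreddits without an assigned topic are ignored.
    """
    totals = Counter()
    for subreddit, count in activity.items():
        topic = SUBREDDIT_TOPICS.get(subreddit)
        if topic is not None:
            totals[topic] += count
    return totals.most_common()
```

Calling `user_interests({"python": 5, "java": 3})` groups both subreddits under a single programming topic, which is the high-level view described above.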

The next logical step was to expand the directory by adding more subreddits, but doing it manually wasn’t going to scale. So, like any lazy good programmer, I automated the process:

  • For each of the top ~13,000 subreddits, I gathered a list of its related subreddits. I used sidebar links and crossposts to measure relatedness between subreddits. This gave me a list of around 28,000 subreddits.
  • I filtered out subreddits with fewer than 1,000 subscribers to keep things simple, which shortened the list down to around 19,000 subreddits.
  • SnoopSnoo currently lets users suggest topics for subreddits that have none assigned. Using this helpful data along with the manual list I already had, I wrote a program to automatically assign topics to the remaining subreddits.
  • Subreddits that couldn’t be assigned a topic (such as self-post only subreddits that have no useful crossposts data) were assigned “General” by default.
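The steps above could be approximated with a simple majority vote over related subreddits. The post does not say how the assignment program actually works, so this is one plausible sketch under assumptions: `related`, `known_topics` and `subscribers` are hypothetical dictionaries, and the voting rule is a stand-in technique.

```python
from collections import Counter

MIN_SUBSCRIBERS = 1000  # threshold mentioned in the list above


def filter_by_size(subscribers):
    """Keep only subreddits with at least MIN_SUBSCRIBERS subscribers."""
    return {name for name, count in subscribers.items() if count >= MIN_SUBSCRIBERS}


def assign_topic(subreddit, related, known_topics, default="General"):
    """Pick the most common topic among a subreddit's related subreddits.

    related maps subreddit -> list of related subreddits (e.g. from sidebar
    links and crossposts); known_topics maps subreddit -> topic for subreddits
    already categorized manually or by user suggestions.
    """
    votes = Counter(
        known_topics[name]
        for name in related.get(subreddit, ())
        if name in known_topics
    )
    if not votes:
        # No usable signal: fall back to the default topic.
        return default
    return votes.most_common(1)[0][0]
```

A subreddit with no related subreddits that carry a topic falls through to “General”, mirroring the last step in the list.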

And that’s how I built a directory of thousands of subreddits categorized by topic. You can check it out here and I hope you find it useful.

Subreddit Recommendations

Another problem that I find interesting is recommending subreddits based on user activity. Currently, reddit doesn’t seem to provide tailored subreddit recommendations to users based on their activity. For instance, if you are already active on /r/Cooking and /r/recipes, perhaps you would also like /r/AskCulinary or /r/budgetfood?

Since I now have enough data about subreddits, I figured it’d be worth adding subreddit recommendations to SnoopSnoo. It now shows recommendations for subreddits that you may like and also lets you vote on how useful you find them. As always, user feedback is extremely helpful in improving my algorithms and I look forward to all kinds of feedback, suggestions and criticism.

Making it better: While this recommendation technique (known as content-based filtering) is better than completely random choices, it is still very limited by nature. If I deduce that you are interested in cooking, I can only recommend subreddits related to cooking and food at best. A more useful recommendation system would also do what is known as collaborative filtering, where recommendations are derived using not only the type of content a user already likes, but also from new content liked by similar users. This is a harder problem to solve and requires a lot more data than I currently have, but it’s something that I hope to explore in the near future.
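To make the collaborative filtering idea concrete, here is a minimal user-based sketch: users with overlapping subreddits are treated as similar, and subreddits they use that you do not are scored by that similarity. The Jaccard measure, the data shape and the function names are all assumptions; this is not something the site currently does.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of subreddits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, users, top_n=3):
    """Recommend subreddits to `target` based on what similar users are active in.

    users maps username -> set of subreddits the user is active in (hypothetical data).
    """
    mine = users[target]
    scores = {}
    for name, subs in users.items():
        if name == target:
            continue
        similarity = jaccard(mine, subs)
        # Every subreddit the other user has and the target lacks gets a vote
        # weighted by how similar the two users are.
        for sub in subs - mine:
            scores[sub] = scores.get(sub, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [sub for sub, score in ranked[:top_n] if score > 0]
```

Unlike the topic-based approach, nothing here needs to know that /r/Cooking is about food; the recommendation comes purely from overlapping user activity, which is also why it needs much more data to work well.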

Version 0.2, of course. Since the public release of SnoopSnoo ten days ago, I’ve received lots of feedback that helped me make some important changes. I’ve just published v0.2 of the site, and I hope to continue to add features and improve accuracy. Here is the list of new features that went live today:

  • You can now refresh data to keep your profile up-to-date.
  • Client-side fetching of data to prevent “Server too busy” errors.
  • See which subreddits are most likely to bring you maximum karma.
  • Charts detailing last 60 days of activity.
  • All charts updated to use local timezone instead of UTC.
  • Help categorize your subreddits.
  • Marginally improved NLP.
  • Bug fixes and cosmetic changes.

Like the new features? Hate the changes? Want a new feature altogether? Let me know in the comments below, PM me on reddit or head over to /r/SnoopSnoo.

Hello reddit

05 January 2015

I had wanted to add several new features to SnoopSnoo before I released it publicly, such as letting users authenticate via OAuth, profile views and even more charts and analytics. But after realizing that it would never be truly ready for public release, I bought SnoopSnoo.com, uploaded the current version and posted the link on /r/InternetIsBeautiful and /r/dataisbeautiful. Both posts were received quite well, and I got a ton of useful feedback from redditors.

Someone suggested that I crosspost it to /r/secretsanta - I did and the response was excitingly positive. I hadn’t even considered this use case, so it was great to discover new ways people could use the site.

Now, on to actually making use of the feedback I received and making some changes!

Let's begin

01 January 2015

Hello, and welcome to SnoopSnoo’s development blog. I’ve been on reddit since its inception way back in 2005 (mostly lurking, though) and have always thought it would be interesting to analyze the massive amounts of data it generates, especially after its huge growth in the past few years, so I built SnoopSnoo.

As of today, it simply does two things:

  • Aggregate your reddit submissions and comments, and graphically display resulting data.
  • Parse text in your comments and submissions, and extract relevant and potentially interesting information.
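As a rough idea of what the aggregation step might look like, the sketch below groups a user's comments and submissions by subreddit and computes the count and average score for each. It assumes each item is a dictionary with `subreddit` and `score` keys, and the function name is hypothetical; the site's real pipeline is not shown here.

```python
from collections import defaultdict


def karma_by_subreddit(items):
    """Return (subreddit, item count, average score) rows, best average first.

    items is a list of dicts with "subreddit" and "score" keys (assumed shape).
    """
    totals = defaultdict(lambda: {"count": 0, "score": 0})
    for item in items:
        entry = totals[item["subreddit"]]
        entry["count"] += 1
        entry["score"] += item["score"]
    rows = [
        (subreddit, t["count"], t["score"] / t["count"])
        for subreddit, t in totals.items()
    ]
    # Sort by average score so the most rewarding subreddits come first.
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Rows like these are the kind of aggregated data that can then be handed to a charting library for display.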

I built the web app in about three weeks, but I’ve been working on its NLP components for a couple of months now. I’m still new to NLP and as I try to wrap my head around it, results may often be erratic.

The site is built on Flask and hosted on the Google Cloud Platform. Charts are generated using D3.js. The NLP component is built on TextBlob and NLTK, and is hosted on Blockspring.

I plan to update this blog regularly as the site evolves. Really. I’ve even made a New Year’s resolution and everything.