Build a Reddit bot in Python that suggests similar posts.

Introduction

RSPBot – source code.

In this post we are going to build a Reddit bot that suggests related posts. Think of it as a “related posts” page for a subreddit. The idea is simple: when someone creates a new post, RSPBot replies with a list of similar posts (if any are available).

Here is an example from r/smallbusiness:

Original post:

Getting my small business bank account today. Should I go with a small credit union or a large bank?

Related posts:

Best small business bank account?

Bank recommendations for small business.

Best Small Business Bank

Which bank do you use for you small business and why?

Recommendations for Small Business Bank

As a side effect, this is a good way to gauge how important a topic is to a particular subreddit’s audience, what problems they are trying to solve, and how.

Before moving on, I suggest taking a look at the etiquette guidelines for Reddit bots. They are a set of rules and suggestions about what to do (and what not to do) with your bot. Don’t ignore them: a banned bot is a sad, dead bot 🙂

Scraping Reddit

Consider the following workflow: RSPBot monitors a subreddit for new submissions; when a post appears, it extracts the title, searches the same subreddit for similar posts, and replies with a list of related posts. This sounds easy, but in an active subreddit with a large number of new submissions, running a search for every post title means there is a good chance that after a while you will simply flood Reddit with search API queries.

What if we scrape the subreddit titles beforehand, treat them as our database of post titles, and build a mechanism to search through them before replying with a list of similar topics? This simplifies the workflow nicely, since all the title matching can then happen on our local machine.

Scraping Reddit posts is pretty simple thanks to the well-documented API, but we will use the Python API wrapper PRAW, since it encapsulates the API calls and provides an easy-to-use programming interface. More importantly, PRAW splits large requests into multiple API calls, each separated by a delay, so your bot stays within Reddit’s API guidelines.

Here is an excerpt from the subreddit scraper source code:

import json
import os

import praw

# Reddit app credentials; load these from your own config file.
# The placeholder values below must be replaced with your own.
auth = {'user_agent': 'rspbot', 'client_id': '...', 'secret_id': '...'}

def add_comment_tree(root_comment, all_comments):
    # Recursively convert a PRAW comment (and its replies) into nested dicts.
    comment_prop = {'body': root_comment.body,
                    'ups': root_comment.ups}
    if root_comment.replies:
        comment_prop['comments'] = list()
        for reply in root_comment.replies:
            add_comment_tree(reply, comment_prop['comments'])
    all_comments.append(comment_prop)

def get_submission_output(submission):
    # Collect the submission fields we want to keep in the JSON database.
    return {
            'permalink': submission.permalink,
            'title': submission.title,
            'created': submission.created,
            'url': submission.url,
            'body': submission.selftext,
            'ups': submission.ups,
            'comments': list()
        }

def save_submission(output, submission_id, output_path):
    # Flush to a file named after the submission ID.
    out_file = os.path.join(output_path, submission_id + ".json")
    with open(out_file, "w") as fp:
        json.dump(output, fp)

def parse_subreddit(subreddit_name, output_path, include_comments=True):
    reddit = praw.Reddit(user_agent=auth['user_agent'], client_id=auth['client_id'],
        client_secret=auth['secret_id'])
    subreddit = reddit.subreddit(subreddit_name)
    # Note: subreddit.submissions() was removed in PRAW 6;
    # subreddit.new() returns up to ~1000 of the most recent submissions.
    for submission in subreddit.new(limit=None):
        print("Working on ...", submission.title)
        output = get_submission_output(submission)
        if include_comments:
            # Resolve "load more comments" stubs before walking the tree.
            submission.comments.replace_more(limit=0)
            for comment in submission.comments:
                add_comment_tree(comment, output['comments'])
        save_submission(output, submission.id, output_path)

if __name__ == '__main__':
    parse_subreddit("smallbusiness", "/tmp/smallbusiness/", include_comments=False)

The parse_subreddit function iterates over the posts in a subreddit, extracts the fields defined in get_submission_output (plus an optional comment tree), and saves each post to a JSON file named after the post ID.

If you don’t need the comments, just set include_comments to False; it speeds up the scraper significantly.

Make sure you have created an app on Reddit and obtained your client ID and client secret.
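One way to supply those credentials to the scraper is to keep them in a small JSON file and load it at startup. This is just a sketch of that idea; the file name and key names here are assumptions chosen to match the auth dict used in the scraper above:

```python
import json

def load_auth(path="auth.json"):
    """Load Reddit app credentials from a JSON file.

    Expected keys (matching the scraper's `auth` dict):
    user_agent, client_id, secret_id.
    """
    with open(path) as fp:
        auth = json.load(fp)
    missing = {"user_agent", "client_id", "secret_id"} - auth.keys()
    if missing:
        raise KeyError("auth file is missing keys: %s" % sorted(missing))
    return auth
```

Keeping the credentials out of the source code also makes it safe to publish the bot's repository.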

Content-based recommendation engine

So, we have the subreddit posts as a list of JSON files containing all the information we need. Now we need a mechanism to search through those files and extract similar posts to recommend to the user.

In RSPBot we convert the text documents to a matrix of token occurrences and then use kernels as similarity measures. Concretely, we use scikit-learn’s HashingVectorizer class together with the linear_kernel function to compute a similarity matrix.

Once the list of post titles has been transformed into a numeric matrix, we can save that matrix to a binary file; when RSPBot starts, it simply loads the file and uses it to find similar posts. This improves performance: instead of re-parsing the whole list of JSON files to rebuild the matrix, we just reuse the one from the previous run.

Make sure you have a basic understanding of how HashingVectorizer and linear_kernel work (check out the scikit-learn tutorial on basic concepts) before moving on to the module source code.
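The core of the approach can be sketched in a few lines. This is not the RSPBot source; the sample titles, file path, and vectorizer parameters are illustrative assumptions:

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical sample titles; in RSPBot these come from the scraped JSON files.
titles = [
    "Best small business bank account?",
    "Bank recommendations for small business.",
    "How do I register an LLC?",
]

vectorizer = HashingVectorizer(n_features=2 ** 18, stop_words="english")
matrix = vectorizer.transform(titles)  # sparse (n_titles, n_features)

# Persist the matrix so later runs can skip re-parsing the JSON files.
sparse.save_npz("/tmp/titles_matrix.npz", matrix)
matrix = sparse.load_npz("/tmp/titles_matrix.npz")

# Score a new post title against the stored titles.
query = vectorizer.transform(["Which bank should my small business use?"])
scores = linear_kernel(query, matrix).ravel()  # one similarity score per title
best = scores.argsort()[::-1]                  # most similar titles first
```

Because HashingVectorizer is stateless (it hashes tokens instead of learning a vocabulary), new titles can be transformed later without refitting, and with the default L2 normalization linear_kernel effectively yields cosine similarity.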

Monitoring Reddit for new posts

Now that we have a set of post titles as JSON files and a mechanism for matching similar posts, the next step is to make the bot listen for new submissions, fetch related posts if any exist, and reply with a list of suggested topics.

With PRAW, monitoring for new submissions is as easy as writing a for loop:


subreddit = reddit.subreddit('AskReddit')

for submission in subreddit.stream.submissions():
    # do something with submission
    ...

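Putting the stream together with the similarity search, the reply logic can be sketched as follows. The similarity threshold, the number of suggestions, and the reply format below are illustrative choices, not values taken from the RSPBot source:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.metrics.pairwise import linear_kernel

vectorizer = HashingVectorizer(n_features=2 ** 18, stop_words="english")

def find_related(title, titles, matrix, top_n=5, min_score=0.3):
    """Return up to top_n stored titles similar to a new submission title."""
    scores = linear_kernel(vectorizer.transform([title]), matrix).ravel()
    ranked = scores.argsort()[::-1][:top_n]
    return [titles[i] for i in ranked if scores[i] >= min_score]

def format_reply(related):
    # Build a simple markdown bullet list for the bot's comment.
    lines = ["Here are some related posts you may find useful:", ""]
    lines += ["* " + t for t in related]
    return "\n".join(lines)

# The actual bot loop (requires valid credentials and a praw.Reddit instance):
# for submission in reddit.subreddit("smallbusiness").stream.submissions():
#     related = find_related(submission.title, titles, matrix)
#     if related:
#         submission.reply(format_reply(related))
```

In a real deployment you would also want to skip the bot's own submissions and remember which posts it has already replied to, per the bot etiquette guidelines mentioned earlier.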
Let’s summarize the bot’s whole “life” as a list of steps:

  1. Scrape the subreddit into a list of JSON files that serves as the related-post suggestion database.
  2. Convert the list of JSON files into a matrix of token occurrences for finding related posts. This step is performed only the first time the bot starts; later, we just reuse the existing matrix saved as a binary file.
  3. Monitor for new submissions; when one arrives, convert it to a matrix of token occurrences and combine it with the “old” matrix; search for related posts, and if any are found, display the list of titles with URLs.
  4. Profit! 🙂