How to scrape Reddit posts and threads.

Methods for extracting Reddit post data including titles, scores, content, and metadata — with practical approaches for bulk collection.

Reddit posts are the primary unit of content on the platform. Each post has a title, optional body text or link, community votes, timestamps, and metadata about the author and subreddit. For data teams building topic monitoring, research datasets, or content intelligence pipelines, post-level data is often the starting point before drilling into comment-level extraction.

This guide covers the structure of Reddit post data, practical extraction methods, and approaches for bulk collection across multiple subreddits.

Reddit post data structure

A Reddit post (called a "submission" in the API) contains a standard set of fields. The most commonly extracted ones include:

  • id — unique post identifier (e.g., "abc123")
  • title — post title text
  • selftext — body content for text posts (empty for link posts)
  • url — linked URL for link posts, or the post's own URL for self-posts
  • score — net score (upvotes minus downvotes)
  • num_comments — total comment count
  • author — username of the poster
  • subreddit — community name
  • created_utc — Unix timestamp of creation
  • permalink — relative URL path to the post

Additional metadata includes link_flair_text (subreddit-specific category tags), over_18 (NSFW flag), locked and stickied status, upvote_ratio, and domain (for link posts). Not all fields are present on every post — link posts lack selftext, and some posts may have null author if the account was deleted.
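The field list above maps directly onto the JSON Reddit returns for each submission. As a minimal sketch (the helper name and defaults are illustrative, not part of any library), a normalizer that tolerates the missing-field cases just described might look like:

```python
# Sketch: pull the commonly used fields out of a raw submission dict
# (the "data" object inside each listing child). Field names match
# Reddit's API; the helper and its defaults are illustrative.

def normalize_post(data: dict) -> dict:
    """Pick out the standard fields, tolerating missing ones."""
    return {
        "id": data.get("id"),
        "title": data.get("title", ""),
        "selftext": data.get("selftext", ""),  # empty for link posts
        "url": data.get("url"),
        "score": data.get("score", 0),
        "num_comments": data.get("num_comments", 0),
        "author": data.get("author"),          # may be absent if deleted
        "subreddit": data.get("subreddit"),
        "created_utc": data.get("created_utc"),
        "permalink": data.get("permalink"),
        "flair": data.get("link_flair_text"),
        "nsfw": data.get("over_18", False),
    }
```

Using `.get()` with defaults keeps downstream code from crashing on link posts (no selftext) or deleted-author posts.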

Fetching posts from a subreddit

The most common entry point for post extraction is the subreddit listing. Reddit offers several sort modes, each returning a different slice of the subreddit's content:

  • hot — posts ranked by a combination of recency and score (default view)
  • new — posts sorted by submission time, newest first
  • top — highest-scored posts within a time window (hour, day, week, month, year, all)
  • rising — posts gaining traction quickly
  • controversial — posts with a high volume of both upvotes and downvotes

Via the API, each request returns up to 100 posts. Pagination uses an "after" token — the fullname identifier of the last post in the current batch (formatted as t3_{id}) — to fetch the next page. For a broader look at extraction methods, see our guide on how to scrape Reddit.
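The cursor loop described above can be sketched as follows. The page-fetching function is injected so the pagination logic stands alone; in practice it would GET a URL like `https://www.reddit.com/r/{sub}/new.json?limit=100&after={cursor}` with a descriptive User-Agent. The function names here are illustrative, not an official client API.

```python
# Sketch of cursor pagination over a subreddit listing. fetch_page(after)
# is assumed to return the parsed JSON listing for one page; injecting it
# keeps this loop testable without the network.

def paginate_listing(fetch_page, max_posts=1000):
    """Yield submission dicts, following the 'after' fullname cursor."""
    after = None
    seen = 0
    while seen < max_posts:
        listing = fetch_page(after)            # parsed JSON for one page
        children = listing["data"]["children"]
        if not children:
            break
        for child in children:
            yield child["data"]                # the submission fields
            seen += 1
            if seen >= max_posts:
                break
        after = listing["data"]["after"]       # e.g. "t3_abc123"
        if after is None:                      # no more pages
            break
```

The loop stops either at `max_posts` or when Reddit returns a null `after` cursor, which is how the listing signals its end.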

Sorting, filtering, and pagination

Reddit's listings are capped at approximately 1,000 posts per sort mode, regardless of how many posts exist in the subreddit. This is an API-level constraint that affects all extraction methods, including direct API calls, web scraping, and .json endpoints.

To collect more than 1,000 posts from a single subreddit, you can:

  • Combine multiple sort modes — hot, new, top (across different time filters), rising, and controversial each return their own 1,000-item window
  • Use Reddit's search endpoint with time-restricted queries to page through historical data in date-range batches
  • Run regular scheduled pulls of the new listing to accumulate posts incrementally over time
  • Deduplicate across sort modes by post ID, since the same post can appear in multiple listings

For historical datasets spanning months or years, a note of caution: Reddit's search endpoint historically supported CloudSearch-syntax timestamp filters (timestamp:start..end) for paging through date-range batches, but Reddit deprecated that syntax, and it can no longer be relied on. In practice, the most dependable path to deep history is accumulating posts forward in time via scheduled pulls of the new listing, supplemented by time-filtered top listings for older high-signal content.
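Since the same post can surface in hot, new, and top simultaneously, the dedup-by-ID step from the list above is worth making explicit. A minimal sketch (the function name is illustrative):

```python
# Sketch: merge several sort-mode listings into one deduplicated set,
# keyed on post id. Each input is any iterable of post dicts, e.g. the
# results of separate hot/new/top pulls.

def merge_listings(*listings):
    merged = {}
    for listing in listings:
        for post in listing:
            merged.setdefault(post["id"], post)  # first sighting wins
    return list(merged.values())
```

Keeping the first sighting is arbitrary; if you care about the freshest score and comment counts, overwrite instead of `setdefault`, or keep the record with the latest fetch timestamp.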

Bulk collection across subreddits

Single-subreddit extraction is straightforward. Multi-subreddit collection introduces coordination challenges:

  • Rate limit budget must be shared across all target subreddits
  • Different subreddits have different posting volumes — a subreddit with 10 posts per day needs a different refresh schedule than one with 1,000
  • Deduplication is needed when posts are cross-posted across multiple subreddits
  • Output needs consistent formatting even when subreddits have different flair conventions and post types
  • Error handling per subreddit — one private or banned subreddit should not halt the entire collection pipeline

A practical approach is to maintain a subreddit target list with per-subreddit scheduling and rate allocation. Small, slow subreddits get checked less frequently. Large, active subreddits get more of the rate budget and shorter refresh intervals.
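The per-subreddit schedule described above can be sketched with a simple due-check. The subreddit names and intervals here are made-up placeholders; the point is that each target carries its own refresh interval rather than sharing one global cadence:

```python
import time

# Sketch of a per-subreddit refresh schedule: each target records its
# own interval and last run, and due_targets() returns the ones whose
# interval has elapsed. Names and intervals are illustrative.

TARGETS = {
    "bigactivesub": {"interval_s": 900, "last_run": 0.0},     # 15 min
    "quietnichesub": {"interval_s": 21600, "last_run": 0.0},  # 6 hours
}

def due_targets(targets, now=None):
    now = time.time() if now is None else now
    return [name for name, t in targets.items()
            if now - t["last_run"] >= t["interval_s"]]

def mark_run(targets, name, now=None):
    targets[name]["last_run"] = time.time() if now is None else now
```

A driver loop would call `due_targets`, fetch each due subreddit inside a try/except so one failure cannot halt the pipeline, then `mark_run` it, which matches the error-isolation point in the list above.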

Storing and formatting post data

Reddit post data fits well in both tabular and document formats. For tabular output (CSV, Parquet), each post becomes a row with consistent columns. Variable-length fields like selftext need care in CSV — embedded newlines and commas break naive writers, so truncate or quote them properly — while the shorter metadata fields map cleanly to columns.

JSONL (one JSON object per line) is well-suited for incremental collection — each line is a self-contained record, which simplifies appending new batches and streaming ingestion into data warehouses. It also handles nested fields (media_metadata, flair objects) more naturally than flat CSV.

For bulk post collection at production scale — across many subreddits, with full metadata, on a recurring schedule — a Reddit post extraction service handles the orchestration, rate management, deduplication, and delivery pipeline. Data ships to your cloud bucket in JSON, CSV, Parquet, or any format your analytics stack requires.