How to scrape Reddit comments at scale

Practical approaches to extracting Reddit comment data — from API methods to full-thread scraping — with format examples and rate limit workarounds.

Reddit comments are where the real signal lives. Product opinions, technical troubleshooting, policy debates, and market sentiment all surface in comment threads — often with more nuance and specificity than the posts themselves. For teams running NLP pipelines, building sentiment datasets, or monitoring brand perception, comment data is frequently the primary extraction target.

Scraping Reddit comments introduces challenges that post-level extraction does not: nested reply trees, deep pagination, "load more" boundaries, and high data volume per thread. This guide covers how Reddit structures comment data and the practical approaches to extracting it.

How Reddit comment data is structured

Every Reddit post has a comment tree. Comments are hierarchical — each reply is a child of the comment it responds to. Reddit represents this as a nested tree where each comment node contains metadata alongside its children.

A single comment object includes these fields:

  • body — the comment text content
  • author — the username of the commenter
  • score — the comment's net score (upvotes minus downvotes)
  • created_utc — Unix timestamp of when the comment was posted
  • parent_id — the ID of the parent comment or post (prefixed with t1_ for comments, t3_ for posts)
  • depth — how many levels deep the comment is in the thread
  • is_submitter — whether the commenter is the original post author
  • distinguished — set when the comment is distinguished by a moderator or admin (null otherwise)
  • permalink — direct URL to the comment

The depth of nesting is theoretically unlimited, but Reddit's interface and API cap the initially visible tree. Deeper branches are hidden behind "load more" or "continue this thread" links, which translate to separate API calls during extraction.
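To make the shape concrete, here is what a single comment node looks like in the API's nested representation. Field names follow Reddit's t1 comment object; the values are invented for illustration:

```python
# A single comment node as it appears inside the API's Listing wrapper.
# Field names match Reddit's t1 (comment) object; values are invented.
comment = {
    "kind": "t1",
    "data": {
        "id": "abc123",
        "body": "Great write-up, thanks!",
        "author": "example_user",
        "score": 42,
        "created_utc": 1700000000,
        "parent_id": "t3_xyz789",  # t3_ prefix: the parent is the post itself
        "depth": 0,                # top-level comment
        "is_submitter": False,
        "distinguished": None,     # or "moderator" / "admin"
        "permalink": "/r/example/comments/xyz789/post_title/abc123/",
        "replies": "",             # empty string if no replies; a nested Listing otherwise
    },
}
```

Note the quirk in replies: the API returns an empty string when a comment has no children, and a full Listing object when it does, so parsers need to check the type before descending.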

Extracting comments via the API

The Reddit API endpoint for comments is /r/{subreddit}/comments/{post_id}.json. It returns a two-element array: the first element is a Listing containing the post itself, the second a Listing containing the comment tree. By default, the response includes top-level comments and a limited depth of replies.

To retrieve the full tree, you need to follow "more" objects in the response. These are stubs referencing comment IDs that were not included in the initial payload. The /api/morechildren endpoint accepts a batch of these IDs and returns the full comment objects.

Each /api/morechildren call returns a batch of full comment objects, which you then insert into the tree at the correct position using their parent_id references. For threads with thousands of comments, expanding all branches can consume a significant portion of the 100-requests-per-minute allowance.
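As a sketch of parsing that response, the walker below (a hypothetical helper, not part of any Reddit client library) separates full comment objects from the "more" stubs that need follow-up calls. It assumes the standard Listing shape, where replies is an empty string when a comment has no children:

```python
def walk_listing(children, comments=None, more_stubs=None):
    """Recursively walk a comment Listing's children, separating full
    comment objects from 'more' stubs left for /api/morechildren."""
    if comments is None:
        comments, more_stubs = [], []
    for child in children:
        if child["kind"] == "more":
            # 'more' stubs carry bare comment IDs, not comment objects
            more_stubs.extend(child["data"]["children"])
            continue
        data = child["data"]
        comments.append(data)
        replies = data.get("replies")
        if isinstance(replies, dict):  # empty string when there are no replies
            walk_listing(replies["data"]["children"], comments, more_stubs)
    return comments, more_stubs

# Tiny hand-built payload in the shape of the second Listing's children
sample = [
    {"kind": "t1", "data": {"id": "c1", "body": "top-level", "replies": {"data": {"children": [
        {"kind": "t1", "data": {"id": "c2", "body": "a reply", "replies": ""}},
        {"kind": "more", "data": {"children": ["c3", "c4"]}},
    ]}}}},
]
comments, stubs = walk_listing(sample)
```

The returned stub IDs are the work queue for expansion: each batch you fetch gets walked the same way and merged into the collected comments.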

Handling nested threads and reply depth

Large threads on popular posts can contain thousands of comments spread across dozens of nesting levels. The API returns a truncated tree by default, and expanding it requires multiple follow-up requests.

Practical considerations for deep thread extraction:

  • Each morechildren call counts against the rate limit — a thread with 5,000+ comments might require 50 or more follow-up requests
  • Wall-clock time grows with the number of "more" stubs, not just comment count — each expansion is a separate API round-trip, and rate-limit waits compound across requests
  • Deeply nested comments (depth greater than 8) often contain less analytical signal — mostly conversational back-and-forth rather than substantive opinions
  • Setting a depth limit (5 to 8 levels) usually captures the core discussion without wasting requests
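A depth cutoff is most effective when enforced during the walk itself, so branches below the limit are never entered and their stubs never trigger requests. The function below is an illustrative sketch (name and shape are assumptions, matching the Listing structure described earlier):

```python
def collect_to_depth(children, max_depth, depth=0):
    """Collect comment dicts down to max_depth nesting levels.
    Branches below the cutoff are never entered, so their 'more'
    stubs never generate follow-up morechildren requests."""
    kept, stubs = [], []
    for child in children:
        if child["kind"] == "more":
            stubs.extend(child["data"]["children"])
            continue
        data = child["data"]
        kept.append(data)
        replies = data.get("replies")
        if isinstance(replies, dict) and depth < max_depth:
            k, s = collect_to_depth(replies["data"]["children"], max_depth, depth + 1)
            kept.extend(k)
            stubs.extend(s)
    return kept, stubs

# Three-level thread: c1 (depth 0) -> c2 (depth 1) -> c3 (depth 2)
thread = [
    {"kind": "t1", "data": {"id": "c1", "replies": {"data": {"children": [
        {"kind": "t1", "data": {"id": "c2", "replies": {"data": {"children": [
            {"kind": "t1", "data": {"id": "c3", "replies": ""}},
        ]}}}},
    ]}}}},
]
kept, stubs = collect_to_depth(thread, max_depth=1)
```

With max_depth=1, c1 and c2 are kept and the walker never descends to c3 — which is exactly how a 5-to-8-level cutoff saves requests on deep threads.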

For most sentiment analysis, market research, and topic monitoring use cases, the top 5 to 6 levels of nesting contain the bulk of the useful data. Going deeper yields diminishing returns unless you need exhaustive coverage for academic research or compliance.

Pagination and "load more" boundaries

Reddit's "load more" mechanism is the biggest bottleneck for comment extraction. When a thread has more comments than the API returns in one response, the payload includes "more" objects — stubs that reference batches of hidden comment IDs.

Key limitations to plan around:

  • The morechildren endpoint accepts up to 100 comment IDs per call
  • A single "more" object can reference hundreds of IDs, requiring multiple calls to fully expand
  • Each call returns flat comment objects — you need to reconstruct the tree structure using parent_id references
  • There is no way to request a specific depth or branch of the tree directly — expansion is all-or-nothing per batch
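The batching and reconstruction constraints translate into two small helpers — a hypothetical sketch, assuming flat comment dicts that carry a parent_id field:

```python
def batch_ids(ids, batch_size=100):
    """Split stub IDs into chunks; morechildren accepts at most 100 IDs per call."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

def group_by_parent(flat_comments):
    """Index flat comment objects by parent_id ('t1_<comment>' or 't3_<post>')
    so each returned batch can be reattached at the right place in the tree."""
    by_parent = {}
    for c in flat_comments:
        by_parent.setdefault(c["parent_id"], []).append(c)
    return by_parent

batches = batch_ids([f"c{i}" for i in range(250)])  # 250 stub IDs -> 3 calls
index = group_by_parent([
    {"id": "aa", "parent_id": "t3_post1"},
    {"id": "bb", "parent_id": "t1_aa"},
    {"id": "cc", "parent_id": "t1_aa"},
])
```

A parent_id index makes reattachment a lookup rather than a tree search, which matters when a large thread produces dozens of flat batches.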

For threads with 10,000+ comments, full expansion can take several minutes of API time. Planning your rate budget per thread — and deciding which threads warrant full expansion — is an important part of collection design.
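A back-of-envelope budget helps decide which threads warrant full expansion. The sketch below assumes the 100-ID batch size and the 100-requests-per-minute allowance cited above; confirm both figures for your API tier:

```python
import math

REQUESTS_PER_MINUTE = 100  # free-tier allowance (assumption; check your tier)
BATCH_SIZE = 100           # max IDs per morechildren call

def expansion_budget(num_stub_ids):
    """Rough request count and wall-clock lower bound for fully
    expanding a thread's hidden comments."""
    calls = math.ceil(num_stub_ids / BATCH_SIZE)
    minutes = calls / REQUESTS_PER_MINUTE
    return calls, minutes

calls, minutes = expansion_budget(5000)  # 5,000 hidden IDs in one thread
```

The estimate is a lower bound: it ignores the initial thread fetch, retries, and the fact that expanded batches can themselves reveal new "more" stubs.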

Output format and field mapping

A well-structured comment dataset includes enough context to reconstruct thread relationships without needing to re-query Reddit. At minimum, each comment record should include:

  • comment_id — unique identifier
  • post_id — the parent post this comment belongs to
  • parent_id — direct parent (post or another comment)
  • depth — nesting level in the thread
  • author, body, score, created_utc — core content and metadata
  • subreddit — for multi-subreddit datasets

Flattening the tree into tabular format (CSV, Parquet) requires the parent_id and depth columns so downstream consumers can reconstruct the hierarchy if needed. JSON and JSONL preserve nesting natively but can be harder to query at scale in SQL-based pipelines.
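As a minimal sketch of that mapping — assuming comments have already been collected into a flat list of dicts carrying the API's field names — the schema projection might look like:

```python
import csv

# Target flat schema; parent_id and depth preserve the thread hierarchy
FIELDS = ["comment_id", "post_id", "parent_id", "depth",
          "author", "body", "score", "created_utc", "subreddit"]

def to_rows(comments, post_id):
    """Map raw comment dicts onto the flat schema above."""
    for c in comments:
        yield {
            "comment_id": c["id"],
            "post_id": post_id,
            "parent_id": c["parent_id"],
            "depth": c.get("depth", 0),
            "author": c.get("author"),
            "body": c.get("body"),
            "score": c.get("score"),
            "created_utc": c.get("created_utc"),
            "subreddit": c.get("subreddit"),
        }

def write_csv(path, comments, post_id):
    """Write one CSV per thread in the flat schema."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(to_rows(comments, post_id))

rows = list(to_rows(
    [{"id": "c1", "parent_id": "t3_xyz789", "depth": 2, "author": "u",
      "body": "b", "score": 3, "created_utc": 1700000000, "subreddit": "example"}],
    "xyz789",
))
```

The same row dicts load directly into a pandas DataFrame or a Parquet writer; only the final serialization step differs.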

For many NLP and sentiment analysis workloads, flat tabular output with a depth column is the most practical format — it loads directly into pandas, Spark, or BigQuery without custom parsing. For a broader look at extraction methods and how comments fit into the full Reddit data picture, see our guide on how to scrape Reddit.

Scaling comment extraction

Extracting comments from a single thread is straightforward. Extracting comments across thousands of threads — continuously, with full depth, and in a stable format — is an infrastructure problem. At scale, you are managing:

  • Rate limit allocation across hundreds or thousands of threads
  • Tree expansion scheduling with retry logic for failed branches
  • Deduplication when threads are re-scraped for updates
  • Format normalization across threads with different structures
  • Storage and delivery logistics for high-volume output
  • Monitoring for changes in Reddit's API behavior or response structure

This is where managed Reddit data extraction becomes practical. The extraction pipeline handles rate management, tree expansion, format normalization, and delivery scheduling — while your team works with clean comment datasets in the format your stack requires. Data ships to your cloud bucket in JSON, CSV, Parquet, or any custom schema.