How to scrape Reddit comments at scale

Practical approaches to extracting Reddit comment data — from API methods to full-thread scraping — with format examples and rate limit workarounds.

Reddit comments are where the real signal lives. Product opinions, technical troubleshooting, policy debates, and market sentiment all surface in comment threads — often with more nuance and specificity than the posts themselves. For teams running NLP pipelines, building sentiment datasets, or monitoring brand perception, comment data is frequently the primary extraction target.

Scraping Reddit comments introduces challenges that post-level extraction does not: nested reply trees, deep pagination, "load more" boundaries, and high data volume per thread. This guide covers how Reddit structures comment data and the practical approaches to extracting it.

How Reddit comment data is structured

Every Reddit post has a comment tree. Comments are hierarchical — each reply is a child of the comment it responds to. Reddit represents this as a nested tree where each comment node contains metadata alongside its children.

A single comment object includes these fields:

  • body — the comment text content
  • author — the username of the commenter
  • score — the comment's net score (upvotes minus downvotes)
  • created_utc — Unix timestamp of when the comment was posted
  • parent_id — the ID of the parent comment or post (prefixed with t1_ for comments, t3_ for posts)
  • depth — how many levels deep the comment is in the thread
  • is_submitter — whether the commenter is the original post author
  • distinguished — set when the comment is distinguished by a moderator or admin (null otherwise)
  • permalink — direct URL to the comment

The depth of nesting is theoretically unlimited, but Reddit's interface and API cap the initially visible tree. Deeper branches are hidden behind "load more" or "continue this thread" links, which translate to separate API calls during extraction.
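To make the shape concrete, here is what a single comment node looks like in the API's nested representation. Field names follow Reddit's t1 comment object; the values are invented for illustration:

```python
# A single comment node as it appears inside the API's Listing wrapper.
# Field names match Reddit's t1 (comment) object; values are invented.
comment = {
    "kind": "t1",
    "data": {
        "id": "abc123",
        "body": "Great write-up, thanks!",
        "author": "example_user",
        "score": 42,
        "created_utc": 1700000000,
        "parent_id": "t3_xyz789",  # t3_ prefix: the parent is the post itself
        "depth": 0,                # top-level comment
        "is_submitter": False,
        "distinguished": None,     # or "moderator" / "admin"
        "permalink": "/r/example/comments/xyz789/post_title/abc123/",
        "replies": "",             # empty string if no replies; a nested Listing otherwise
    },
}
```

Note the quirk in replies: the API returns an empty string when a comment has no children, and a full Listing object when it does, so parsers need to check the type before descending.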

Extracting comments via the API

The Reddit API endpoint for comments is /r/{subreddit}/comments/{post_id}.json. It returns a two-element array: the first element is a Listing containing the post itself, the second a Listing containing the comment tree. By default, the response includes top-level comments and a limited depth of replies.

To retrieve the full tree, you need to follow "more" objects in the response. These are stubs referencing comment IDs that were not included in the initial payload. The /api/morechildren endpoint accepts a batch of these IDs and returns the full comment objects.

Each /api/morechildren call returns a batch of full comment objects, which you then insert into the tree at the correct position using their parent_id references. For threads with thousands of comments, expanding all branches can consume a significant portion of the 100-requests-per-minute allowance.
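As a sketch of parsing that response, the walker below (a hypothetical helper, not part of any Reddit client library) separates full comment objects from the "more" stubs that need follow-up calls. It assumes the standard Listing shape, where replies is an empty string when a comment has no children:

```python
def walk_listing(children, comments=None, more_stubs=None):
    """Recursively walk a comment Listing's children, separating full
    comment objects from 'more' stubs left for /api/morechildren."""
    if comments is None:
        comments, more_stubs = [], []
    for child in children:
        if child["kind"] == "more":
            # 'more' stubs carry bare comment IDs, not comment objects
            more_stubs.extend(child["data"]["children"])
            continue
        data = child["data"]
        comments.append(data)
        replies = data.get("replies")
        if isinstance(replies, dict):  # empty string when there are no replies
            walk_listing(replies["data"]["children"], comments, more_stubs)
    return comments, more_stubs

# Tiny hand-built payload in the shape of the second Listing's children
sample = [
    {"kind": "t1", "data": {"id": "c1", "body": "top-level", "replies": {"data": {"children": [
        {"kind": "t1", "data": {"id": "c2", "body": "a reply", "replies": ""}},
        {"kind": "more", "data": {"children": ["c3", "c4"]}},
    ]}}}},
]
comments, stubs = walk_listing(sample)
```

The returned stub IDs are the work queue for expansion: each batch you fetch gets walked the same way and merged into the collected comments.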

Handling nested threads and reply depth

Large threads on popular posts can contain thousands of comments spread across dozens of nesting levels. The API returns a truncated tree by default, and expanding it requires multiple follow-up requests.

Practical considerations for deep thread extraction:

  • Each morechildren call counts against the rate limit — a thread with 5,000+ comments might require 50 or more follow-up requests
  • Wall-clock time grows with the number of "more" stubs, not just comment count — each expansion is a separate API round-trip, and rate-limit waits compound across requests
  • Deeply nested comments (depth greater than 8) often contain less analytical signal — mostly conversational back-and-forth rather than substantive opinions
  • Setting a depth limit (5 to 8 levels) usually captures the core discussion without wasting requests
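A depth cutoff is most effective when enforced during the walk itself, so branches below the limit are never entered and their stubs never trigger requests. The function below is an illustrative sketch (name and shape are assumptions, matching the Listing structure described earlier):

```python
def collect_to_depth(children, max_depth, depth=0):
    """Collect comment dicts down to max_depth nesting levels.
    Branches below the cutoff are never entered, so their 'more'
    stubs never generate follow-up morechildren requests."""
    kept, stubs = [], []
    for child in children:
        if child["kind"] == "more":
            stubs.extend(child["data"]["children"])
            continue
        data = child["data"]
        kept.append(data)
        replies = data.get("replies")
        if isinstance(replies, dict) and depth < max_depth:
            k, s = collect_to_depth(replies["data"]["children"], max_depth, depth + 1)
            kept.extend(k)
            stubs.extend(s)
    return kept, stubs

# Three-level thread: c1 (depth 0) -> c2 (depth 1) -> c3 (depth 2)
thread = [
    {"kind": "t1", "data": {"id": "c1", "replies": {"data": {"children": [
        {"kind": "t1", "data": {"id": "c2", "replies": {"data": {"children": [
            {"kind": "t1", "data": {"id": "c3", "replies": ""}},
        ]}}}},
    ]}}}},
]
kept, stubs = collect_to_depth(thread, max_depth=1)
```

With max_depth=1, c1 and c2 are kept and the walker never descends to c3 — which is exactly how a 5-to-8-level cutoff saves requests on deep threads.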

For most sentiment analysis, market research, and topic monitoring use cases, the top 5 to 6 levels of nesting contain the bulk of the useful data. Going deeper yields diminishing returns unless you need exhaustive coverage for academic research or compliance.

Pagination and "load more" boundaries

Reddit's "load more" mechanism is the biggest bottleneck for comment extraction. When a thread has more comments than the API returns in one response, the payload includes "more" objects — stubs that reference batches of hidden comment IDs.

Key limitations to plan around:

  • The morechildren endpoint accepts up to 100 comment IDs per call
  • A single "more" object can reference hundreds of IDs, requiring multiple calls to fully expand
  • Each call returns flat comment objects — you need to reconstruct the tree structure using parent_id references
  • There is no way to request a specific depth or branch of the tree directly — expansion is all-or-nothing per batch
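The batching and reconstruction constraints translate into two small helpers — a hypothetical sketch, assuming flat comment dicts that carry a parent_id field:

```python
def batch_ids(ids, batch_size=100):
    """Split stub IDs into chunks; morechildren accepts at most 100 IDs per call."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

def group_by_parent(flat_comments):
    """Index flat comment objects by parent_id ('t1_<comment>' or 't3_<post>')
    so each returned batch can be reattached at the right place in the tree."""
    by_parent = {}
    for c in flat_comments:
        by_parent.setdefault(c["parent_id"], []).append(c)
    return by_parent

batches = batch_ids([f"c{i}" for i in range(250)])  # 250 stub IDs -> 3 calls
index = group_by_parent([
    {"id": "aa", "parent_id": "t3_post1"},
    {"id": "bb", "parent_id": "t1_aa"},
    {"id": "cc", "parent_id": "t1_aa"},
])
```

A parent_id index makes reattachment a lookup rather than a tree search, which matters when a large thread produces dozens of flat batches.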

For threads with 10,000+ comments, full expansion can take several minutes of API time. Planning your rate budget per thread — and deciding which threads warrant full expansion — is an important part of collection design.
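A back-of-envelope budget helps decide which threads warrant full expansion. The sketch below assumes the 100-ID batch size and the 100-requests-per-minute allowance cited above; confirm both figures for your API tier:

```python
import math

REQUESTS_PER_MINUTE = 100  # free-tier allowance (assumption; check your tier)
BATCH_SIZE = 100           # max IDs per morechildren call

def expansion_budget(num_stub_ids):
    """Rough request count and wall-clock lower bound for fully
    expanding a thread's hidden comments."""
    calls = math.ceil(num_stub_ids / BATCH_SIZE)
    minutes = calls / REQUESTS_PER_MINUTE
    return calls, minutes

calls, minutes = expansion_budget(5000)  # 5,000 hidden IDs in one thread
```

The estimate is a lower bound: it ignores the initial thread fetch, retries, and the fact that expanded batches can themselves reveal new "more" stubs.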

Output format and field mapping

A well-structured comment dataset includes enough context to reconstruct thread relationships without needing to re-query Reddit. At minimum, each comment record should include:

  • comment_id — unique identifier
  • post_id — the parent post this comment belongs to
  • parent_id — direct parent (post or another comment)
  • depth — nesting level in the thread
  • author, body, score, created_utc — core content and metadata
  • subreddit — for multi-subreddit datasets

Flattening the tree into tabular format (CSV, Parquet) requires the parent_id and depth columns so downstream consumers can reconstruct the hierarchy if needed. JSON and JSONL preserve nesting natively but can be harder to query at scale in SQL-based pipelines.
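As a minimal sketch of that mapping — assuming comments have already been collected into a flat list of dicts carrying the API's field names — the schema projection might look like:

```python
import csv

# Target flat schema; parent_id and depth preserve the thread hierarchy
FIELDS = ["comment_id", "post_id", "parent_id", "depth",
          "author", "body", "score", "created_utc", "subreddit"]

def to_rows(comments, post_id):
    """Map raw comment dicts onto the flat schema above."""
    for c in comments:
        yield {
            "comment_id": c["id"],
            "post_id": post_id,
            "parent_id": c["parent_id"],
            "depth": c.get("depth", 0),
            "author": c.get("author"),
            "body": c.get("body"),
            "score": c.get("score"),
            "created_utc": c.get("created_utc"),
            "subreddit": c.get("subreddit"),
        }

def write_csv(path, comments, post_id):
    """Write one CSV per thread in the flat schema."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(to_rows(comments, post_id))

rows = list(to_rows(
    [{"id": "c1", "parent_id": "t3_xyz789", "depth": 2, "author": "u",
      "body": "b", "score": 3, "created_utc": 1700000000, "subreddit": "example"}],
    "xyz789",
))
```

The same row dicts load directly into a pandas DataFrame or a Parquet writer; only the final serialization step differs.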

For many NLP and sentiment analysis workloads, flat tabular output with a depth column is the most practical format — it loads directly into pandas, Spark, or BigQuery without custom parsing. For a broader look at extraction methods and how comments fit into the full Reddit data picture, see our guide on how to scrape Reddit.

Scaling comment extraction

Extracting comments from a single thread is straightforward. Extracting comments across thousands of threads — continuously, with full depth, and in a stable format — is an infrastructure problem. At scale, you are managing:

  • Rate limit allocation across hundreds or thousands of threads
  • Tree expansion scheduling with retry logic for failed branches
  • Deduplication when threads are re-scraped for updates
  • Format normalization across threads with different structures
  • Storage and delivery logistics for high-volume output
  • Monitoring for changes in Reddit's API behavior or response structure

This is where managed Reddit data extraction becomes practical. The extraction pipeline handles rate management, tree expansion, format normalization, and delivery scheduling — while your team works with clean comment datasets in the format your stack requires. Data ships to your cloud bucket in JSON, CSV, Parquet, or any custom schema.