The Threshing Floor

What is Thresh?

Every day, millions of people talk to each other on Reddit. They ask for help. They share what scares them. They argue about what matters. Most of it disappears into the scroll.

Thresh pulls it out.

Point it at any subreddit. Tell it what you're looking for. It hands you a clean dataset with a complete record of how it was collected. No accounts, no API keys, no code. Everything runs in your browser. Everything stays on your machine.

The threshing floor is among the oldest inventions in human history. High ground, flat stone, wind. The place where you brought the harvest and beat it until the grain fell free. Every civilization built one. Covenants were sealed on threshing floors. Altars were raised on them. Prophets used them to talk about judgment. They were never passive places. They were where you did the work of separation. The flood is information now, and the grain is what people are actually saying to each other when they think no one important is listening.

Quick Start

Enter a subreddit name and start collecting data in seconds. No API keys, no code, no setup.

How It Works

1 Thresh — Enter a subreddit and collection parameters

2 Harvest — Browse and search your collected data

3 Winnow — Analyze with word frequency charts and optional AI

4 Glean — Export data with provenance, or generate an AI research report

Recent Collections

No collections yet. Start by threshing a subreddit.

Who Uses This — And How

Public Health Researcher

"What are people in r/mentalhealth talking about this month?"

Thresh: r/mentalhealth • Top • Past month • 100 posts Harvest: Sort by score to find what resonates most Winnow: Run Identify themes to map dominant concerns Glean: Export CSV with anonymized usernames for IRB-ready analysis

Journalist

"What questions are people asking in r/personalfinance about student loans?"

Thresh: r/personalfinance • Top • Past week • keyword: "student loans" Harvest: Sort by num_comments for the biggest conversations Winnow: Run Extract questions to find what people need answered Glean: Provenance.txt gives your editor a transparent methodology section

Graduate Student

"I need to compare discourse in r/science vs. r/conspiracy for my thesis."

Thresh: science, conspiracy • Top • Past year • keyword: "vaccine" Harvest: Compare upvote_ratio to see consensus vs. division Winnow: Run Sentiment analysis on each, then a Custom prompt comparing tone Glean: Two exports, each with its own provenance — cite both in your methods section

Community Organizer

"What are residents saying in our city's subreddit about the new transit plan?"

Thresh: r/yourcity • New • Past month • keyword: "transit" Harvest: Enable comments to hear the full conversation, not just headlines Winnow: Run Summarize discussion to distill what people actually want Glean: JSON export feeds directly into your own tools or dashboards

Subreddit

Enter the subreddit name without the r/ prefix. You can enter multiple separated by commas.

Sort By

Top = highest-scored posts (best for research). Hot = trending now. New = most recent. Rising = gaining momentum. Controversial = most divided (low upvote_ratio).

Time Filter

Controls which posts are eligible. Past week is a good default. Use Past year or All time for broader studies.

Max Posts

Posts per subreddit. Start with 25 to preview, then collect 100 for full analysis.

Keyword Filter

Searches post titles. Leave empty to collect everything matching your sort/time settings.

Include top-level comments (slower)

Collects the top comments on each post (depth 2, up to 50 per post). Essential for discourse analysis; skip for headline-level surveys.

Collection

No Data Yet

Collect some Reddit data first, then come back here to explore it.

Understanding Your Data — What Each Field Means

score — Net votes (upvotes minus downvotes). High score = community resonance. A post with 500 score was upvoted roughly 500 more times than it was downvoted.

upvote_ratio — Fraction of votes that were upvotes (0.0 to 1.0). A ratio of 0.95 means near-unanimous approval. Below 0.60 means the post is divisive — this is what "Controversial" sort finds.

num_comments — Total comments on the post. High comments + low score often means debate. High comments + high score means broad engagement.

created_utc — When the post was made (UTC timestamp). Exported as both Unix timestamp and ISO date for your analysis tools.

is_self — True if the post is text (a "self-post"). False if it's a link to an external site. Self-posts contain the author's own writing in the selftext field.

selftext — The body text of a self-post. This is the primary content for text analysis, sentiment, and theme extraction.

link_flair_text — Category label set by subreddit moderators (e.g., "Discussion", "News", "Vent"). Useful for filtering by post type in your analysis.

author — Reddit username. Anonymized by default in exports to protect privacy. Shown here for browsing context.

domain — Source domain for link posts (e.g., "nytimes.com") or "self.subreddit" for text posts. Useful for tracking which sources a community shares.

Title	Author	Score	Comments	Date

Nothing to Export

Collect some data first, then return here to export it.

Collection

Format

CSV opens in Excel, Google Sheets, SPSS, and R. JSON is better for Python, JavaScript, or feeding into other tools.

Anonymize usernames

Replaces usernames with user_a1b2c3 hashes. Recommended for any published or shared work. Your choice is documented in provenance.txt.

What's in the Export ZIP?

posts.csv (or .json) — One row per post. Fields: id, subreddit, title, author, selftext, score, upvote_ratio, num_comments, created_utc, created_date, url, permalink, is_self, flair, domain.

comments.csv (if collected) — One row per comment. Fields: id, post_id, author, body, score, created_utc, created_date, depth, parent_id.

provenance.txt — Complete methodology record: tool version, subreddit(s), sort method, time filter, keyword, limits, post/comment counts, timestamp, anonymization setting, data source, known limitations, and ethical use guidance. Cite this directly in your methods section.

Want more than raw data? Scroll down to the AI Research Report section to generate a complete Introduction / Methods / Results / Discussion document from your collection. It combines everything: your metadata, statistics, word frequencies, and post content into a structured report you can build on.

AI Research Report

Requires API Key

Generate a complete, downloadable research report from your collection. Claude reads your data, computes patterns, and writes a structured Introduction / Methods / Results / Discussion report that you can use as a starting point for your own analysis, a briefing for your editor, or a draft methods section for your thesis.

How it works:

Answer two questions below so Claude can frame the report around your specific research interest.
Click "Generate Report" and Claude will aggregate your collection metadata, summary statistics, word frequencies, and post content into a structured report.
Download the result as a Markdown file. The report includes full provenance so anyone reading it knows exactly how the data was gathered.

What is your research question?

This frames the Introduction and Discussion. Be as specific as you can. A good research question names the community, the topic, and the timeframe.

Who is this report for?

Controls the tone and framing. Academic reports cite limitations and methods rigorously. Journalism reports lead with findings.

Brief context (optional)

Anything that helps Claude understand why you collected this data and what you plan to do with the findings.

No Data to Analyze Yet

Head over to Thresh to collect posts from a subreddit. Once you have a collection, come back here to explore word patterns and run AI analysis.

Built-in Analysis (No API Key Needed)

These tools run entirely in your browser at no cost:

Word Frequency Table — See the top 20 most common words across all post titles and bodies. Common stopwords ("the", "and", "is") are filtered out. Copyable with one click. Tells you what the community is literally talking about. Sortable Data Table — On the Harvest page, sort by any column. Sort by score for resonance, num_comments for engagement, or date for recency. Summary Statistics — Post count, average score, average comments, and date range are calculated automatically for every collection.

Claude AI Analysis (Requires API Key)

Claude reads your collected posts and produces structured research analysis. It goes beyond counting words — it understands meaning, groups ideas, and identifies patterns a human analyst would find.

Identify Themes — Groups posts into thematic clusters and names each theme. Perfect for a first pass on unfamiliar data. Example output: "Theme 1: Access to Care (23 posts) — Users describe long wait times, insurance denials…" Sentiment Analysis — Classifies overall tone (positive, negative, neutral, mixed) and identifies emotional patterns. Useful for tracking how a community feels about a topic over time. Summarize Discussion — Produces a structured summary: main points, areas of agreement, areas of disagreement, and standout observations. Good for briefings and literature reviews. Extract Questions — Pulls out the questions people are asking, categorized by topic. Reveals information gaps and unmet needs — invaluable for journalists and service providers. Custom Prompt — Ask Claude anything about your data. Your prompt is combined with a summary of up to 50 posts.

How to use this page:

Pick a collection from the dropdown below. The word frequency table will update immediately.
Review the word table — it shows you the 20 most common words in your data. Click "Copy" to grab it.
Optionally, run Claude AI analysis — choose an analysis type and click "Analyze with Claude." You'll need an Anthropic API key (click "API Key" to add one).
Copy or download the results. Use "Copy to Clipboard" for quick pasting, or "Save as .txt/.md" to keep a file for your records.

Collection

Selecting a collection updates the word frequency table below.

Analysis Type

Requires an Anthropic API key. The word table works without one.

Word Frequency

No API key needed

Top 20 words from all post titles and bodies, with common stopwords removed. This is the literal vocabulary of the conversation — the words people actually use when talking about this topic.

Select a collection above to see word frequencies.

A Letter from the Builder

I built this because I know what it feels like to drown.

Not in water. In data.

I spent years writing a dissertation on social media discourse during the COVID-19 pandemic. Collecting tweets. Analyzing what people were saying to each other while the world fell apart. And what I found was genuinely illuminating. The patterns were there. The fear, the misinformation, the solidarity, the grief, the stubborn and sometimes beautiful ways people tried to make sense of something that didn't make sense. It was some of the most meaningful work I've ever done.

But the process of getting there was brutal. Even for me. Even with training and tools and institutional support. It required programming. It required API credentials. It required understanding rate limits and data schemas and pagination logic and a dozen other things that had nothing to do with the actual research question I was trying to answer.

And I kept thinking: if this is this hard for me, it is impossible for most people. The teachers who sense something shifting in how their students talk about mental health online. The journalists who can feel a story forming in a subreddit but can't prove it yet. The community organizers. The grad students. They're locked out. Not because they lack the curiosity or the intellect, but because the tools weren't built for them.

So I built this.

It can do much of what my dissertation did. Smaller scale, because of Reddit's rate limits. But that's fine. Small is good, if pointed in the right place, focused with the right lens. A hundred posts from the right subreddit in the right week will tell you more about how a community thinks than a thousand posts scraped without intention.

The world right now is full of quantitative social science. AI has made it easy to count things, to run regressions, to produce charts at speed. I'm grateful for that. But this project aims at something different. It aims at breathing life back into a research method I hold dear: qualitative inquiry. The close reading. The careful listening. Not because I'm an expert in it. Exactly the opposite, really. I chose the quantitative path early in my training, and I have admired qualitative work from afar ever since, with the quiet reverence of someone who knows they took the other road. There is something mysterious in it. Something irreducible. The act of sitting with people's words and asking what do they mean rather than how many are there.

At the deepest core, my hope is that Thresh marries the two again. That it gives you the structure to collect carefully, the numbers to orient yourself, and then the space to read. Really read. What people are saying. That it empowers anyone interested in understanding public discourse to gather data more easily, more ethically, and more transparently than the old tools ever allowed.

The ultimate recursion is this: what was once my dissertation is now everybody's tool.

The waters rise. The feed is a flood. But the threshing floor is high ground, and the work of separation is ancient, and the grain is worth the labor.

With hope,

Jacob E. Thomas, MA, PhD

February 2026

What Is This?

The Threshing Floor is a free, open-source tool for collecting and exporting Reddit data. It is designed for public health researchers, journalists, civic technologists, and anyone who believes public discourse is worth measuring.

It runs entirely in your browser. There is no server, no database, no account to create. Your data stays on your machine.

How It Works

Reddit's public pages serve JSON data alongside HTML. The Threshing Floor fetches this public data through a lightweight proxy (to handle browser security restrictions), then lets you browse, filter, and export it.

No Reddit API key is required. No authentication of any kind. The data collected is limited to what any person could see by visiting Reddit in a web browser.

The workflow follows four steps. Each one builds on the last:

Thresh (Collect) — Enter a subreddit, choose your sort method and time filter, set a post limit, and optionally filter by keyword. Click Thresh. The tool gathers posts (and optionally comments) and stores them in your browser.
Harvest (Browse) — Explore your collected data in a sortable, searchable table. Click any post to read its full text and comments. Sort by score to see what resonates, by comments to see what sparks debate, or by date to follow the conversation over time.
Winnow (Analyze) — Start with the built-in word frequency table (free, instant, no API key). Then optionally run AI-powered analysis with Claude to identify themes, extract sentiment, summarize discussions, or ask custom questions about your data.
Glean (Export) — Download your data as CSV or JSON, bundled in a ZIP with a provenance.txt file that documents exactly how it was collected. Or generate an AI Research Report: answer two questions about your research, and Claude produces a complete Introduction / Methods / Results / Discussion document you can use as a starting point for your own work.

The AI Research Report

The Glean page includes an optional AI Research Report generator. It requires an Anthropic API key (the same one used for Winnow analysis).

Here's what it does: you provide a research question, choose your audience (academic, journalism, advocacy, or general), and optionally add context about why you collected this data. Claude then aggregates everything: your collection metadata, summary statistics, word frequency analysis, and the content of up to 50 posts. It writes a structured report with Introduction, Methods, Results, Discussion, and Provenance sections.

The report is a starting point, not a finished product. It gives you the structure and the language to build from. For a grad student, it's a draft methods section and results overview. For a journalist, it's a story pitch backed by data. For a community organizer, it's a briefing document with citations.

What was once a dissertation-level undertaking now takes ten minutes and a good question.

Your Data & Storage

Everything is stored in your browser only. There is no server database, no account system, and no cloud sync. Specifically:

Collections (posts, comments, configuration) are saved in your browser's localStorage.
Anthropic API key (if you use Claude AI analysis) is stored in localStorage. It is never sent to any server except Anthropic's API directly.
Rate limit state and subreddit cache are also stored locally in your browser.

This means your data does not sync across browsers or devices. If you switch browsers, your collections will not follow.

Clearing Your History

To erase all Thresh data from your browser:

Open your browser's Settings (or press Ctrl+Shift+Delete / Cmd+Shift+Delete)
Navigate to Privacy & Security → Clear browsing data
Select "Cookies and site data" (this includes localStorage)
To clear only Thresh: go to your browser's Developer Tools (F12), open the Application tab, expand Local Storage, find the Thresh site, and delete individual keys or click Clear All

This will remove all saved collections, your API key (if stored), rate limit state, and cached subreddit data. It cannot be undone.

The Rate Limit Gauge

The Rate Limit gauge at the bottom of the sidebar tracks how many requests remain in your current Reddit rate limit window. Reddit allows 100 requests per minute to its public JSON endpoints.

Gold bar — plenty of requests remaining. Normal operation.
Yellow bar (below 30%) — requests are running low. Consider pausing collection.
Red pulsing bar (below 10%) — critical. Thresh will pause automatically if the limit is reached.
Cooldown timer — if you hit the limit, a countdown appears showing when requests resume. The collect button is disabled until the cooldown expires.

The rate limit resets automatically each minute. Under normal use (25–100 posts per collection), you will rarely see it drop below gold.

Ethical Considerations

Re-identification risk: Even with usernames removed, unique writing styles or specific details in posts may allow re-identification. Consider this when publishing findings.
IRB guidance: If you are conducting academic research, consult your Institutional Review Board about whether your data collection constitutes human subjects research.
Reddit's Terms: This tool accesses publicly available data. Please review Reddit's Terms of Service and API Terms regarding data collection and use.
Consent: Reddit users post publicly, but they may not expect their posts to be aggregated and analyzed. Handle data with care and respect.
Default anonymization: Exports anonymize usernames by default. You can disable this, but consider the implications before doing so.

Provenance

Every export includes a provenance.txt file documenting exactly how the data was collected: the subreddit, sort method, time filter, number of posts, date of collection, and any filters applied. This is the seal on every bundle — it gives you the language for a methods section, a transparency report, or a replication attempt.

AI Analysis (Optional)

The Winnow page offers optional AI-powered analysis using Claude (by Anthropic). To use this feature, you need your own Anthropic API key. Your key is stored only in your browser's local storage and is sent directly to Anthropic's API — it is never stored on any server.

Deploying Your Own

The Threshing Floor deploys to Cloudflare Pages with zero configuration:

Fork the repository on GitHub
Connect it to Cloudflare Pages
Set the build output directory to public
Deploy

That's it. No environment variables, no build step, no dependencies to install.

Citation

Thomas, J.E. (2026). The Threshing Floor: A browser-based tool for Reddit data collection and export. https://github.com/jethomasphd/The_Threshing_Floor

A Jacob E. Thomas artifact. Built with deliberate attention.

The Floor

What is Thresh?

Quick Start

How It Works

Recent Collections

Who Uses This — And How

Public Health Researcher

Journalist

Graduate Student

Community Organizer

Step 1 — Thresh

Collecting…

Step 2 — Harvest

No Data Yet

Step 4 — Glean

Nothing to Export

Export Preview

AI Research Report

Research Report

Step 3 — Winnow

No Data to Analyze Yet

Built-in Analysis (No API Key Needed)

Claude AI Analysis (Requires API Key)

Word Frequency

Claude's Analysis

About

A Letter from the Builder

What Is This?

How It Works

The AI Research Report

Your Data & Storage

Clearing Your History

The Rate Limit Gauge

Ethical Considerations

Provenance

AI Analysis (Optional)

Deploying Your Own

Citation

The Floor

What is Thresh?

Quick Start

How It Works

Recent Collections

Who Uses This — And How

Public Health Researcher

Journalist

Graduate Student

Community Organizer

Step 1 — Thresh

Collecting…

Step 2 — Harvest

No Data Yet

Step 4 — Glean

Nothing to Export

Export Preview

AI Research Report

Research Report

Step 3 — Winnow

No Data to Analyze Yet

Built-in Analysis (No API Key Needed)

Claude AI Analysis (Requires API Key)

Word Frequency

Claude's Analysis

About

A Letter from the Builder

What Is This?

How It Works

The AI Research Report

Your Data & Storage

Clearing Your History

The Rate Limit Gauge

Ethical Considerations

Provenance

AI Analysis (Optional)

Deploying Your Own

Citation

Anthropic API Key