•12 min read
The Automated Keyword Clustering Script I Run Weekly

Stop grouping keywords manually. Here is the exact automated keyword clustering script I use weekly to scale my SEO content strategy without the headache.
My automated keyword clustering script runs every Sunday at 2 AM, and it is the single reason I haven't lost my mind managing millions of search queries. If you are reading this, you probably know the pain. You export a massive list from Ahrefs or Semrush. You open Excel. You stare at thousands of rows, trying to figure out which keywords belong together on a single page. It is tedious. It is soul-crushing. Most importantly, humans are terrible at it. We let our biases dictate structure. We assume words that look similar belong together. After three years of scaling ProgSEO, I realized that manual grouping was the biggest bottleneck in our entire content operation. I needed a system that operated continuously, not just during quarterly audits. Today, I'm pulling back the curtain on the exact Python script I use to process search data, prevent cannibalization, and keep my content pipeline flowing without manual intervention.
Table of Contents
- The Agony of Manual Grouping
- The Core Logic Behind My Script
- The Weekly Workflow & Delta Processing
- Step-by-Step Execution Logic
- Handling Anomalies and Edge Cases
- The Financial Impact of Automation
The Agony of Manual Grouping
Spreadsheets are where good SEO strategies go to die. That is my firm belief after auditing dozens of enterprise sites. When you sit down to manually group keywords, you are fundamentally guessing what Google considers equivalent intent. You look at the phrase best CRM software and top CRM platforms and assume they require the same page. Usually, your gut is right on the obvious ones. But when you start dealing with complex, long-tail modifiers, human intuition completely breaks down.
I see SEOs making the same critical errors over and over. Let's talk about the two massive mistakes people usually make when clustering.
Mistake 1: Relying on lexical similarity instead of SERP overlap. People often use spreadsheet formulas or basic scripts to group keywords that share the same root words. Just because two phrases share vocabulary doesn't mean Google serves the same results for them. Apple tree and apple watch share a word, but the intent is worlds apart.
Mistake 2: Delaying the clustering process until you have a 'massive' list. SEOs love to hoard keyword data. They wait until they have 100,000 keywords before doing a massive, exhausting clustering session. This is incredibly inefficient. It means your writers sit idle, or worse, they write overlapping content in the meantime because the data hasn't been organized yet.
I remember sitting in a dimly lit office back in 2018, staring at a CSV file with 45,000 rows. I was manually dragging and dropping cells. I got through 2,000 rows before realizing I had already created duplicate categories under different naming conventions. Human error is unavoidable at scale. If you are still doing this by hand, you aren't doing SEO. You are doing data entry.
I see SEOs making the same critical errors over and over. Let's talk about the two massive mistakes people usually make when clustering.
Mistake 1: Relying on lexical similarity instead of SERP overlap. People often use spreadsheet formulas or basic scripts to group keywords that share the same root words. Just because two phrases share vocabulary doesn't mean Google serves the same results for them. Apple tree and apple watch share a word, but the intent is worlds apart.
Mistake 2: Delaying the clustering process until you have a 'massive' list. SEOs love to hoard keyword data. They wait until they have 100,000 keywords before doing a massive, exhausting clustering session. This is incredibly inefficient. It means your writers sit idle, or worse, they write overlapping content in the meantime because the data hasn't been organized yet.
I remember sitting in a dimly lit office back in 2018, staring at a CSV file with 45,000 rows. I was manually dragging and dropping cells. I got through 2,000 rows before realizing I had already created duplicate categories under different naming conventions. Human error is unavoidable at scale. If you are still doing this by hand, you aren't doing SEO. You are doing data entry.
SERP-Based Logic
Groups keywords mathematically based on actual ranking URLs, eliminating human bias.
Continuous Delta Processing
Evaluates new keywords against existing live pages to prevent cannibalization before it happens.
Automated Parent Mapping
Automatically selects the keyword with the highest search volume as the primary H1 target.
The Core Logic Behind My Script
Let's dig into the actual mechanics of the code. My script does not try to be artificially intelligent. It tries to be ruthlessly pragmatic. I wrote the entire pipeline in Python because libraries like Pandas handle massive dataframes better than any other tool on earth.
The process starts when the script ingests a raw list of newly discovered keywords. It then loops through these queries and pings a SERP API. I personally use DataForSEO, but ScaleSERP or ValueSERP work beautifully as well. The script pulls the top 10 ranking URLs for every single query in the list.
Here is where the magic algorithm kicks in. The script compares the SERP results of every keyword against every other keyword. If 4 or more URLs match out of the top 10, the script groups them together. That threshold of 4 is the golden ratio. If you set it too high—say, 7 out of 10—you generate too many distinct, hyper-specific pages. This inevitably leads to thin content and cannibalization. If you set it too low, you end up smashing wildly distinct intents into monster pillar pages that never actually rank for anything specific.
A lot of junior Python SEOs try to get fancy with Natural Language Processing. They plug into the OpenAI API or use local models like BERT to cluster keywords based on semantic meaning. In my professional opinion, NLP-based clustering is wildly overrated for SEO compared to raw SERP overlap. Google's algorithm already did the billions of calculations required to figure out search intent. Why would you try to recreate Google's brain when you can just look at Google's output? SERP overlap is the only ground truth that actually matters. It removes the guesswork.
The process starts when the script ingests a raw list of newly discovered keywords. It then loops through these queries and pings a SERP API. I personally use DataForSEO, but ScaleSERP or ValueSERP work beautifully as well. The script pulls the top 10 ranking URLs for every single query in the list.
Here is where the magic algorithm kicks in. The script compares the SERP results of every keyword against every other keyword. If 4 or more URLs match out of the top 10, the script groups them together. That threshold of 4 is the golden ratio. If you set it too high—say, 7 out of 10—you generate too many distinct, hyper-specific pages. This inevitably leads to thin content and cannibalization. If you set it too low, you end up smashing wildly distinct intents into monster pillar pages that never actually rank for anything specific.
A lot of junior Python SEOs try to get fancy with Natural Language Processing. They plug into the OpenAI API or use local models like BERT to cluster keywords based on semantic meaning. In my professional opinion, NLP-based clustering is wildly overrated for SEO compared to raw SERP overlap. Google's algorithm already did the billions of calculations required to figure out search intent. Why would you try to recreate Google's brain when you can just look at Google's output? SERP overlap is the only ground truth that actually matters. It removes the guesswork.
“In my professional opinion, NLP-based clustering is wildly overrated for SEO compared to raw SERP overlap. SERP overlap is the only ground truth that actually matters.”
The Weekly Workflow & Delta Processing
Running a clustering script on a static list of 10,000 keywords is a fun weekend project. Running a production system that ingests 500 new keywords every week and figures out if they belong to existing clusters or require entirely new pages? That is the real engineering challenge. Most tutorials on the internet completely skip this part.
My Sunday 2 AM cron job executes what I call a delta check. During the week, my other automated systems scrape Google Search Console and run competitor gap analyses. This generates a fresh list of 'new' keywords we aren't currently targeting.
When the script wakes up on Sunday, it doesn't just compare these new keywords against each other. It compares their SERPs against a cached database of SERPs from my existing live pages. If a newly discovered keyword shares 4 URLs with a page I published six months ago, it does not get flagged for a new article. Instead, it gets appended to that existing page's update queue.
If you aren't clustering your new keywords against your existing live pages, your content team is probably cannibalizing your own site. You are essentially paying writers to create pages that will compete against your historical content in the SERPs. The weekly script catches this seamlessly. It turns what would have been a redundant, cannibalizing new article into a high-value H2 update for an existing piece of content. This delta processing is what allows a site to scale indefinitely without collapsing under its own weight.
My Sunday 2 AM cron job executes what I call a delta check. During the week, my other automated systems scrape Google Search Console and run competitor gap analyses. This generates a fresh list of 'new' keywords we aren't currently targeting.
When the script wakes up on Sunday, it doesn't just compare these new keywords against each other. It compares their SERPs against a cached database of SERPs from my existing live pages. If a newly discovered keyword shares 4 URLs with a page I published six months ago, it does not get flagged for a new article. Instead, it gets appended to that existing page's update queue.
If you aren't clustering your new keywords against your existing live pages, your content team is probably cannibalizing your own site. You are essentially paying writers to create pages that will compete against your historical content in the SERPs. The weekly script catches this seamlessly. It turns what would have been a redundant, cannibalizing new article into a high-value H2 update for an existing piece of content. This delta processing is what allows a site to scale indefinitely without collapsing under its own weight.
45
Hours Saved Per Month
120+
Cannibalization Issues Prevented Weekly
$2.50
Average API Cost Per Week
Step-by-Step Execution Logic
- Ingest the raw CSV of newly discovered keywords from Search Console and Ahrefs.
- Clean and normalize the queries (remove odd characters, strip trailing whitespace).
- Fetch the Top 10 SERP URLs for each keyword via a SERP API.
- Query the PostgreSQL database for the cached SERPs of existing live pages.
- Calculate the intersection sets between URL lists (comparing new vs. new, and new vs. existing).
- Group into clusters based on the >= 4 URL overlap threshold.
- Select the keyword with the highest search volume in each new cluster as the Parent Keyword.
- Export the cleanly mapped JSON directly into the content management pipeline.
Handling Anomalies and Edge Cases
No automated system is flawless, and anyone who tells you otherwise is trying to sell you software. When you run scripts at scale, you will inevitably encounter weird anomalies. Google frequently returns highly localized results for informational queries, which completely destroys the overlap ratio. Or you'll find split-intent SERPs, where the top 10 results are divided evenly between SaaS product pages and long-form definition articles.
My script handles this by assigning a confidence score to every cluster. When the script flags a cluster with a very low confidence score—for example, exactly 3 overlapping URLs but exceptionally high lexical similarity—it routes it to a manual review queue.
I spend maybe ten to fifteen minutes on Monday morning looking at these specific edge cases. It is a profound shift in my daily workflow. I am no longer building the house from scratch; I am just inspecting the plumbing.
I strongly believe that aiming for 100% automation is a fool's trap. The last 5% of accuracy takes 95% of the development time. If you try to code a script that handles every single edge case, you will spend your life maintaining regular expressions. Build the script to handle the bulk of the data flawlessly. Accept that a tiny bit of human intervention is still required for the strange, ambiguous queries that confuse even Google's engineers.
My script handles this by assigning a confidence score to every cluster. When the script flags a cluster with a very low confidence score—for example, exactly 3 overlapping URLs but exceptionally high lexical similarity—it routes it to a manual review queue.
I spend maybe ten to fifteen minutes on Monday morning looking at these specific edge cases. It is a profound shift in my daily workflow. I am no longer building the house from scratch; I am just inspecting the plumbing.
I strongly believe that aiming for 100% automation is a fool's trap. The last 5% of accuracy takes 95% of the development time. If you try to code a script that handles every single edge case, you will spend your life maintaining regular expressions. Build the script to handle the bulk of the data flawlessly. Accept that a tiny bit of human intervention is still required for the strange, ambiguous queries that confuse even Google's engineers.
| Method | Time Required | Accuracy (Intent) | Scalability |
|---|---|---|---|
| Manual Spreadsheet | Days | Variable (Human Error) | Zero |
| NLP Semantic Clustering | Minutes | Poor (Ignores SERP) | High |
| My SERP Overlap Script | Minutes | Excellent (Google's Truth) | High |
The Financial Impact of Automation
When you automate the foundational steps of your SEO strategy, the financial impact is staggering. Let's run the math. A standard SERP API costs around $1 per 1,000 queries. Processing my weekly delta of 500 keywords costs me exactly fifty cents.
If I were to hand those 500 keywords to a junior SEO analyst, it would take them at least four hours of tedious spreadsheet maneuvering to group them properly, check them against existing content, and format the output. At $30 an hour, that's $120 a week, or roughly $6,200 a year.
But the labor cost is just the tip of the iceberg. The true ROI comes from the mistakes that the script prevents. How much does a cannibalized silo cost your business? If two of your pages are fighting for the same high-intent software query, dropping you from position 2 to position 6, that could cost thousands of dollars in lost affiliate revenue or MRR every single month.
Once the data comes out of my script, it flows directly into our content management system. The parent keyword becomes the H1. The clustered secondary keywords become H2s, H3s, or semantic variations sprinkled naturally throughout the text. Because this runs reliably every week, our writers never have to guess what to work on next. They log in on Monday morning, open the newly generated briefs, and start outlining. The friction between keyword research and actual content creation drops to zero.
This is the real secret to programmatic SEO. It isn't just about generating thousands of pages. It is about ensuring every single page mathematically deserves to exist. By running this script on a continuous loop, my site architecture remains pristine, my writers stay busy, and my search traffic grows linearly.
If I were to hand those 500 keywords to a junior SEO analyst, it would take them at least four hours of tedious spreadsheet maneuvering to group them properly, check them against existing content, and format the output. At $30 an hour, that's $120 a week, or roughly $6,200 a year.
But the labor cost is just the tip of the iceberg. The true ROI comes from the mistakes that the script prevents. How much does a cannibalized silo cost your business? If two of your pages are fighting for the same high-intent software query, dropping you from position 2 to position 6, that could cost thousands of dollars in lost affiliate revenue or MRR every single month.
Once the data comes out of my script, it flows directly into our content management system. The parent keyword becomes the H1. The clustered secondary keywords become H2s, H3s, or semantic variations sprinkled naturally throughout the text. Because this runs reliably every week, our writers never have to guess what to work on next. They log in on Monday morning, open the newly generated briefs, and start outlining. The friction between keyword research and actual content creation drops to zero.
This is the real secret to programmatic SEO. It isn't just about generating thousands of pages. It is about ensuring every single page mathematically deserves to exist. By running this script on a continuous loop, my site architecture remains pristine, my writers stay busy, and my search traffic grows linearly.
I prefer DataForSEO for its cost-effectiveness at scale, but ValueSERP is extremely developer-friendly if you are just starting out.
Yes. While there are SaaS tools that offer SERP clustering, building your own Python script allows you to run weekly delta checks against your specific database of existing pages.
I use 4 out of 10 URLs. If you want tighter, more granular pages, increase it to 5 or 6. If you want massive pillar pages, drop it to 3.
Ready to Automate Your SEO?
Stop wrangling spreadsheets and start scaling your traffic. Check out our advanced guides on programmatic SEO and automation.
Read the Guides