How I Find Orphan Pages Fast Using Custom Python Scripts

Stop fighting crawler limits. Here is my exact Python framework for comparing XML sitemaps, log files, and crawl data to unearth orphan pages in seconds.
I still remember staring at a frozen screen. The desktop crawler had been running for 72 hours, eating up 64GB of RAM, only to crash at the 4-million URL mark. I was attempting to run an orphan page analysis for a massive programmatic real estate site. It failed. Hard.
When you are dealing with enterprise sites that generate programmatic content across millions of nodes, standard desktop software simply cannot handle the memory overhead required to map the entire site architecture in a single crawl. Standard tools are great for small sites. They are useless when you need to cross-reference 10 million log file events with a 5-million URL sitemap index.
I realized I needed a different approach. I stopped treating orphan page detection as a crawling problem and started treating it as a data engineering problem.
In this guide, I'm going to show you exactly how I use custom Python scripts to find orphan pages in seconds, bypassing the need for heavy desktop crawlers entirely. We will leverage Pandas and simple set theory to cross-reference massive datasets.
The Crawler Bottleneck (And Why I Abandoned It)
Commercial crawlers are a crutch that makes SEOs lazy. That is my firm belief. We get so used to clicking a button and waiting for a visual interface to hand us a ready-made report that we forget what is actually happening behind the scenes.
An orphan page is simply a URL that exists on your site (or is known to search engines/users) but has no internal links pointing to it. To find them, traditional crawlers will scrape every internal link they can find, build a list of known URLs, and then compare that list against an external data source—like Google Analytics or XML sitemaps.
The problem is the scale of discovery. If you have a site with millions of pages, just traversing the internal graph takes days. Your local machine runs out of memory. The crawler slows down to a halt. You waste a week just trying to get the baseline data.
Python changes this dynamic completely. By exporting raw data from your database, log files, and Google Search Console (GSC), you can use Python to execute a mathematical comparison in mere seconds. No crawling required. Just pure data manipulation.
Gathering Your Three Pillars of Data
To write a script that finds orphans, we need conflicting data sources. We need a list of "what we officially link to" versus "what is actually out there."
I always gather three distinct datasets before I even open my IDE:
1. The Crawl Data (The Known Universe): You still need a crawl, but it doesn't need to be a deep, memory-heavy desktop crawl. I often use cloud crawlers or a database export of all active internal URLs. This represents your connected site architecture.
2. The Server Logs (The Hidden Universe): Server logs are the absolute gold standard for technical SEO. They tell you exactly what Googlebot is requesting, regardless of whether you link to it or not.
3. Google Search Console & Analytics (The User Universe): These platforms provide URLs that are generating impressions or traffic. If a page gets traffic but isn't in your crawl data, it's an orphan.
Server logs are drastically underutilized. Google Analytics is terrible for finding true orphans compared to server logs, mainly because GA only fires on JavaScript execution. If an orphan page is returning a 404 or a 500 error, GA won't see it. The server log catches everything.
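If your logs aren't already sitting in a tidy CSV, a few lines of Python will get them there. Here is a minimal sketch, assuming an Apache/Nginx combined log format and a hypothetical `access.log` file; adapt the regex and the Googlebot filter to your own stack. Keep in mind that log lines record request paths rather than absolute URLs, so either prepend your hostname or normalize both datasets down to bare paths before comparing.

```python
import re
import pandas as pd

# Assumes the common Apache/Nginx "combined" log format; adjust to your server.
LOG_PATTERN = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3}) .* "([^"]*)"$')

rows = []
with open('access.log', encoding='utf-8', errors='ignore') as f:
    for line in f:
        match = LOG_PATTERN.search(line)
        # Keep only Googlebot requests (the last quoted field is the user agent)
        if match and 'Googlebot' in match.group(3):
            rows.append({'Request_URL': match.group(1), 'Status': int(match.group(2))})

# Save in the shape the comparison script below expects.
# Note: these are paths, not full URLs; prepend your host if you compare full URLs.
pd.DataFrame(rows).to_csv('server_logs.csv', index=False)
```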
- Speed: Python's Pandas library can process and cross-reference 5 million URLs in under 5 seconds.
- Scalability: Bypass local RAM limits. Run scripts in the cloud to process gigabytes of server logs instantly.
- Accuracy: Avoid crawler traps and timeout errors. Data-frame comparisons offer mathematically exact discrepancy detection.
The Python Script Breakdown
Set theory is the most underutilized mathematical concept in technical SEO. When you boil it down, finding an orphan page is just a set difference: `Set B - Set A`, where Set B is every URL an external source knows about (logs, GSC, analytics) and Set A is every URL your internal links actually reach.
Let's dive into the code. I use the `pandas` library because it handles massive CSV files effortlessly.
```python
import pandas as pd
from urllib.parse import urlparse

# 1. Load your datasets
crawl_df = pd.read_csv('internal_crawl.csv')
logs_df = pd.read_csv('server_logs.csv')

# 2. Extract and normalize the URLs
def normalize_url(url):
    try:
        # Strip whitespace, lowercase, remove trailing slashes
        parsed = urlparse(str(url).strip().lower())
        clean_url = parsed.scheme + "://" + parsed.netloc + parsed.path
        return clean_url.rstrip('/')
    except Exception:
        return None

crawl_df['clean_url'] = crawl_df['Address'].apply(normalize_url)
logs_df['clean_url'] = logs_df['Request_URL'].apply(normalize_url)

# 3. Create the sets
crawl_set = set(crawl_df['clean_url'].dropna())
logs_set = set(logs_df['clean_url'].dropna())

# 4. Find the orphans (URLs in logs but NOT in crawl)
orphan_urls = logs_set - crawl_set

# 5. Export the findings
orphan_df = pd.DataFrame(list(orphan_urls), columns=['Orphan_URL'])
orphan_df.to_csv('identified_orphans.csv', index=False)

print(f"Found {len(orphan_urls)} orphan pages!")
```
This script is elegantly simple, yet it does the work of an expensive enterprise tool in a fraction of the time. We load the data, we clean it, we turn the columns into mathematical sets, and we subtract the crawl data from the log data. What remains are your orphans.
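The same pattern extends to the other data pillars. Here is a hedged sketch that reuses `normalize_url`, `crawl_set`, and `logs_set` from the script above; the GSC and GA export file names and column headers are hypothetical, so swap in whatever your own exports produce.

```python
import pandas as pd

# Hypothetical file and column names; swap in your own exports.
gsc_df = pd.read_csv('gsc_performance.csv')    # e.g. a GSC "Pages" export
ga_df = pd.read_csv('ga_landing_pages.csv')    # e.g. a GA landing page export

gsc_set = set(gsc_df['page'].dropna().apply(normalize_url))
ga_set = set(ga_df['landing_page'].dropna().apply(normalize_url))

# The "hidden universe" is everything any external source knows about
known_externally = logs_set | gsc_set | ga_set

# Orphans = known externally, but never reached by internal links
orphans = known_externally - crawl_set
print(f"{len(orphans)} URLs found in logs/GSC/GA but missing from the crawl")
```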
“I stopped treating orphan page detection as a crawling problem and started treating it as a data engineering problem.”
The Two Fatal Mistakes People Make
Even with a slick Python script, things can go horribly wrong if you aren't careful with your data pipelines. Over the years, I've ruined my own analyses by falling into two specific traps.
Mistake #1: Ignoring URL normalization before the comparison phase.
If your crawl data exports `https://www.progseo.dev/blog/` but your log file exports `https://www.progseo.dev/blog`, Python will treat these as two completely different strings. The script will flag the second one as an orphan. You will end up with thousands of false positives. You must strip trailing slashes, force lowercase, and remove URL fragments (`#`) before comparing sets.
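A quick sanity check with the `normalize_url` function from the script above shows both the failure mode and the fix:

```python
# Without normalization these are two completely different strings...
a = 'https://www.progseo.dev/blog/'
b = 'https://www.progseo.dev/blog'
print(a == b)                                    # False -> false positive

# ...after normalization they collapse into one canonical form
print(normalize_url(a) == normalize_url(b))      # True
```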
Mistake #2: Assuming your XML sitemap represents your 'source of truth'.
Many SEOs will compare their server logs against their XML sitemap to find orphans. This is incredibly dangerous. Why? Because most CMS platforms auto-generate sitemaps. If an author publishes a post but forgets to link it anywhere on the site, the CMS will still blindly add it to the XML sitemap. If you use the sitemap as your baseline for 'connected pages', you will completely miss these orphans. Your baseline must be an extraction of actual internal HTML links.
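If you can't pull the internal link graph straight from the production database, you can extract it from rendered HTML instead. Here is a minimal sketch using `requests` and BeautifulSoup; the `hub_pages` seed list is a hypothetical example, and a real baseline needs to cover every template on the site.

```python
import requests
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

# Hypothetical seed list; extend it to cover every page type on the site.
hub_pages = ['https://www.progseo.dev/', 'https://www.progseo.dev/blog/']
linked_urls = set()

for page in hub_pages:
    html = requests.get(page, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a', href=True):
        url = urljoin(page, a['href'])
        # Keep internal links only, normalized with the function defined earlier
        if urlparse(url).netloc == 'www.progseo.dev':
            linked_urls.add(normalize_url(url))
```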
Always validate your baseline data. If a script runs locally on your laptop, it's a toy. Real automation lives in the cloud, running on scheduled cron jobs using verified data pipelines.
- Normalize all URLs to lowercase.
- Remove trailing slashes consistently.
- Strip query parameters unless they dictate unique content.
- Remove HTTP/HTTPS protocols if dealing with mixed-scheme legacy sites.
- Always cross-reference with a fresh database pull of internal links.
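Translated into code, that checklist might look like a stricter variant of the earlier `normalize_url` function. Treat this as a sketch: the `KEEP_PARAMS` whitelist below is a hypothetical example and has to be tuned to your own architecture.

```python
from urllib.parse import urlparse, parse_qsl, urlencode

# Hypothetical whitelist: parameters that actually change the content served
KEEP_PARAMS = {'page', 'color'}

def strict_normalize(url, keep_scheme=True):
    parsed = urlparse(str(url).strip().lower())
    # Drop fragments entirely, keep only whitelisted query parameters
    params = [(k, v) for k, v in parse_qsl(parsed.query) if k in KEEP_PARAMS]
    query = '?' + urlencode(params) if params else ''
    host_and_path = parsed.netloc + parsed.path.rstrip('/')
    # Optionally drop the protocol for mixed-scheme legacy sites
    prefix = parsed.scheme + '://' if keep_scheme else ''
    return prefix + host_and_path + query
```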
What To Do With The Orphans
So, the script ran. You have a CSV with 45,000 orphan pages. Panic sets in. What's the next move?
Deleting orphan pages blindly is a recipe for a massive traffic drop. Always check backlinks first. Orphan pages frequently exist because they are legacy pages from a site migration five years ago. They might not be linked internally anymore, but they could possess high-authority external backlinks pointing directly to them.
If you 404 those URLs, you sever the flow of external PageRank.
Instead, I run my newly found `identified_orphans.csv` through the Ahrefs or Majestic API. If an orphan page has referring domains, I 301 redirect it to the closest relevant parent category. If it has traffic but no links, I revive it by physically linking to it from a high-traffic hub. Only if it has zero traffic, zero links, and zero business value do I let it die a natural 404/410 death.
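In practice I script that triage step too. Here is a sketch of the decision tree, assuming you have already enriched the orphan list with backlink and traffic metrics; the file and column names below are hypothetical, and the Ahrefs/Majestic API calls themselves are left out.

```python
import pandas as pd

# Hypothetical enriched export: orphans joined with backlink and traffic data
df = pd.read_csv('identified_orphans_enriched.csv')

def triage(row):
    if row['referring_domains'] > 0:
        return '301 redirect to closest relevant parent category'
    if row['organic_traffic'] > 0:
        return 'Re-link from a high-traffic hub'
    return 'Allow to 404/410'

df['action'] = df.apply(triage, axis=1)
df.to_csv('orphan_action_plan.csv', index=False)
```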
- 3.4 sec: time to process 1M URLs in Pandas
- 0 GB: local storage required (cloud execution)
- 100%: accuracy of set-based discrepancy checks
| Data Source | Best Used For | Risk of False Positives |
|---|---|---|
| Server Logs | Finding historic, deeply hidden orphans | Low |
| Google Analytics | Finding user-accessed unlinked pages | Medium |
| XML Sitemaps | Checking indexing coverage | High (CMS auto-generation) |
| Ahrefs Backlinks | Finding legacy URLs with external link equity | Low |
Automating the Entire Pipeline
Running a script once is a fun exercise. Running it automatically every month is how you build a scalable SEO moat.
At ProgSEO, we deploy these scripts using AWS Lambda or Google Cloud Functions. We set up an S3 bucket where server logs are dumped weekly. A cloud function detects the new log file, pulls the latest internal link graph from the production database, runs the Pandas comparison, and slacks the SEO team a clean CSV of new orphan pages.
This continuous monitoring catches CMS glitches, accidental category deletions, and rogue programmatic URL generations before they cause significant index bloat or crawl budget waste.
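The exact deployment varies by client, but the handler itself is rarely more than a page of code. Here is a rough sketch, not our production function: the bucket keys and Slack webhook URL are placeholders, the CSVs are assumed to already contain a normalized `clean_url` column, and pandas has to ship as a Lambda layer.

```python
import json
import urllib.request
import boto3
import pandas as pd

s3 = boto3.client('s3')
SLACK_WEBHOOK = 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder

def lambda_handler(event, context):
    # Triggered by an S3 "ObjectCreated" event when a new log export lands
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Both CSVs are assumed to already contain a normalized 'clean_url' column
    logs_df = pd.read_csv(s3.get_object(Bucket=bucket, Key=key)['Body'])
    crawl_df = pd.read_csv(s3.get_object(Bucket=bucket, Key='latest_internal_links.csv')['Body'])

    orphans = set(logs_df['clean_url'].dropna()) - set(crawl_df['clean_url'].dropna())

    # Post a summary to a Slack incoming webhook
    message = {'text': f'Orphan check complete: {len(orphans)} new orphan URLs found.'}
    req = urllib.request.Request(SLACK_WEBHOOK, data=json.dumps(message).encode(),
                                 headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(req)

    return {'statusCode': 200, 'orphans_found': len(orphans)}
```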
Frequently Asked Questions

Why use Pandas instead of plain Python lists?
Pandas is optimized for vectorized operations in C. Iterating through a standard Python list of 5 million URLs takes minutes and risks memory leaks; Pandas handles it in seconds.

How should I handle URL parameters?
Use the urllib.parse module to strip parameters before running the comparison. Only keep parameters if your site architecture relies on them to serve unique content.

Does this replace desktop crawlers entirely?
Not at all. Desktop crawlers are phenomenal for deep on-page technical auditing. But for pure URL cross-referencing at the enterprise level, Python is the superior tool.
Level Up Your Technical SEO
Stop doing manual labor. Join the ProgSEO newsletter to get my weekly custom Python scripts for SEO automation, delivered straight to your inbox.
Get the Python Scripts