How to Parse an XML Sitemap (Extract All URLs)

Published on February 7, 2026

What Is a Sitemap?

A sitemap is an XML file that lists the URLs of a website along with optional metadata about each page. It acts as a roadmap for search engine crawlers, telling them which pages exist, how often they change, and when they were last modified. Sitemaps are one of the most important tools in technical SEO, helping search engines discover and index content efficiently.

Beyond SEO, sitemaps are incredibly useful for web scraping and data collection. Instead of crawling an entire website link by link, you can parse the sitemap to get a complete list of URLs in seconds. This approach is faster, more reliable, and puts less load on the target server compared to recursive crawling.

XML Sitemap Format

A standard XML sitemap follows the Sitemaps Protocol (sitemaps.org). The root element is <urlset>, which contains one or more <url> elements. Each URL entry includes the page location and optional metadata.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-02-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/blog/my-post</loc>
    <lastmod>2026-02-05</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
  </url>
</urlset>

The <loc> tag is the only required element and contains the absolute URL of the page. The <lastmod> tag indicates when the page was last modified. The <changefreq> tag suggests how often the page changes (always, hourly, daily, weekly, monthly, yearly, never), and <priority> indicates the relative importance of the page within the site on a scale from 0.0 to 1.0.
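The structure above can be parsed with nothing but the Python standard library. One detail trips people up: the Sitemaps Protocol puts every element in the `http://www.sitemaps.org/schemas/sitemap/0.9` namespace, so lookups must be namespace-qualified. A minimal sketch, using the sample sitemap from above:

```python
# Minimal sketch: extract <loc> and <lastmod> from a sitemap with the
# standard library. The XML is the sample shown above.
import xml.etree.ElementTree as ET

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-02-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2026-01-15</lastmod>
  </url>
</urlset>"""

# The protocol namespace must be mapped explicitly; without it,
# findall("url") would match nothing, because every element in the
# document is namespaced.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_text):
    """Return a list of {'loc': ..., 'lastmod': ...} dicts."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)  # None if absent
        entries.append({"loc": loc.strip(), "lastmod": lastmod})
    return entries

for entry in parse_sitemap(SITEMAP_XML):
    print(entry["loc"], entry["lastmod"])
```

The same function works on a fetched sitemap body; only `<loc>` is guaranteed to be present, which is why `lastmod` is allowed to come back as `None`.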

Sitemap Index Files

Large websites with thousands or millions of pages cannot fit all URLs into a single sitemap file. The Sitemaps Protocol limits each sitemap to 50,000 URLs and 50 MB (uncompressed). To handle this, websites use sitemap index files that reference multiple individual sitemaps.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-02-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-02-06</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-categories.xml</loc>
    <lastmod>2026-01-30</lastmod>
  </sitemap>
</sitemapindex>

When parsing a sitemap, you first need to determine whether you are dealing with a regular sitemap or an index file. If it is an index, you need to fetch and parse each referenced sitemap individually. Many sitemaps are also gzip-compressed (with a .xml.gz extension) to save bandwidth.
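Both checks are straightforward in code: the root element's tag tells you whether you have an index or a urlset, and gzip data is recognizable by its two magic bytes. A minimal sketch (fetching the referenced sitemaps is left to the caller):

```python
# Minimal sketch: tell a sitemap index apart from a regular sitemap by
# inspecting the root tag, and transparently handle gzip-compressed input.
import gzip
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def decompress_if_gzipped(data: bytes) -> bytes:
    # Every gzip stream starts with the magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        return gzip.decompress(data)
    return data

def classify_sitemap(data: bytes):
    """Return ('index', [sitemap URLs]) or ('urlset', [page URLs])."""
    root = ET.fromstring(decompress_if_gzipped(data))
    locs = [el.text.strip() for el in root.iter(NS + "loc")]
    if root.tag == NS + "sitemapindex":
        return "index", locs
    return "urlset", locs

INDEX_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
</sitemapindex>"""

kind, urls = classify_sitemap(gzip.compress(INDEX_XML))
print(kind, urls)
```

If `classify_sitemap` returns `'index'`, loop over the returned URLs, fetch each one, and run the same classification again; nested indexes are rare but legal, so a recursive approach is the safest default.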

Finding a Website's Sitemap

There are several ways to find a website's sitemap. The most common locations are:

# Common sitemap locations
https://example.com/sitemap.xml
https://example.com/sitemap_index.xml
https://example.com/sitemap/sitemap.xml

# Check robots.txt for sitemap declarations
https://example.com/robots.txt
# Look for: Sitemap: https://example.com/sitemap.xml

The robots.txt file is the most reliable source. The Sitemaps Protocol allows sitemaps to be declared in robots.txt using the Sitemap: directive, and most well-configured sites do so. Many CMS platforms like WordPress, Shopify, and Squarespace automatically generate sitemaps at predictable URLs.
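Extracting those declarations is a one-liner-grade task: the directive name is case-insensitive and its value is an absolute URL. A minimal sketch that works on a fetched robots.txt body (fetching itself is omitted):

```python
# Minimal sketch: pull Sitemap: declarations out of a robots.txt body.
# The directive name is matched case-insensitively.
def sitemaps_from_robots(robots_txt: str):
    urls = []
    for line in robots_txt.splitlines():
        # Split on the FIRST colon only, so the "https://" in the URL
        # is left intact in the value part.
        name, _, value = line.partition(":")
        if name.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls

ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap-news.xml"""

print(sitemaps_from_robots(ROBOTS))
```

If robots.txt yields nothing, fall back to probing the common locations listed above and checking for a 200 response with an XML body.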

SEO Importance of Sitemaps

Sitemaps play a critical role in search engine optimization. They help search engines discover new pages quickly, especially on large sites where internal linking may not reach every page. For new websites with few external links, a sitemap can be the primary way Google discovers your content.

Sitemaps also communicate which pages are most important through the priority element and when content was last updated through the lastmod element. While search engines treat these as hints rather than directives, accurate lastmod dates can help search engines prioritize recrawling recently updated content. Sites with frequently changing content benefit significantly from well-maintained sitemaps.

How to Use the Sitemap Parser

Open the Sitemap Parser and follow these steps:

  1. Enter a sitemap URL — paste the full URL of any XML sitemap or sitemap index file.
  2. View extracted URLs — the tool parses the XML and displays all URLs with their metadata in a clean table.
  3. Handle sitemap indexes — if the URL points to a sitemap index, the tool lists all referenced sitemaps for you to explore.
  4. Download the results — export the URL list as a text file or use it for further analysis and scraping.

The sitemap parser is ideal for SEO audits, competitive analysis, and building URL lists for web scraping projects. It handles both regular sitemaps and sitemap index files automatically.

Ready to parse a sitemap? Open Sitemap Parser

Need to extract data from websites?

PulpMiner turns any webpage into a structured JSON API. No scraping code needed — just point, click, and get clean data.

Try PulpMiner Free

No credit card required