CamKode

Create a simple sitemap generator with Python

Kosal Ang

Wed Jul 27 2022

A sitemap is important for ranking your website in search engines. In this article I would like to show you the basics of building a sitemap generator with Python.

Libraries used

Requests

Requests is one of the most downloaded Python packages today. It lets you send HTTP/1.1 requests extremely easily: there is no need to manually add query strings to your URLs or to form-encode your PUT and POST data.

To install Requests from PyPI:

```shell
pip install requests
```
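As a quick illustration of the query-string handling mentioned above, a request can be prepared without actually being sent (the URL here is just an example):

```python
import requests

# Build a request without sending it: the `params` dict becomes
# the query string automatically, with encoding handled for us.
req = requests.Request('GET', 'https://example.com/search',
                       params={'q': 'sitemap', 'page': 2})
prepared = req.prepare()
print(prepared.url)  # https://example.com/search?q=sitemap&page=2
```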

BeautifulSoup

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.

To install BeautifulSoup from PyPI:

```shell
pip install beautifulsoup4
```
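Here is a minimal example of the link extraction this article relies on (using the stdlib html.parser so nothing else is required):

```python
from bs4 import BeautifulSoup

# Parse a small HTML snippet and collect every href,
# the same call the crawler makes on real pages.
html = '<p><a href="/about">About</a> <a href="#top">Top</a></p>'
soup = BeautifulSoup(html, 'html.parser')
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)  # ['/about', '#top']
```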

html5lib

html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as it is implemented by all major web browsers.

To install html5lib from PyPI:

```shell
pip install html5lib
```
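html5lib is slower than the stdlib parser, but it repairs malformed markup the way a browser would. For instance, unclosed tags are closed automatically:

```python
from bs4 import BeautifulSoup

# The <li> elements are never closed, but html5lib
# builds the same tree a browser would.
broken = '<ul><li>one<li>two'
soup = BeautifulSoup(broken, 'html5lib')
items = [li.get_text() for li in soup.find_all('li')]
print(items)  # ['one', 'two']
```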

Functionality:

  • Function clean() filters and normalizes the extracted links.
```python
def clean(a_eles):
    links = []
    skip_links = []
    for a in a_eles:
        link = a['href']
        if link.startswith('#') or link.startswith('mailto:') or link == '/':
            skip_links.append(link)
            continue

        # Make root-relative links absolute
        if link.startswith('/'):
            link = '{}{}'.format(base_url, link)

        # Resolve any remaining relative links against the base URL
        if not link.startswith(('http://', 'https://')):
            link = '{}/{}'.format(base_url, link)

        # Skip external links
        if not link.startswith(base_url):
            continue

        if link not in links:
            links.append(link)

    return [links, skip_links]
```
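The same normalization can also be sketched with the standard library's urllib.parse.urljoin, which resolves relative paths more generally (base_url here mirrors the variable defined in the full script):

```python
from urllib.parse import urljoin

base_url = 'https://www.khmernokor.com'  # same base as in the script

def normalize(href):
    """Resolve a raw href against the base URL; return None for links to skip."""
    if href.startswith(('#', 'mailto:')) or href == '/':
        return None
    absolute = urljoin(base_url + '/', href)
    # Drop links that lead off-site
    return absolute if absolute.startswith(base_url) else None

print(normalize('/contact'))            # https://www.khmernokor.com/contact
print(normalize('mailto:me@mail.com'))  # None
print(normalize('https://other.site'))  # None
```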
  • Function get_next_scan_urls() keeps only the URLs that haven't been scanned yet.
```python
def get_next_scan_urls(urls):
    links = []
    for u in urls:
        if u not in scanned:
            links.append(u)
    return links
```
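Note that membership tests on a list are linear, so on a large site keeping the scanned URLs in a set makes each lookup O(1). A minimal sketch:

```python
# Using a set for already-scanned URLs keeps lookups fast.
scanned = {'https://site/a'}
urls = ['https://site/a', 'https://site/b']
next_scan = [u for u in urls if u not in scanned]
print(next_scan)  # ['https://site/b']
```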
  • Function scan() performs the actual crawling: it requests each URL, extracts the links, and recursively scans any links that haven't been visited before.
```python
def scan(url):
    if url not in scanned:
        print('Scan url: {}'.format(url))
        scanned.append(url)
        data = requests.get(url)
        soup = BeautifulSoup(data.text, 'html5lib')
        a_eles = soup.find_all('a', href=True)
        links, skip_links = clean(a_eles)

        next_scan_urls = get_next_scan_urls(links)
        print('Count next scan: {}'.format(len(next_scan_urls)))
        for l in next_scan_urls:
            scan(l)
    return scanned
```
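Because scan() calls itself for every new link, a site with many pages can exceed Python's default recursion limit (about 1000 frames). The same traversal can be written with an explicit worklist instead; the sketch below runs it against a hypothetical in-memory link graph so it stays self-contained:

```python
# Hypothetical link graph standing in for requests + BeautifulSoup,
# so the traversal can be demonstrated without network access.
pages = {
    'https://site/':  ['https://site/a', 'https://site/b'],
    'https://site/a': ['https://site/b'],
    'https://site/b': ['https://site/'],
}

def crawl(start):
    scanned = []
    todo = [start]
    while todo:                # explicit worklist instead of recursion
        url = todo.pop()
        if url in scanned:
            continue
        scanned.append(url)
        todo.extend(pages.get(url, []))
    return scanned

print(crawl('https://site/'))
```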
  • The main() function initiates the scanning process with the given website.
```python
def main():
    links = scan(website)

    urls = ''
    for l in links:
        urls += f"""
    <url>
      <loc>{l}</loc>
      <lastmod>2022-07-27T02:24:08.242Z</lastmod>
      <priority>0.6</priority>
    </url>
        """

    # The XML declaration must be the very first thing in the file
    xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    {urls}
</urlset>
"""

    with open('sitemap.xml', 'w') as f:
        f.write(xml)
```
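One caveat with plain string formatting: special characters in URLs (such as &) are not escaped, which produces invalid XML. A safer sketch using the standard library's xml.etree.ElementTree, which escapes text automatically:

```python
import xml.etree.ElementTree as ET

def build_sitemap(links, lastmod='2022-07-27T02:24:08.242Z', priority='0.6'):
    urlset = ET.Element('urlset',
                        xmlns='http://www.sitemaps.org/schemas/sitemap/0.9')
    for link in links:
        url = ET.SubElement(urlset, 'url')
        ET.SubElement(url, 'loc').text = link   # '&' etc. escaped for us
        ET.SubElement(url, 'lastmod').text = lastmod
        ET.SubElement(url, 'priority').text = priority
    return ET.tostring(urlset, encoding='unicode')

sitemap = build_sitemap(['https://example.com/?a=1&b=2'])
print(sitemap)
```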

Sitemap Generation:

  • After scanning, the script generates an XML sitemap by formatting the URLs obtained during the scanning process.
  • It wraps each URL in <url>, <loc>, <lastmod>, and <priority> tags, adhering to the sitemap XML structure.

Writing to File:

Finally, it writes the generated XML content into a file named sitemap.xml.
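Using a with block when writing the file guarantees it is closed even if the write raises an error; a minimal sketch (the content here is just a placeholder):

```python
xml = '<urlset/>'  # placeholder content for illustration
with open('sitemap.xml', 'w', encoding='utf-8') as f:
    f.write(xml)
```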

Full Code

```python
#!/usr/bin/env python

import requests
from bs4 import BeautifulSoup

website = 'https://www.khmernokor.com'
base_url = website
if website.endswith('/'):
    base_url = website[:-1]

scanned = []


def clean(a_eles):
    links = []
    skip_links = []
    for a in a_eles:
        link = a['href']
        if link.startswith('#') or link.startswith('mailto:') or link == '/':
            skip_links.append(link)
            continue

        # Make root-relative links absolute
        if link.startswith('/'):
            link = '{}{}'.format(base_url, link)

        # Resolve any remaining relative links against the base URL
        if not link.startswith(('http://', 'https://')):
            link = '{}/{}'.format(base_url, link)

        # Skip external links
        if not link.startswith(base_url):
            continue

        if link not in links:
            links.append(link)

    return [links, skip_links]


def get_next_scan_urls(urls):
    links = []
    for u in urls:
        if u not in scanned:
            links.append(u)
    return links


def scan(url):
    if url not in scanned:
        print('Scan url: {}'.format(url))
        scanned.append(url)
        data = requests.get(url)
        soup = BeautifulSoup(data.text, 'html5lib')
        a_eles = soup.find_all('a', href=True)
        links, skip_links = clean(a_eles)

        next_scan_urls = get_next_scan_urls(links)
        print('Count next scan: {}'.format(len(next_scan_urls)))
        for l in next_scan_urls:
            scan(l)
    return scanned


def main():
    links = scan(website)

    urls = ''
    for l in links:
        urls += f"""
    <url>
      <loc>{l}</loc>
      <lastmod>2022-07-27T02:24:08.242Z</lastmod>
      <priority>0.6</priority>
    </url>
        """

    # The XML declaration must be the very first thing in the file
    xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    {urls}
</urlset>
"""

    with open('sitemap.xml', 'w') as f:
        f.write(xml)


if __name__ == '__main__':
    main()
```

Output:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <loc>https://www.khmernokor.com</loc>
      <lastmod>2022-07-27T02:24:08.242Z</lastmod>
      <priority>0.6</priority>
    </url>

    <url>
      <loc>https://www.khmernokor.com/question-answers/jiwn5gvqpu</loc>
      <lastmod>2022-07-27T02:24:08.242Z</lastmod>
      <priority>0.6</priority>
    </url>

    <url>
      <loc>https://www.khmernokor.com/bun98</loc>
      <lastmod>2022-07-27T02:24:08.242Z</lastmod>
      <priority>0.6</priority>
    </url>

    <url>
      <loc>https://www.khmernokor.com/yuravandy</loc>
      <lastmod>2022-07-27T02:24:08.242Z</lastmod>
      <priority>0.6</priority>
    </url>
    <!-- ... -->
</urlset>
```

I hope this article gives you some ideas for generating your own sitemap.

© 2024 CamKode. All rights reserved
