Kosal Ang
Wed Jul 27 2022
A sitemap is important for getting your website ranked in search engines. In this article, I'd like to show you the basics of building a sitemap generator with Python.
Requests is one of the most downloaded Python packages today. Requests allows you to send HTTP/1.1 requests extremely easily. There's no need to manually add query strings to your URLs or to form-encode your PUT and POST data; nowadays, just use the json method!
To install requests from PyPI:

```
pip install requests
```
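As a quick sketch of how little code a request takes (httpbin.org here is just a public test endpoint, not part of this project):

```python
import requests

# Fetch a test endpoint and decode its JSON body in one call
resp = requests.get('https://httpbin.org/get', params={'q': 'sitemap'})
print(resp.status_code)      # 200 on success
print(resp.json()['args'])   # {'q': 'sitemap'} - the query string was added for us
```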
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
To install BeautifulSoup from PyPI:

```
pip install beautifulsoup4
```
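For a feel of the API, here is a minimal sketch with a made-up HTML snippet; it uses the same `find_all` pattern the crawler relies on later:

```python
from bs4 import BeautifulSoup

# A made-up snippet: extract every href from the anchor tags
html = '<p><a href="/about">About</a> <a href="#top">Top</a></p>'
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    print(a['href'])   # prints /about, then #top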
html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.
To install html5lib from PyPI:

```
pip install html5lib
```
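Beautiful Soup uses html5lib when you name it as the parser argument. It is slower than the built-in parser, but it handles broken markup the way a browser would; a small sketch:

```python
from bs4 import BeautifulSoup

# html5lib recovers from unclosed tags just like a browser
soup = BeautifulSoup('<ul><li>one<li>two', 'html5lib')
print([li.text for li in soup.find_all('li')])   # ['one', 'two']
```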
We define a clean() function to filter and normalize the extracted links:

```python
def clean(a_eles):
    links = []
    skip_links = []
    for a in a_eles:
        link = a['href']
        # Skip in-page anchors, mail links, and the bare root path
        if link.startswith('#') or link.startswith('mailto:') or link == '/':
            skip_links.append(link)
            continue

        # Turn root-relative paths into absolute URLs
        if link.startswith('/'):
            link = '{}{}'.format(base_url, link)

        # Treat anything without a scheme as relative to the base URL
        if not link.startswith(('http://', 'https://')):
            link = '{}/{}'.format(base_url, link)

        # Ignore external links
        if not link.startswith(base_url):
            continue

        if link not in links:
            links.append(link)

    return [links, skip_links]
```
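To see clean() in action, here is a small hedged driver with made-up anchors (it assumes base_url is set as in the complete script below):

```python
from bs4 import BeautifulSoup

base_url = 'https://www.khmernokor.com'   # matches the complete script below

html = """
<a href="#top">Top</a>
<a href="/posts">Posts</a>
<a href="contact">Contact</a>
<a href="https://example.com/">External</a>
"""
a_eles = BeautifulSoup(html, 'html5lib').find_all('a', href=True)
links, skip_links = clean(a_eles)
print(links)       # ['https://www.khmernokor.com/posts', 'https://www.khmernokor.com/contact']
print(skip_links)  # ['#top']
```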
The get_next_scan_urls() function returns only the URLs that haven't been scanned yet:

```python
def get_next_scan_urls(urls):
    # Keep only URLs we haven't visited yet
    links = []
    for u in urls:
        if u not in scanned:
            links.append(u)
    return links
```
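One note: `u not in scanned` is a linear search over a list, so the check gets slower as the crawl grows. If scanned were a set instead, the same function could look like the variant below (scan() would then call scanned.add(url) rather than scanned.append(url)):

```python
scanned = set()   # set membership checks are O(1) on average

def get_next_scan_urls(urls):
    # Keep only URLs we haven't visited, preserving their order
    return [u for u in urls if u not in scanned]
```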
The scan() function performs the actual crawling: it sends a request to each URL, extracts the links, and recursively scans any links that haven't been scanned before:

```python
def scan(url):
    if url not in scanned:
        print('Scan url: {}'.format(url))
        scanned.append(url)
        # Fetch the page and pull out every anchor that has an href
        data = requests.get(url)
        soup = BeautifulSoup(data.text, 'html5lib')
        a_eles = soup.find_all('a', href=True)
        links, skip_links = clean(a_eles)

        # Recurse into every link we haven't seen yet
        next_scan_urls = get_next_scan_urls(links)
        print('Count next scan: {}'.format(len(next_scan_urls)))
        for l in next_scan_urls:
            scan(l)
    return scanned
```
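A caveat worth knowing: scan() recurses once per newly discovered page, so a site with more than about a thousand pages can exceed Python's default recursion limit. Here is a sketch of an iterative alternative that reuses the same helpers:

```python
from collections import deque

def scan_iterative(start_url):
    # Breadth-first crawl with an explicit queue instead of recursion
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in scanned:
            continue
        print('Scan url: {}'.format(url))
        scanned.append(url)
        data = requests.get(url)
        soup = BeautifulSoup(data.text, 'html5lib')
        links, _ = clean(soup.find_all('a', href=True))
        queue.extend(get_next_scan_urls(links))
    return scanned
```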
The main() function kicks off the scan from the given website, builds a <url> entry for every discovered link, and finally writes the generated XML into a file named sitemap.xml:

```python
def main():
    links = scan(website)

    urls = ''
    for l in links:
        urls += f"""    <url>
        <loc>{l}</loc>
        <lastmod>2022-07-27T02:24:08.242Z</lastmod>
        <priority>0.6</priority>
    </url>
"""

    # The XML declaration must be the very first thing in the file
    xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{urls}</urlset>
"""

    with open('sitemap.xml', 'w') as f:
        f.write(xml)
```
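Two refinements you may want: the lastmod value above is hard-coded, and URLs containing characters such as & must be escaped in XML. Below is a hedged sketch of both fixes using only the standard library (build_sitemap is a name I'm introducing for illustration, not part of the original script):

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape

def build_sitemap(links):
    # Stamp entries with the actual generation time instead of a fixed date
    lastmod = datetime.now(timezone.utc).isoformat()
    urls = ''
    for l in links:
        urls += f"""    <url>
        <loc>{escape(l)}</loc>
        <lastmod>{lastmod}</lastmod>
        <priority>0.6</priority>
    </url>
"""
    return f"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{urls}</urlset>
"""
```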
The complete script:

```python
#!/usr/bin/env python

import requests
from bs4 import BeautifulSoup

website = 'https://www.khmernokor.com'
base_url = website
if website.endswith('/'):
    base_url = website[:-1]

scanned = []


def clean(a_eles):
    links = []
    skip_links = []
    for a in a_eles:
        link = a['href']
        # Skip in-page anchors, mail links, and the bare root path
        if link.startswith('#') or link.startswith('mailto:') or link == '/':
            skip_links.append(link)
            continue

        # Turn root-relative paths into absolute URLs
        if link.startswith('/'):
            link = '{}{}'.format(base_url, link)

        # Treat anything without a scheme as relative to the base URL
        if not link.startswith(('http://', 'https://')):
            link = '{}/{}'.format(base_url, link)

        # Ignore external links
        if not link.startswith(base_url):
            continue

        if link not in links:
            links.append(link)

    return [links, skip_links]


def get_next_scan_urls(urls):
    # Keep only URLs we haven't visited yet
    links = []
    for u in urls:
        if u not in scanned:
            links.append(u)
    return links


def scan(url):
    if url not in scanned:
        print('Scan url: {}'.format(url))
        scanned.append(url)
        # Fetch the page and pull out every anchor that has an href
        data = requests.get(url)
        soup = BeautifulSoup(data.text, 'html5lib')
        a_eles = soup.find_all('a', href=True)
        links, skip_links = clean(a_eles)

        # Recurse into every link we haven't seen yet
        next_scan_urls = get_next_scan_urls(links)
        print('Count next scan: {}'.format(len(next_scan_urls)))
        for l in next_scan_urls:
            scan(l)
    return scanned


def main():
    links = scan(website)

    urls = ''
    for l in links:
        urls += f"""    <url>
        <loc>{l}</loc>
        <lastmod>2022-07-27T02:24:08.242Z</lastmod>
        <priority>0.6</priority>
    </url>
"""

    # The XML declaration must be the very first thing in the file
    xml = f"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{urls}</urlset>
"""

    with open('sitemap.xml', 'w') as f:
        f.write(xml)


if __name__ == '__main__':
    main()
```
Output:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://www.khmernokor.com</loc>
        <lastmod>2022-07-27T02:24:08.242Z</lastmod>
        <priority>0.6</priority>
    </url>

    <url>
        <loc>https://www.khmernokor.com/question-answers/jiwn5gvqpu</loc>
        <lastmod>2022-07-27T02:24:08.242Z</lastmod>
        <priority>0.6</priority>
    </url>

    <url>
        <loc>https://www.khmernokor.com/bun98</loc>
        <lastmod>2022-07-27T02:24:08.242Z</lastmod>
        <priority>0.6</priority>
    </url>

    <url>
        <loc>https://www.khmernokor.com/yuravandy</loc>
        <lastmod>2022-07-27T02:24:08.242Z</lastmod>
        <priority>0.6</priority>
    </url>
    <!-- ... -->
</urlset>
```
I hope this article gives you some ideas for generating your own sitemaps.