Create Multiple URLs Using Each Path of a URL in Python
Before we start creating multiple URLs using each path of a URL, let's first understand the concept of URL paths. The path of a URL is the part of the web address that follows the domain. For example, in the URL https://example.com/blog/post-1, the path is /blog/post-1.
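As a quick illustration of the paragraph above, the following minimal sketch uses urlparse from the standard library to split the same example URL into its parts. The variable name parts is chosen here only for illustration.

```python
from urllib.parse import urlparse

# Split the example URL from the paragraph above into its components.
parts = urlparse("https://example.com/blog/post-1")

print(parts.scheme)  # https
print(parts.netloc)  # example.com
print(parts.path)    # /blog/post-1
```

The path attribute is the piece the rest of this article works with.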
Here's example code that takes an input URL, splits its path, and generates a list of related URLs based on the hierarchical structure of the path:
from urllib.parse import urlparse


def create_multiple_urls_from_url_path(url):
    results = []
    # Parse once and reuse the result.
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    path = parsed.path
    # Index of every "/" in the path; each one ends a directory level.
    dirs_position = [pos for pos, char in enumerate(path) if char == "/"]
    for i in dirs_position:
        # Keep the path up to and including this slash.
        results.append(base_url + path[0:i + 1])
    return results


url_list = create_multiple_urls_from_url_path("https://www.example/home/dashboard/profile/index.php")
print(url_list)
The code begins by importing the urlparse function from the urllib.parse module and then defines a function named create_multiple_urls_from_url_path, which takes a URL as its parameter and returns a list of related URLs.

Inside the function, an empty list called results is initialized to hold the generated URLs. The input URL is split with urlparse into the base URL (the scheme and netloc) and the path. A list comprehension then builds dirs_position, the index of every forward slash in the path; each slash marks a directory level. For each of those positions, the function joins the base URL with the part of the path up to and including that slash and appends the result to results. Because every generated URL ends at a slash, the final segment (index.php in this example) is not included. The function then returns the list to the caller.

A list like this can be useful in web scraping or site navigation, for example to visit each parent directory of a page, to retrieve data at each level, or to analyze a site's structure.
The output of the above code is as follows:
['https://www.example/', 'https://www.example/home/', 'https://www.example/home/dashboard/', 'https://www.example/home/dashboard/profile/']
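If the full URL should also appear at the end of the list, one possible variation is to split the path into segments instead of scanning for slash positions. This is only a sketch under that assumption: url_prefixes and its include_full parameter are hypothetical names introduced here, not part of urllib or of the code above.

```python
from urllib.parse import urlparse


def url_prefixes(url, include_full=True):
    """Return one URL per directory level, optionally followed by the full URL.

    url_prefixes is a hypothetical helper written for this example.
    """
    parsed = urlparse(url)
    base_url = f"{parsed.scheme}://{parsed.netloc}"
    # Drop empty strings produced by leading, trailing, or doubled slashes.
    segments = [s for s in parsed.path.split("/") if s]
    results = [base_url + "/"]
    # Every segment except the last is treated as a directory level.
    for depth in range(1, len(segments)):
        results.append(base_url + "/" + "/".join(segments[:depth]) + "/")
    # Optionally add the original scheme, netloc, and path as the last entry.
    if include_full and segments:
        results.append(base_url + parsed.path)
    return results


urls = url_prefixes("https://www.example/home/dashboard/profile/index.php")
print(urls)
```

With include_full=False, this sketch produces the same four URLs as the output shown above for the example input; with the default, the index.php URL is added as a fifth entry. Query strings and fragments are ignored in both versions.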