Explained: How to Get Domain From URL in Python

There are various circumstances in which you might need to extract the domain from a URL while working with web data in Python.

For operations like web scraping, data analysis, and website security, obtaining the domain can be crucial.

Three techniques for obtaining the domain from a URL in Python will be covered in this article: Regex, the tldextract module, and the urlparse module.

Advertising links are marked with *. We receive a small commission on sales, nothing changes for you.

Understanding URLs and Domains in Python


URLs, which are made up of multiple components such as the scheme, host, and path, are used to access web pages and other online resources.

The domain is an important aspect of a URL and is commonly known as the website address that you enter into the address bar of your browser.

When working with web data in Python, you will often need to extract the domain from a URL. Domain extraction is essential for tasks such as web scraping, data analysis, and security.

The extraction of domains from URLs assures that the data obtained during web scraping initiatives comes from a trustworthy source.

Furthermore, collecting domains from URLs aids in the detection and prevention of phishing attacks.

We will describe these concepts in plain terms throughout this text, avoiding technical jargon as much as possible.

Our goal is to provide a straightforward, understandable explanation of how to extract domains from URLs using Python.

Tip: Find out if your URL is valid with Python (blog post).

3 Popular Python Techniques for Extracting Domains from URLs

As a Python developer working with the web, you might frequently need to extract the domain from a given URL. Fortunately, Python offers several ways to do this.

Three popular techniques for obtaining domains from URLs in Python will be covered in this section along with useful examples.

#1: Using the urlparse Module

One simple method for obtaining the domain from a URL is Python's built-in urllib.parse module. Its urlparse function breaks a URL down into its constituent parts, including the network location (netloc), which holds the domain.

Here is an illustration of code that uses urlparse to extract the domain from a URL:

from urllib.parse import urlparse

url = "https://www.example.com/path/to/page.html"
parsed_url = urlparse(url)
domain = parsed_url.netloc

print(domain) # Output: "www.example.com"

The code above first imports the urlparse function from the built-in urllib.parse module.

The URL from which we want to extract the domain is then defined.

The URL is broken down into its component parts using the urlparse function, and the domain is extracted from the netloc attribute.
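One caveat worth knowing: urlparse only fills in netloc when the URL actually includes a scheme such as https://. A short sketch of this gotcha (the URLs are illustrative):

```python
from urllib.parse import urlparse

# With a scheme, the host lands in netloc as expected.
print(urlparse("https://www.example.com/page").netloc)  # "www.example.com"

# Without a scheme, urlparse treats the whole string as a path,
# so netloc comes back empty.
print(urlparse("www.example.com/page").netloc)  # ""

# One possible workaround: prepend a scheme before parsing.
url = "www.example.com/page"
if "://" not in url:
    url = "https://" + url
print(urlparse(url).netloc)  # "www.example.com"
```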

#2: Using Regular Expressions

You can extract the domain from a URL by matching it with a specified pattern if you prefer regular expressions (regex).

If you need to extract the domain from a URL that doesn’t adhere to a conventional format, this method can be quite helpful.

Here is an example:

import re

url = "https://www.example.com/path/to/page.html"
pattern = r"(?P<domain>(?:[\w\-]+\.)+[\w\-]+)"
match = re.search(pattern, url)
domain = match.group("domain")

print(domain) # Output: "www.example.com"

The re module, which offers functionality for regular expressions, is first imported in the code above.

Next, we specify the URL from which we wish to extract the domain and define a regex pattern that matches a domain name as one or more dot-separated labels. We locate the pattern in the URL with the re.search function, then use the named group (?P<domain>...) to extract the domain.
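To illustrate the point about non-conventional formats, here is a small sketch (with illustrative, scheme-less inputs) showing that a regex of this kind still finds the domain where urlparse's netloc would come back empty:

```python
import re
from urllib.parse import urlparse

# Scheme-less inputs (illustrative): urlparse finds no netloc here.
urls = ["www.example.com/page1", "example.co.uk/about"]

# Match one or more dot-separated labels as the domain.
pattern = r"(?P<domain>(?:[\w\-]+\.)+[\w\-]+)"

for url in urls:
    print(urlparse(url).netloc)  # "" -- no scheme, so no netloc
    match = re.search(pattern, url)
    if match:
        print(match.group("domain"))  # "www.example.com", then "example.co.uk"
```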

#3: Utilizing the tldextract Module

The tldextract module can also be used to extract the domain from a URL. This third-party library is particularly helpful because it can split a URL into its top-level domain (TLD), domain name, and subdomains. Since it is not part of the standard library, you may need to install it first (for example, with pip install tldextract).

Here is an example of some code that uses tldextract to extract the domain from a URL:

import tldextract

url = "https://www.example.com/path/to/page.html"
extracted = tldextract.extract(url)
domain = extracted.domain

print(domain) # Output: "example"

The code above first imports the tldextract module, which provides an extract function for splitting a URL into its domain components.

We specify the URL from which we wish to extract the domain and call the extract function to split it into its constituent parts. The extracted.domain attribute then gives us the domain name.

Examples and Use Cases in Real Life

This section will examine real-world applications and use cases for Python’s domain extraction from URLs. We’ll talk about these things:

  • Web scraping
  • Data analysis
  • Security applications

With useful code samples and explanations, each use case will have its own chapter.

#1: Web Scraping

One of the most popular uses for extracting domains from URLs in Python is web scraping. It’s crucial to confirm that the data being scraped from the web comes from a reliable source.

By extracting the domain from the URL, you can quickly determine whether it belongs to a known source. You can then exclude illegitimate domains and only scrape data from reputable sources.

Here is an illustration of some sample web scraping code that shows how to retrieve the domain from a list of URLs:

from urllib.parse import urlparse

# list of URLs to scrape
urls = ["https://www.example.com/page1.html", "https://www.example.com/page2.html", "https://www.notlegit.com/page3.html"]

# extract domains from URLs
domains = []
for url in urls:
    parsed_url = urlparse(url)
    domain = parsed_url.netloc
    domains.append(domain)

# filter out non-legit domains (a suffix match avoids lookalikes
# such as "www.notexample.com")
legit_domains = []
for domain in domains:
    if domain == "example.com" or domain.endswith(".example.com"):
        legit_domains.append(domain)

# scrape only from legit domains
for url in urls:
    if urlparse(url).netloc in legit_domains:
        # scrape data from this URL
        pass

The urlparse module is used in the code above to extract the domain names from the list of URLs to scrape. Afterwards, we exclude invalid domains and only scrape information from URLs with valid domains.

#2: Data Analysis

Organizing data by domain is frequently helpful when studying web data. For instance, you might want to examine online traffic by domain to identify the most popular domains.

Example:

from urllib.parse import urlparse

data = [
    {"url": "https://www.example.com/page1.html", "views": 100},
    {"url": "https://www.example.com/page2.html", "views": 200},
    {"url": "https://www.notlegit.com/page3.html", "views": 50},
    {"url": "https://www.otherexample.com/page4.html", "views": 300},
]

# group data by domain
domains = {}
for d in data:
    domain = urlparse(d["url"]).netloc
    if domain not in domains:
        domains[domain] = {"views": 0, "pages": []}
    domains[domain]["views"] += d["views"]
    domains[domain]["pages"].append(d["url"])

# print results
for domain, stats in domains.items():
    print(domain, stats["views"], stats["pages"])

We define a list of data records in the code above, each of which has a URL and a view count.

We group the data by domain after extracting the domain from each URL using the urlparse function. The total views and pages for each domain are then printed.

#3: Security Applications

Security applications can also benefit from domain extraction from URLs. One such application is determining whether the domain in a URL is real or not in order to identify phishing attacks.

Below is an example of code that shows how to determine whether a URL is trustworthy or not:

from urllib.parse import urlparse

def is_legit_url(url):
    # extract the domain from the URL, dropping any leading "www."
    domain = urlparse(url).netloc
    if domain.startswith("www."):
        domain = domain[len("www."):]

    # check the domain against an allowlist of legitimate domains
    legit_domains = ["example.com", "google.com", "amazon.com"]
    return domain in legit_domains

With the aforementioned code, we build a function that accepts a URL as input and uses the urlparse module to retrieve the domain.

The domain is then compared to a list of legitimate domains to see if it is legitimate. The function returns True if the domain is authentic. Otherwise, False is returned.
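A check against a fixed allowlist can stumble on subdomains (www.example.com, shop.example.com, and so on). Here is a hedged sketch of a subdomain-aware variant using a suffix match; the allowlist and function name are illustrative:

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["example.com", "google.com"]  # illustrative allowlist

def is_allowed(url):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).netloc.lower()
    for domain in ALLOWED_DOMAINS:
        # Exact match, or a proper subdomain. The leading dot avoids
        # matching lookalikes such as "notexample.com".
        if host == domain or host.endswith("." + domain):
            return True
    return False

print(is_allowed("https://www.example.com/page"))   # True
print(is_allowed("https://example.com/page"))       # True
print(is_allowed("https://notexample.com/page"))    # False
```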

Another security use of domain extraction from URLs is the detection and prevention of cross-site scripting (XSS) attacks. By extracting the domain from a URL and comparing it with the domain of the currently-viewed page, you can help ensure that scripts only run on valid pages.

Here is some sample code that illustrates how to stop XSS attacks:

from flask import Flask, request
from urllib.parse import urlparse

app = Flask(__name__)

@app.route("/submit", methods=["POST"])
def submit():
    url = request.form.get("url", "")  # default to "" if the field is missing
    domain = urlparse(url).netloc

    # check if the domain matches the current page
    # (request.referrer can be None, so fall back to an empty string)
    referrer = request.referrer or ""
    current_domain = urlparse(referrer).netloc
    if domain != current_domain:
        return "Error: Cross-site scripting (XSS) attack detected!"

    # submit data here, then confirm
    return "OK"

We define a Flask app with a route for POST request data submission in the code above.

Using the urlparse module, we extract the domain from the submitted URL and compare it to the domain of the currently shown page using the referrer parameter of the request object.

To avoid a possible XSS attack, we return an error message if the domains do not match.


Conclusion

In this article, we looked at several Python techniques for extracting domains from URLs.

We’ve discussed how crucial it is to be able to determine the domain from a URL for applications like web scraping, data analysis, security, and others. We’ve examined the urlparse module, regex, and the tldextract module as three popular approaches for extracting domains from URLs in Python.

For each technique, we’ve also included real-world examples and use cases, such as spotting phishing scams and preventing cross-site scripting (XSS) attacks.

By implementing these methods in your Python programs, you can extract domains from URLs and put them to use in a variety of tasks.

In conclusion, web developers and data analysts can benefit from learning how to extract domains from URLs in Python.

You can now increase the effectiveness and security of your Python projects by applying the techniques presented in this article 😎.
