README.rst
URLExtract
----------

URLExtract is a Python class for collecting (extracting) URLs from given
text based on locating TLDs.

.. image:: https://img.shields.io/travis/lipoja/URLExtract/master.svg
    :target: https://travis-ci.org/lipoja/URLExtract
    :alt: Build Status
.. image:: https://img.shields.io/github/tag/lipoja/URLExtract.svg
    :target: https://github.com/lipoja/URLExtract/tags
    :alt: Git tag
.. image:: https://img.shields.io/pypi/pyversions/urlextract.svg
    :target: https://pypi.python.org/pypi/urlextract
    :alt: Python Version Compatibility


How does it work
~~~~~~~~~~~~~~~~

It tries to find any occurrence of a TLD in the given text. If a TLD is found,
it starts from that position and expands the boundaries to both sides, searching
for a "stop character" (usually whitespace, comma, single or double
quote).

A DNS check option is available to also reject invalid domain names.

NOTE: The list of TLDs is downloaded from iana.org to keep you up to date with new TLDs.

Installation
~~~~~~~~~~~~

The package is available on PyPI; you can install it via pip.

.. image:: https://img.shields.io/pypi/v/urlextract.svg
    :target: https://pypi.python.org/pypi/urlextract
.. image:: https://img.shields.io/pypi/status/urlextract.svg
    :target: https://pypi.python.org/pypi/urlextract

::

    pip install urlextract

Documentation
~~~~~~~~~~~~~

Online documentation is published at http://urlextract.readthedocs.io/


Requirements
~~~~~~~~~~~~

- IDNA for converting links to IDNA format
- uritools for domain name validation
- appdirs for determining user's cache directory
- dnspython to cache DNS results

::

    pip install idna
    pip install uritools
    pip install appdirs
    pip install dnspython

Example
~~~~~~~

You can look at the command line program at the end of *urlextract.py*.
But everything you need to know is this:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
    print(urls)  # prints: ['janlipovsky.cz']

Or you can get a generator over the URLs in the text:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    example_text = "Text with URLs. Let's have URL janlipovsky.cz as an example."

    for url in extractor.gen_urls(example_text):
        print(url)  # prints: janlipovsky.cz

Or if you just want to check whether there is at least one URL, you can do:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    example_text = "Text with URLs. Let's have URL janlipovsky.cz as an example."

    if extractor.has_urls(example_text):
        print("Given text contains some URL")

If you want to have an up-to-date list of TLDs you can use ``update()``:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    extractor.update()

or the ``update_when_older()`` method:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    extractor.update_when_older(7)  # updates when list is older than 7 days

Known issues
~~~~~~~~~~~~

Since a TLD can be not only an abbreviation but also a meaningful word, we might see "false matches" when
searching for URLs in HTML pages. A false match can occur, for example, in CSS or JS when you refer to an HTML
item using its classes.

Example HTML code:

.. code-block:: html

    <p class="bold name">Jan</p>
    <style>
      p.bold.name {
        font-weight: bold;
      }
    </style>

If this HTML snippet is on the input of ``urlextract.find_urls()`` it will return ``p.bold.name`` as a URL.
The behavior of urlextract is correct, because ``.name`` is a valid TLD and urlextract just sees that
``bold.name`` is a valid domain name and ``p`` is a valid sub-domain.

License
~~~~~~~

This piece of code is licensed under The MIT License.
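
Sketch of the TLD-based search
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The search described under "How does it work" can be illustrated with a few
lines of plain Python. This is only a minimal sketch under stated assumptions,
not the code used by urlextract: the names ``SAMPLE_TLDS``, ``STOP_CHARS`` and
``find_candidates`` are hypothetical, the TLD set is a tiny hand-picked sample
instead of the list from iana.org, and no DNS or domain validation is done.

.. code:: python

    import re

    # Hypothetical, tiny TLD sample for illustration only.
    SAMPLE_TLDS = {"cz", "com", "name"}

    # Assumed stop characters: whitespace, comma, single and double quote.
    STOP_CHARS = set(" \t\r\n,'\"")


    def find_candidates(text, tlds=SAMPLE_TLDS, stop_chars=STOP_CHARS):
        """Return the text around each '.<tld>' hit, bounded by stop characters."""
        alternatives = "|".join(re.escape(tld) for tld in sorted(tlds))
        pattern = re.compile(r"\.(?:" + alternatives + r")\b")
        found = []
        for match in pattern.finditer(text):
            # Expand to the left until a stop character or the start of the text.
            left = match.start()
            while left > 0 and text[left - 1] not in stop_chars:
                left -= 1
            # Expand to the right until a stop character or the end of the text.
            right = match.end()
            while right < len(text) and text[right] not in stop_chars:
                right += 1
            # Drop a trailing sentence period and skip duplicates.
            candidate = text[left:right].rstrip(".")
            if candidate not in found:
                found.append(candidate)
        return found


    print(find_candidates("Let's have URL janlipovsky.cz as an example."))
    print(find_candidates("p.bold.name { font-weight: bold; }"))

The second call reproduces the false match from "Known issues": because
``name`` is in the sample TLD set, the CSS selector ``p.bold.name`` is reported
as a candidate.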