URLExtract
----------

URLExtract is a Python class for collecting (extracting) URLs from a given
text based on locating TLDs.

.. image:: https://img.shields.io/travis/lipoja/URLExtract/master.svg
    :target: https://travis-ci.org/lipoja/URLExtract
    :alt: Build Status
.. image:: https://img.shields.io/github/tag/lipoja/URLExtract.svg
    :target: https://github.com/lipoja/URLExtract/tags
    :alt: Git tag
.. image:: https://img.shields.io/pypi/pyversions/urlextract.svg
    :target: https://pypi.python.org/pypi/urlextract
    :alt: Python Version Compatibility


How does it work
~~~~~~~~~~~~~~~~

It tries to find any occurrence of a TLD in the given text. If a TLD is found,
it starts from that position and expands the boundaries to both sides,
searching for a "stop character" (usually whitespace, a comma, or a single or
double quote).

A DNS check option is available to also reject invalid domain names.

NOTE: The list of TLDs is downloaded from iana.org to keep you up to date with new TLDs.

Installation
~~~~~~~~~~~~

The package is available on PyPI, so you can install it via pip.

.. image:: https://img.shields.io/pypi/v/urlextract.svg
    :target: https://pypi.python.org/pypi/urlextract
.. image:: https://img.shields.io/pypi/status/urlextract.svg
    :target: https://pypi.python.org/pypi/urlextract

::

   pip install urlextract

Documentation
~~~~~~~~~~~~~

Online documentation is published at http://urlextract.readthedocs.io/


Requirements
~~~~~~~~~~~~

- IDNA for converting links to IDNA format
- uritools for domain name validation
- appdirs for determining user's cache directory
- dnspython to cache DNS results

   ::

       pip install idna
       pip install uritools
       pip install appdirs
       pip install dnspython

Example
~~~~~~~

You can look at the command line program at the end of *urlextract.py*,
but everything you need to know is this:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    urls = extractor.find_urls("Text with URLs. Let's have URL janlipovsky.cz as an example.")
    print(urls)  # prints: ['janlipovsky.cz']

Or you can get a generator over the URLs in a text:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    example_text = "Text with URLs. Let's have URL janlipovsky.cz as an example."

    for url in extractor.gen_urls(example_text):
        print(url)  # prints: janlipovsky.cz

Or, if you just want to check whether there is at least one URL, you can do:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    example_text = "Text with URLs. Let's have URL janlipovsky.cz as an example."

    if extractor.has_urls(example_text):
        print("Given text contains some URL")

If you want to have an up-to-date list of TLDs, you can use ``update()``:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    extractor.update()

or the ``update_when_older()`` method:

.. code:: python

    from urlextract import URLExtract

    extractor = URLExtract()
    extractor.update_when_older(7)  # updates when the list is older than 7 days

Known issues
~~~~~~~~~~~~

Since a TLD can be not only an abbreviation but also a meaningful word, we might see "false matches" when searching
for URLs in HTML pages. A false match can occur, for example, in CSS or JS when you refer to an HTML element
by its classes.

Example HTML code:

.. code-block:: html

  <p class="bold name">Jan</p>
  <style>
    p.bold.name {
      font-weight: bold;
    }
  </style>

If this HTML snippet is passed to ``urlextract.find_urls()``, it will return ``p.bold.name`` as a URL.
This behavior of urlextract is correct, because ``.name`` is a valid TLD: urlextract just sees that ``bold.name``
is a valid domain name and ``p`` is a valid sub-domain.

License
~~~~~~~

This piece of code is licensed under The MIT License.