<img alt="CZ.NIC" src="https://www.nic.cz/static/www.nic.cz/images/logo_en.png" align="right" /><br/><br/>

# `dns-crawler`

> A crawler for getting info about *(possibly a huge number of)* DNS domains

[![PyPI version shields.io](https://img.shields.io/pypi/v/dns-crawler.svg)](https://pypi.python.org/pypi/dns-crawler/)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/dns-crawler.svg)](https://pypi.python.org/pypi/dns-crawler/)
[![PyPI license](https://img.shields.io/pypi/l/dns-crawler.svg)](https://pypi.python.org/pypi/dns-crawler/)
[![PyPI downloads per week](https://img.shields.io/pypi/dm/dns-crawler.svg)](https://pypi.python.org/pypi/dns-crawler/)

## What does it do?

Despite the name, the crawler gets info for more services than just DNS:

- DNS:
  - all A/AAAA records (for the 2nd level domain and `www.` subdomain), annotated with GeoIP
  - TXT records (with SPF and DMARC parsed for easier filtering)
  - TLSA (for the 2nd level domain and `www.` subdomain)
  - MX
  - DNSSEC validation
  - nameservers:
    - each server IP annotated with GeoIP
    - HOSTNAME.BIND, VERSION.BIND, AUTHORS.BIND and fortune (also for all IPs)
  - users can add custom additional RRs in the config file
- E-mail (for every server from MX):
  - SMTP server banners (optional, ports are configurable)
  - TLSA records
- Web:
  - HTTP status & headers (inc. parsed cookies) for ports 80 & 443 on each IP from A/AAAA records
  - certificate info for HTTPS (optionally with the entire cert chain)
  - webpage content (optional)
  - all of the above is saved for each _step_ in the redirect history – the crawler follows redirects until it gets a non-redirecting status or hits a configurable limit

Answers from name and mail servers are cached, so the crawler shouldn't flood hosting providers with repeated queries.

If you need to configure a firewall, the crawler connects to ports `53` (both UDP and TCP), `25` (TCP), `80` (TCP), and `443` (TCP for now, but we might add UDP with HTTP3…).

See [`result-example.json`](https://gitlab.labs.nic.cz/adam/dns-crawler/-/blob/master/result-example.json) to get an idea what the resulting JSON looks like.

## How fast is it anyway?

A single fairly modern laptop on a ~50 Mbps connection can crawl the entire *.cz* zone (~1.3M second-level domains) overnight, give or take, using 8 workers per CPU thread.

Since the crawler is designed to be parallel, the actual speed depends almost entirely on the worker count. And it can scale across multiple machines almost infinitely, so should you need a million domains crawled in an hour, you can always just throw more hardware at it (see below).

CZ.NIC uses 4 machines in production (8-core Xeon Bronze 3106, 16 GB RAM, gigabit line), and crawling the entire *.cz* zone takes under 3 hours.

## Installation

Create and activate a virtual environment:

```bash
mkdir dns-crawler
cd dns-crawler
python3 -m venv .venv
source .venv/bin/activate
```

Install `dns-crawler`:

```bash
pip install dns-crawler
```

Depending on your OS/distro, you might need to install some system packages. On Debian/Ubuntu, `apt install libicu-dev pkg-config build-essential` should do the trick (assuming you already have python3 installed, of course).

This is enough to make the crawler work, but you will probably get `AttributeError: module 'dns.message' has no attribute 'Truncated'` for a lot of domains. This is because the crawler uses current `dnspython`, but the last release on PyPI is ages behind the current code. It can be fixed easily by installing `dnspython` from git:

```bash
pip install -U git+https://github.com/rthalley/dnspython.git
```

(PyPI [doesn't allow us](https://github.com/pypa/pip/issues/6301) to specify the git URL in the dependencies, unfortunately.)

## Basic usage

To run a single-threaded crawler (suitable for small domain counts), just pass a domain list:

```
$ echo -e "nic.cz\nnetmetr.cz\nroot.cz" > domain-list.txt
$ dns-crawler domain-list.txt > results.json
[2019-12-03 11:03:54] Reading domains from domain-list.txt.
[2019-12-03 11:03:54] Read 3 domains.
[2019-12-03 11:03:55] 1/3
[2019-12-03 11:03:55] 2/3
[2019-12-03 11:03:56] 3/3
[2019-12-03 11:03:56] Finished.
```

Results are printed to stdout – JSON for every domain, separated by `\n`:

```
$ cat results.json
{"domain": "nic.cz", "timestamp": "2019-12-03 10:03:55", "results": {…}}
{"domain": "netmetr.cz", "timestamp": "2019-12-03 10:03:55", "results": {…}}
{"domain": "root.cz", "timestamp": "2019-12-03 10:03:56", "results": {…}}
```

If you want formatted JSONs, just pipe the output through [jq](https://stedolan.github.io/jq/) or your tool of choice: `dns-crawler domain-list.txt | jq`.
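
Since the output is newline-delimited JSON, it's also easy to post-process from Python – a minimal sketch (the file name is just an example):

```python
# Read the newline-delimited JSON produced by `dns-crawler … > results.json`.
import json

with open("results.json") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        print(record["domain"], record["timestamp"])
```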

## Multithreaded crawling

First, you need a Redis server running & listening.

The crawler can run with multiple threads to speed things up when you have a lot of domains to go through. Communication between the controller and workers is done through Redis (this makes it easy to run workers on multiple machines if needed, see below).

Start Redis. The exact command depends on your system. If you want to use a different machine for Redis & the crawler controller, see [CLI parameters for dns-crawler-controller](#dns-crawler-controller).

Feed the domains into the queue and wait for results:

```
$ dns-crawler-controller domain-list.txt > result.json
```

Then (in another shell) start the workers, which process the domains and return results to the controller:

```
$ dns-crawler-workers
```

Using the controller also gives you caching of repeated queries (mailserver banners and hostname.bind/version.bind for nameservers) for free.

### Redis configuration

No special config is needed, but increase the memory limit if you have a lot of domains to process (eg. `maxmemory 2G`). You can also disable disk snapshots to save some I/O time (comment out the `save …` lines). If you're not already using Redis for other things, read its log – there are often some recommendations for performance improvements.
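
If you'd rather not edit `redis.conf`, the same tweaks can be applied at runtime – a sketch using the `redis` Python package (the values are just examples):

```python
# Apply the memory limit and disable RDB snapshots via CONFIG SET.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
r.config_set("maxmemory", "2gb")
r.config_set("save", "")  # empty value disables periodic snapshots
print(r.config_get("maxmemory"))
```

(Note that changes made this way don't survive a Redis restart.)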

## Results

Results are printed to the main process' (`dns-crawler` or `dns-crawler-controller`) stdout – JSON for every domain, separated by `\n`:

```
[2019-05-03 07:38:17] 2/3
{"domain": "nic.cz", "timestamp": "2019-09-24T05:28:06.536991", "results": {…}}
```

The progress info with timestamp is printed to stderr, so you can save just the output easily – `dns-crawler list.txt > results`.

A JSON schema for the output JSON is included in the repository: [`result-schema.json`](https://gitlab.labs.nic.cz/adam/dns-crawler/-/blob/master/result-schema.json), and also an example for nic.cz: [`result-example.json`](https://gitlab.labs.nic.cz/adam/dns-crawler/-/blob/master/result-example.json).

There are several tools for schema validation, viewing, and even code generation.

To validate a result against the schema (CI is set up to do this automatically):

```bash
$ pip install jsonschema
$ jsonschema -i result-example.json result-schema.json # if it prints nothing, it's valid
```

Or, if you don't loathe JS, `ajv` has much better output:

```bash
$ npm i -g ajv-cli
$ ajv validate -s result-schema.json -d result-example.json
```
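
The same check can also be done from Python, which may be handier in scripts – a minimal sketch using the `jsonschema` package installed above:

```python
# Validate the example result against the schema; raises ValidationError if invalid.
import json

from jsonschema import validate

with open("result-schema.json") as f:
    schema = json.load(f)
with open("result-example.json") as f:
    result = json.load(f)

validate(instance=result, schema=schema)
print("valid")
```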

### Storing crawler results

In production, CZ.NIC uses a Hadoop cluster to store the results file after the crawler run is over – see the script in `utils/crawler-hadoop.sh` (it pushes the results file to Hadoop and notifies a Mattermost channel).

You can even pipe the output straight to Hadoop without storing it on your disk:

```
dns-crawler-controller domain-list.txt | ssh user@hadoop-node "HADOOP_USER_NAME=… hadoop fs -put - /path/to/results.json;"
```

### Working with the results

- [R package for dns-crawler output processing](https://gitlab.labs.nic.cz/adam/dnscrawler.parser)

## Usage in Python code

Just import and use the `process_domain` function like so:

```
$ python
>>> from dns_crawler.crawl import process_domain
>>> result = process_domain("nic.cz")
>>> result
{'domain': 'nic.cz', 'timestamp': '2019-09-13T09:21:10.136303', 'results': { …
>>>
>>> result["results"]["DNS_LOCAL"]["DNS_AUTH"]
[{'value': 'a.ns.nic.cz.'}, {'value': 'b.ns.nic.cz.'}, {'value': 'd.ns.nic.cz.'}]
```

The `process_domain` function returns Python `dict`s. If you want JSON, use `from dns_crawler.crawl import get_json_result` instead:

```
$ python
>>> from dns_crawler.crawl import get_json_result
>>> result = get_json_result("nic.cz")
>>> result
# same as above, just converted to JSON
```

This function just calls `crawl_domain` and converts the `dict` to a JSON string. It's used by the workers, so the conversion is done by them to take some pressure off the controller process.
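
This makes it easy to script small crawls without the CLI at all – a minimal sketch that mirrors what `dns-crawler` does (the file names are just placeholders):

```python
# Crawl a small domain list and write newline-delimited JSON results.
from dns_crawler.crawl import get_json_result

with open("domain-list.txt") as infile, open("results.json", "w") as outfile:
    for line in infile:
        domain = line.strip()
        if not domain:
            continue
        outfile.write(get_json_result(domain) + "\n")
```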


## Config file

GeoIP DB paths, DNS resolver IP(s), and timeouts are read from `config.yml` in the working directory, if present.

The default values are listed in [`config.yml`](https://gitlab.nic.cz/adam/dns-crawler/-/blob/master/config.yml) with explanatory comments.

If you're using the multi-threaded crawler (`dns-crawler-controller` & `dns-crawler-workers`), the config is loaded by the controller and shared with the workers via Redis.

You can override it on the worker machines if needed – just create a `config.yml` in their working dir (eg. to set different resolver IP(s) or GeoIP paths on each machine). The config is then merged – directives not defined in the worker config are loaded from the controller one (and defaults are used if they're not defined there either). But – depending on the values you change – you might then get different results from each worker machine, of course.
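
The merge is effectively "worker config over controller config over built-in defaults" – an illustration only (the keys are made up, not the crawler's actual option names):

```python
# Later dicts win: defaults < controller config < worker-local config.
defaults = {"dns_timeout": 2, "geoip_country": "/usr/share/GeoIP/GeoLite2-Country.mmdb"}
controller = {"dns_timeout": 5}
worker = {"geoip_country": "./GeoLite2-Country.mmdb"}

merged = {**defaults, **controller, **worker}
print(merged)  # {'dns_timeout': 5, 'geoip_country': './GeoLite2-Country.mmdb'}
```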

### Using commercial GeoIP DBs

Tell the crawler to use the commercial (GeoIP2 Country and ISP) DBs instead of the free (GeoLite2 Country and ASN) ones:

```yaml
geoip:
  country: /usr/share/GeoIP/GeoLite2-Country.mmdb
  #  asn: /usr/share/GeoIP/GeoLite2-ASN.mmdb  # 'asn' is the free DB
  isp: /usr/share/GeoIP/GeoIP2-ISP.mmdb  # 'isp' is the commercial one
```

(Use either absolute paths or paths relative to the working directory.)

The `ISP` (paid) database is preferred over the `ASN` (free) one if both are defined. The difference is described on MaxMind's website: https://dev.maxmind.com/faq/what-is-the-difference-between-the-geoip-isp-and-organization-databases/.

The free `GeoLite2-Country` seems to be a bit inaccurate, especially for IPv6 (it places some CZ.NIC nameservers in Ukraine etc.).
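
A quick way to sanity-check the configured DB paths is to query them directly with the `geoip2` package – a sketch (the path and IP are just examples):

```python
# Look up an example IP in the configured country DB and print the ISO code.
import geoip2.database

reader = geoip2.database.Reader("/usr/share/GeoIP/GeoLite2-Country.mmdb")
response = reader.country("217.31.205.50")
print(response.country.iso_code)
reader.close()
```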

### Getting additional DNS resource records

You can easily get some additional RRs (for the 2nd level domain) which aren't included in the crawler by default:

```yaml
dns:
  additional:
    - SPF
    - CAA
    - CERT
    - LOC
    - SSHFP
```

See the [List of DNS record types](https://en.wikipedia.org/wiki/List_of_DNS_record_types) for some ideas. Things like OPENPGPKEY won't work though, because they are intended to be used on a subdomain (generated as a hash of part of the e-mail address in this case).

You can plug in a parser for a record by adding a function to the `additional_parsers` enum in `dns_utils.py`. The only one included by default is SPF (since the deprecated SPF record has the same format as the SPF data in TXT, which the crawler gets by default).
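
Such a parser is just a function that turns the raw record value into something more structured – a purely hypothetical sketch (check `dns_utils.py` for the real signature expected by `additional_parsers`):

```python
# Hypothetical CAA parser: '0 issue "letsencrypt.org"' -> structured dict.
def parse_caa(value):
    flag, tag, rest = value.split(" ", 2)
    return {"flag": flag, "tag": tag, "value": rest.strip('"')}
```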

## Command line parameters

### dns-crawler

```
dns-crawler - a single-threaded crawler to process a small number of domains without a need for Redis

Usage: dns-crawler <file>
       file - plaintext domain list, one domain per line, empty lines are ignored
```

### dns-crawler-controller

```
dns-crawler-controller - the main process controlling the job queue and printing results.

Usage: dns-crawler-controller <file> [redis]
       file - plaintext domain list, one domain per line, empty lines are ignored
       redis - redis host:port:db, localhost:6379:0 by default

Examples: dns-crawler-controller domains.txt
          dns-crawler-controller domains.txt 192.168.0.22:4444:0
          dns-crawler-controller domains.txt redis.foo.bar:7777:2
          dns-crawler-controller domains.txt redis.foo.bar # port 6379 and DB 0 will be used if not specified
```

The controller process uses threads (4 for each CPU core) to create the jobs faster when you give it a lot of domains (>1000× CPU core count).

It's *much* faster on (more) modern machines – eg. an i7-7600U (with HT) in a laptop does about 19k jobs/s, while a server with a Xeon X3430 (without HT) does only about 7k (both using 16 threads, as they both appear as 4 cores to the system).

To cancel the process, just send a kill signal or hit `Ctrl-C` any time. The process will perform cleanup and exit.

### dns-crawler-workers

```
dns-crawler-workers - a process that spawns crawler workers.

Usage: dns-crawler-workers [count] [redis]
       count - worker count, 8 workers per CPU core by default
       redis - redis host:port:db, localhost:6379:0 by default

Examples: dns-crawler-workers 8
          dns-crawler-workers 24 192.168.0.22:4444:0
          dns-crawler-workers 16 redis.foo.bar:7777:2
          dns-crawler-workers 16 redis.foo.bar # port 6379 and DB 0 will be used if not specified
```

Trying to use more than 24 workers per CPU core will result in a warning (and a countdown before it actually starts the workers):

```
$ dns-crawler-workers 999
Whoa. You are trying to run 999 workers on 4 CPU cores. It's easy to scale
across multiple machines, if you need to. See README.md for details.

Cancel now (Ctrl-C) or have a fire extinguisher ready.
5 - 4 - 3 -
```

Stopping works the same way as with the controller process – `Ctrl-C` (or a kill signal) will finish the current job(s) and exit.

## Resuming work

Stopping the workers won't delete the jobs from Redis. So, if you stop the `dns-crawler-workers` process and then start a new one (perhaps to use a different worker count…), it will pick up the unfinished jobs and continue.

This can also be used to change the worker count if it turns out to be too low or too high for your machine or network:

- to reduce the worker count, just stop the `dns-crawler-workers` process and start a new one with a new count
- to increase the worker count, either use the same approach, or just start a second `dns-crawler-workers` process in another shell – the worker counts will simply add up
- scaling to multiple machines works the same way, see below

## Running on multiple machines

Since all communication between the controller and workers is done through Redis, it's easy to scale the crawler to any number of machines:

```
machine-1                             machine-1
┌───────────────────────────┐         ┌─────────────────────┐
│    dns-crawler-controller │ ------- │ dns-crawler-workers │
│             +             │         └─────────────────────┘
│           redis           │
│             +             │
│        DNS resolver       │
└───────────────────────────┘
                                      machine-2
                                      ┌─────────────────────┐
                              ------- │ dns-crawler-workers │
                                      └─────────────────────┘

                                      machine-n
                                      ┌─────────────────────┐
                              ------- │ dns-crawler-workers │
                                      └─────────────────────┘
```

Just tell the workers to connect to the shared Redis on the main server, eg.:

```
$ dns-crawler-workers 24 192.168.0.2:6379
                    ^            ^
                    24 workers   redis host
```

Make sure to run the workers with the same (or at least similar) Python version on all machines, otherwise you'll get `unsupported pickle protocol` errors. See the [pickle protocol versions in Python docs](https://docs.python.org/3.8/library/pickle.html#data-stream-format).
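
A quick way to compare machines is to print the interpreter version and its highest pickle protocol on each of them:

```python
# Differing values here are the usual cause of "unsupported pickle protocol" errors.
import pickle
import sys

print(sys.version, "-> highest pickle protocol:", pickle.HIGHEST_PROTOCOL)
```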

The DNS resolver doesn't have to be on the same machine as the `dns-crawler-controller`, of course – just set its IP in `config.yml`. The crawler is tested primarily with CZ.NIC's [Knot Resolver](https://www.knot-resolver.cz/), but should work with any sane resolver supporting DNSSEC. Systemd's `systemd-resolved` seems to be really slow though.

The same goes for Redis: you can point both the controller and the workers to a separate machine running Redis (don't forget to point them to an empty DB if you're using Redis for other things than the dns-crawler – it uses `0` by default).

## Updating dependencies

MaxMind updates GeoIP DBs on Tuesdays, so it may be a good idea to set a cron job to keep them fresh. More about that on [maxmind.com: Automatic Updates for GeoIP2](https://dev.maxmind.com/geoip/geoipupdate/).

If you use multiple machines to run the workers, don't forget to update GeoIP on all of them (or set up a shared location, eg. via sshfs or nfs).

## Monitoring

### Command line

```
$ rq info
default      |████████████████████ 219458
1 queues, 219458 jobs total

0 workers, 1 queues
```
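
The same numbers are available from Python through rq's API (the crawler uses rq for its job queue) – a sketch assuming the default queue on a local Redis:

```python
# Print how many jobs are currently waiting in the default queue.
from redis import Redis
from rq import Queue

q = Queue("default", connection=Redis(host="localhost", port=6379, db=0))
print(len(q), "jobs waiting")
```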

### Web interface

```
$ pip install rq-dashboard
$ rq-dashboard
RQ Dashboard version 0.4.0
 * Serving Flask app "rq_dashboard.cli" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:9181/ (Press CTRL+C to quit)
```

<a href="https://i.vgy.me/sk7zWa.png">
<img alt="RQ Dashboard screenshot" src="https://i.vgy.me/sk7zWa.png" width="40%">
</a>
<a href="https://i.vgy.me/4y5Zee.png">
<img alt="RQ Dashboard screenshot" src="https://i.vgy.me/4y5Zee.png" width="40%">
</a>

## Tests

Some basic tests are in the `tests` directory in this repo. If you want to run them manually, take a look at the `test` stage jobs in `.gitlab-ci.yml`. Basically, it just downloads the free GeoIP DBs, tells the crawler to use them, and crawls some domains, checking values in the JSON output. It runs the tests twice – first with the default DNS resolvers (ODVR) and then with the system one(s).

If you're looking into writing some additional tests, be aware that some Docker containers used in GitLab CI don't have IPv6 configured (even if it's working on the host machine), so checking for eg. `WEB6_80_www_VENDOR` will fail without additional setup.


## OS support

The crawler is developed primarily for Linux, but it should work on any OS supported by Python – at least the worker part (and the controller should work too, if you manage to get a Redis server running on your OS).

One exception is Windows, because it [doesn't support `fork()`](https://github.com/rq/rq/issues/859), but it's possible to get it working under WSL (Windows Subsystem for Linux):

![win10 screenshot](https://i.vgy.me/emJjGN.png)

…so you can turn a gaming machine into an internet crawler quite easily.


## Bug reporting

Please create [issues in this GitLab repo](https://gitlab.labs.nic.cz/adam/dns-crawler/issues).