NAME
    AnyEvent::HTTP - simple but non-blocking HTTP/HTTPS client

SYNOPSIS
       use AnyEvent::HTTP;

       http_get "http://www.nethype.de/", sub { print $_[0] };

       # ... do something else here

DESCRIPTION
    This module is an AnyEvent user; you need to make sure that you use and
    run a supported event loop.

    This module implements a simple, stateless and non-blocking HTTP client.
    It supports GET, POST and other request methods, cookies and more, all
    on a very low level. It can follow redirects, supports proxies, and
    automatically limits the number of connections to the values specified
    in the RFC.

    It should generally be a "good client" that is enough for most HTTP
    tasks. Simple tasks should be simple, but complex tasks should still be
    possible as the user retains control over request and response headers.

    The caller is responsible for authentication management, cookies (if the
    simplistic implementation in this module doesn't suffice), referer and
    other high-level protocol details for which this module offers only
    limited support.

  METHODS
    http_get $url, key => value..., $cb->($data, $headers)
        Executes an HTTP-GET request. See the http_request function for
        details on additional parameters and the return value.

    http_head $url, key => value..., $cb->($data, $headers)
        Executes an HTTP-HEAD request. See the http_request function for
        details on additional parameters and the return value.

    http_post $url, $body, key => value..., $cb->($data, $headers)
        Executes an HTTP-POST request with a request body of $body. See the
        http_request function for details on additional parameters and the
        return value.

    http_request $method => $url, key => value..., $cb->($data, $headers)
        Executes an HTTP request of type $method (e.g. "GET", "POST"). The
        URL must be an absolute http or https URL.

        When called in void context, nothing is returned.
        In other contexts, "http_request" returns a "cancellation guard" -
        you have to keep the object alive at least until the callback gets
        called. If the object gets destroyed before the callback is called,
        the request will be cancelled.

        The callback will be called with the response body data as first
        argument (or "undef" if an error occurred), and a hash-ref with
        response headers (and trailers) as second argument.

        All the headers in that hash are lowercased. In addition to the
        response headers, the "pseudo-headers" (capitalised to avoid
        clashing with possible response headers) "HTTPVersion", "Status" and
        "Reason" contain the three parts of the HTTP Status-Line of the same
        name. If an error occurs during the body phase of a request, then
        the original "Status" and "Reason" values from the header are
        available as "OrigStatus" and "OrigReason".

        The pseudo-header "URL" contains the actual URL (which can differ
        from the requested URL when following redirects - for example, you
        might get an error that your URL scheme is not supported even though
        your URL is a valid http URL because it redirected to an ftp URL, in
        which case you can look at the URL pseudo header).

        The pseudo-header "Redirect" only exists when the request was a
        result of an internal redirect. In that case it is an array
        reference with the "($data, $headers)" from the redirect response.
        Note that this response could in turn be the result of a redirect
        itself, and "$headers->{Redirect}[1]{Redirect}" will then contain
        the original response, and so on.

        If the server sends a header multiple times, then their contents
        will be joined together with a comma (","), as per the HTTP spec.

        If an internal error occurs, such as not being able to resolve a
        hostname, then $data will be "undef", "$headers->{Status}" will be
        590-599 and the "Reason" pseudo-header will contain an error
        message.
        Currently the following status codes are used:

        595 - errors during connection establishment, proxy handshake.
        596 - errors during TLS negotiation, request sending and header
            processing.
        597 - errors during body receiving or processing.
        598 - user aborted request via "on_header" or "on_body".
        599 - other, usually nonretryable, errors (garbled URL etc.).

        A typical callback might look like this:

           sub {
              my ($body, $hdr) = @_;

              if ($hdr->{Status} =~ /^2/) {
                 # ... everything should be ok
              } else {
                 print "error, $hdr->{Status} $hdr->{Reason}\n";
              }
           }

        Additional parameters are key-value pairs, and are fully optional.
        They include:

        recurse => $count (default: $MAX_RECURSE)
            Whether to recurse requests or not, e.g. on redirects,
            authentication and other retries and so on, and how often to do
            so.

            Only redirects to http and https URLs are supported. While most
            common redirection forms are handled entirely within this
            module, some require the use of the optional URI module. If it
            is required but missing, then the request will fail with an
            error.

        headers => hashref
            The request headers to use. Currently, "http_request" may
            provide its own "Host:", "Content-Length:", "Connection:" and
            "Cookie:" headers and will provide defaults at least for "TE:",
            "Referer:" and "User-Agent:" (this can be suppressed by using
            "undef" for these headers in which case they won't be sent at
            all).

            You really should provide your own "User-Agent:" header value
            that is appropriate for your program - I wouldn't be surprised
            if the default AnyEvent string gets blocked by webservers sooner
            or later.

            Also, make sure that your header names and values do not
            contain any embedded newlines.
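
            Example: a minimal sketch (not one of the module's own
            examples) that sets a custom "User-Agent:" and suppresses the
            default "Referer:" by passing "undef", as described above. The
            URL, the agent string and the %headers variable are
            placeholders chosen for illustration.

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

# Hypothetical values, for illustration only.
my %headers = (
   "user-agent" => "MyCrawler/0.1",   # replaces the default agent string
   "referer"    => undef,             # undef: do not send this header at all
);

my $done = AnyEvent->condvar;

http_request GET => "http://www.example.com/",
   headers => \%headers,
   sub {
      my ($body, $hdr) = @_;

      # Status is 2xx on success, 59x on internal errors
      print "$hdr->{Status} $hdr->{Reason}\n";
      $done->send;
   };

$done->recv;   # run the event loop until the callback has fired
```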

        timeout => $seconds
            The time-out to use for various stages - each connect attempt
            will reset the timeout, as will read or write activity, i.e.
            this is not an overall timeout.

            Default timeout is 5 minutes.

        proxy => [$host, $port[, $scheme]] or undef
            Use the given http proxy for all requests, or no proxy if
            "undef" is used.

            $scheme must be either missing or must be "http" for HTTP.

            If not specified, then the default proxy is used (see
            "AnyEvent::HTTP::set_proxy").

            Currently, if your proxy requires authorization, you have to
            specify an appropriate "Proxy-Authorization" header in every
            request.

        body => $string
            The request body, usually empty. Will be sent as-is (future
            versions of this module might offer more options).

        cookie_jar => $hash_ref
            Passing this parameter enables (simplified) cookie-processing,
            loosely based on the original netscape specification.

            The $hash_ref must be an (initially empty) hash reference which
            will get updated automatically. It is possible to save the
            cookie jar to persistent storage with something like JSON or
            Storable - see the "AnyEvent::HTTP::cookie_jar_expire" function
            if you wish to remove expired or session-only cookies, and also
            for documentation on the format of the cookie jar.

            Note that this cookie implementation is not meant to be
            complete. If you want complete cookie management you have to do
            that on your own. "cookie_jar" is meant as a quick fix to get
            most cookie-using sites working. Cookies are a privacy disaster,
            do not use them unless required to.

            When cookie processing is enabled, the "Cookie:" and
            "Set-Cookie:" headers will be set and handled by this module,
            otherwise they will be left untouched.

        tls_ctx => $scheme | $tls_ctx
            Specifies the AnyEvent::TLS context to be used for https
            connections.
            This parameter follows the same rules as the "tls_ctx"
            parameter to AnyEvent::Handle, but additionally, the two
            strings "low" or "high" can be specified, which give you a
            predefined low-security (no verification, highest compatibility)
            and high-security (CA and common-name verification) TLS context.

            The default for this option is "low", which could be interpreted
            as "give me the page, no matter what".

            See also the "session" parameter.

        session => $string
            The module might reuse connections to the same host internally.
            Sometimes (e.g. when using TLS), you do not want to reuse
            connections from other sessions. This can be achieved by setting
            this parameter to some unique ID (such as the address of an
            object storing your state data, or the TLS context) - only
            connections using the same unique ID will be reused.

        on_prepare => $callback->($fh)
            In rare cases you need to "tune" the socket before it is used to
            connect (for example, to bind it on a given IP address). This
            parameter overrides the prepare callback passed to
            "AnyEvent::Socket::tcp_connect" and behaves exactly the same way
            (e.g. it has to provide a timeout). See the description for the
            $prepare_cb argument of "AnyEvent::Socket::tcp_connect" for
            details.

        tcp_connect => $callback->($host, $service, $connect_cb,
        $prepare_cb)
            In even rarer cases you want total control over how
            AnyEvent::HTTP establishes connections. Normally it uses
            AnyEvent::Socket::tcp_connect to do this, but you can provide
            your own "tcp_connect" function - obviously, it has to follow
            the same calling conventions, except that it may always return a
            connection guard object.

            There are probably lots of weird uses for this function,
            starting from tracing the hosts "http_request" actually tries to
            connect, to (inexact but fast) host => IP address caching or
            even socks protocol support.
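
            Example: a rough sketch of the tracing use mentioned above.
            The wrapper name "tracing_connect" and the URL are made up for
            this example; the wrapper logs each connection attempt and then
            delegates to the stock AnyEvent::Socket::tcp_connect, handing
            back whatever guard that function returns.

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::Socket ();
use AnyEvent::HTTP;

# Hypothetical wrapper: log the target, then let the stock function connect.
sub tracing_connect {
   my ($host, $service, $connect_cb, $prepare_cb) = @_;

   warn "connecting to $host:$service\n";

   # last statement, so the connection guard is passed back to the caller
   AnyEvent::Socket::tcp_connect ($host, $service, $connect_cb, $prepare_cb)
}

my $done = AnyEvent->condvar;

http_get "http://www.example.com/",
   tcp_connect => \&tracing_connect,
   sub {
      my ($body, $hdr) = @_;

      print "$hdr->{Status} $hdr->{Reason}\n";
      $done->send;
   };

$done->recv;   # run the event loop until the callback has fired
```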

        on_header => $callback->($headers)
            When specified, this callback will be called with the header
            hash as soon as headers have been successfully received from the
            remote server (not on locally-generated errors).

            It has to return either true (in which case AnyEvent::HTTP will
            continue), or false, in which case AnyEvent::HTTP will cancel
            the download (and call the completion callback with an error
            code of 598).

            This callback is useful, among other things, to quickly reject
            unwanted content, which, if it is supposed to be rare, can be
            faster than first doing a "HEAD" request.

            The downside is that cancelling the request makes it impossible
            to re-use the connection. Also, the "on_header" callback will
            not receive any trailer (headers sent after the response body).

            Example: cancel the request unless the content-type is
            "text/html".

               on_header => sub {
                  $_[0]{"content-type"} =~ /^text\/html\s*(?:;|$)/
               },

        on_body => $callback->($partial_body, $headers)
            When specified, all body data will be passed to this callback
            instead of to the completion callback. The completion callback
            will get the empty string instead of the body data.

            It has to return either true (in which case AnyEvent::HTTP will
            continue), or false, in which case AnyEvent::HTTP will cancel
            the download (and call the completion callback with an error
            code of 598).

            The downside to cancelling the request is that it makes it
            impossible to re-use the connection.

            This callback is useful when the data is too large to be held in
            memory (so the callback writes it to a file) or when only some
            information should be extracted, or when the body should be
            processed incrementally.
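
            Example: a minimal sketch of the "write it to a file" case
            described above. The URL and the output path are placeholders,
            and error handling is reduced to aborting the download when a
            write fails.

```perl
use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

# Hypothetical target file, for illustration only.
my $path = "/tmp/anyevent-http-body.out";

open my $fh, ">", $path
   or die "$path: $!";
binmode $fh;

my $done = AnyEvent->condvar;

http_get "http://www.example.com/",
   on_body => sub {
      my ($chunk, $hdr) = @_;

      # print returns false on write errors, which aborts with status 598
      print $fh $chunk
   },
   sub {
      my ($body, $hdr) = @_;   # on success $body is "", on_body got the data

      close $fh;
      print "finished with status $hdr->{Status}\n";
      $done->send;
   };

$done->recv;   # run the event loop until the completion callback has fired
```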

            The "on_body" callback is usually preferred over doing your own
            body handling via "want_body_handle", but in case of streaming
            APIs, where HTTP is only used to create a connection,
            "want_body_handle" is the better alternative, as it allows you
            to install your own event handler, reducing resource usage.

        want_body_handle => $enable
            When enabled (default is disabled), the behaviour of
            AnyEvent::HTTP changes considerably: after parsing the headers,
            and instead of downloading the body (if any), the completion
            callback will be called. Instead of the $body argument
            containing the body data, the callback will receive the
            AnyEvent::Handle object associated with the connection. In error
            cases, "undef" will be passed. When there is no body (e.g.
            status 304), the empty string will be passed.

            The handle object might or might not be in TLS mode, might be
            connected to a proxy, be a persistent connection, use chunked
            transfer encoding etc., and configured in unspecified ways. The
            user is responsible for this handle (it will not be used by this
            module anymore).

            This is useful with some push-type services, where, after the
            initial headers, an interactive protocol is used (a typical
            example would be the push-style Twitter API which starts a
            JSON/XML stream).

            If you think you need this, first have a look at "on_body", to
            see if that doesn't solve your problem in a better way.

        persistent => $boolean
            Try to create/reuse a persistent connection. When this flag is
            set (default: true for idempotent requests, false for all
            others), then "http_request" tries to re-use an existing
            (previously-created) persistent connection to the host and,
            failing that, tries to create a new one.

            Requests failing in certain ways will be automatically retried
            once, which is dangerous for non-idempotent requests, which is
            why it defaults to off for them.
            The reason for this is that the bozos who designed HTTP/1.1
            made it impossible to distinguish between a fatal error and a
            normal connection timeout, so you never know whether there was a
            problem with your request or not.

            When reusing an existing connection, many parameters (such as
            TLS context) will be ignored. See the "session" parameter for a
            workaround.

        keepalive => $boolean
            Only used when "persistent" is also true. This parameter decides
            whether "http_request" tries to handshake an HTTP/1.0-style
            keep-alive connection (as opposed to only an HTTP/1.1 persistent
            connection).

            The default is true, except when using a proxy, in which case it
            defaults to false, as HTTP/1.0 proxies cannot support this in a
            meaningful way.

        handle_params => { key => value ... }
            The key-value pairs in this hash will be passed to any
            AnyEvent::Handle constructor that is called - not all requests
            will create a handle, and sometimes more than one is created, so
            this parameter is only good for setting hints.

            Example: set the maximum read size to 4096, to potentially
            conserve memory at the cost of speed.

               handle_params => {
                  max_read_size => 4096,
               },

        Example: do a simple HTTP GET request for http://www.nethype.de/ and
        print the response body.

           http_request GET => "http://www.nethype.de/", sub {
              my ($body, $hdr) = @_;
              print "$body\n";
           };

        Example: do an HTTP HEAD request on https://www.google.com/, using a
        timeout of 30 seconds.

           http_request
              HEAD    => "https://www.google.com",
              headers => { "user-agent" => "MySearchClient 1.0" },
              timeout => 30,
              sub {
                 my ($body, $hdr) = @_;
                 use Data::Dumper;
                 print Dumper $hdr;
              }
           ;

        Example: do another simple HTTP GET request, but immediately try to
        cancel it.

           my $request = http_request GET => "http://www.nethype.de/", sub {
              my ($body, $hdr) = @_;
              print "$body\n";
           };

           undef $request;

  DNS CACHING
    AnyEvent::HTTP uses the AnyEvent::Socket::tcp_connect function for the
    actual connection, which in turn uses AnyEvent::DNS to resolve
    hostnames. The latter is a simple stub resolver and does no caching on
    its own. If you want DNS caching, you currently have to provide your own
    default resolver (by storing a suitable resolver object in
    $AnyEvent::DNS::RESOLVER) or your own "tcp_connect" callback.

  GLOBAL FUNCTIONS AND VARIABLES
    AnyEvent::HTTP::set_proxy "proxy-url"
        Sets the default proxy server to use. The proxy-url must begin with
        a string of the form "http://host:port", croaks otherwise.

        To clear an already-set proxy, use "undef".

        When AnyEvent::HTTP is loaded for the first time it will query the
        default proxy from the operating system, currently by looking at
        "$ENV{http_proxy}".

    AnyEvent::HTTP::cookie_jar_expire $jar[, $session_end]
        Remove all cookies from the cookie jar that have expired. If
        $session_end is given and true, then additionally remove all session
        cookies.

        You should call this function (with a true $session_end) before you
        save cookies to disk, and you should call this function after
        loading them again. If you have a long-running program you can
        additionally call this function from time to time.

        A cookie jar is initially an empty hash-reference that is managed by
        this module. Its format is subject to change, but currently it is as
        follows:

        The key "version" has to contain 1, otherwise the hash gets emptied.
        All other keys are hostnames or IP addresses pointing to
        hash-references. The key for these inner hash references is the
        server path for which this cookie is meant, and the values are again
        hash-references.
        Each key of those hash-references is a cookie name,
        and the value, you guessed it, is another hash-reference, this time
        with the key-value pairs from the cookie, except for "expires" and
        "max-age", which have been replaced by a "_expires" key that
        contains the cookie expiry timestamp. Session cookies are indicated
        by not having an "_expires" key.

        Here is an example of a cookie jar with a single cookie, so you have
        a chance of understanding the above paragraph:

           {
              version    => 1,
              "10.0.0.1" => {
                 "/" => {
                    "mythweb_id" => {
                       _expires => 1293917923,
                       value    => "ooRung9dThee3ooyXooM1Ohm",
                    },
                 },
              },
           }

    $date = AnyEvent::HTTP::format_date $timestamp
        Takes a POSIX timestamp (seconds since the epoch) and formats it as
        a HTTP Date (RFC 2616).

    $timestamp = AnyEvent::HTTP::parse_date $date
        Takes a HTTP Date (RFC 2616) or a Cookie date (netscape cookie spec)
        or a bunch of minor variations of those, and returns the
        corresponding POSIX timestamp, or "undef" if the date cannot be
        parsed.

    $AnyEvent::HTTP::MAX_RECURSE
        The default value for the "recurse" request parameter (default: 10).

    $AnyEvent::HTTP::TIMEOUT
        The default timeout for connection operations (default: 300).

    $AnyEvent::HTTP::USERAGENT
        The default value for the "User-Agent" header (the default is
        "Mozilla/5.0 (compatible; U; AnyEvent-HTTP/$VERSION;
        +http://software.schmorp.de/pkg/AnyEvent)").

    $AnyEvent::HTTP::MAX_PER_HOST
        The maximum number of concurrent connections to the same host
        (identified by the hostname). If the limit is exceeded, then
        additional requests are queued until previous connections are
        closed. Both persistent and non-persistent connections are counted
        in this limit.

        The default value for this is 4, and it is highly advisable to not
        increase it much.

        For comparison: the RFCs recommend 4 non-persistent or 2 persistent
        connections, older browsers used 2, newer ones (such as Firefox 3)
        typically use 6, and Opera uses 8 because like, they have the
        fastest browser and give a shit for everybody else on the planet.

    $AnyEvent::HTTP::PERSISTENT_TIMEOUT
        The time after which idle persistent connections get closed by
        AnyEvent::HTTP (default: 3).

    $AnyEvent::HTTP::ACTIVE
        The number of active connections. This is not the number of
        currently running requests, but the number of currently open and
        non-idle TCP connections. This number can be useful for
        load-leveling.

  SHOWCASE
    This section contains some more elaborate "real-world" examples or code
    snippets.

  HTTP/1.1 FILE DOWNLOAD
    Downloading files with HTTP can be quite tricky, especially when
    something goes wrong and you want to resume.

    Here is a function that initiates and resumes a download. It uses the
    last modified time to check for file content changes, and works with
    many HTTP/1.0 servers as well, and usually falls back to a complete
    re-download on older servers.

    It calls the completion callback with "undef" when a nonretryable error
    occurred, 0 when the download was partial and should be retried, and 1
    when it was successful.

       use AnyEvent::HTTP;

       sub download($$$) {
          my ($url, $file, $cb) = @_;

          open my $fh, "+<", $file
             or die "$file: $!";

          my %hdr;
          my $ofs = 0;

          warn stat $fh;
          warn -s _;
          if (stat $fh and -s _) {
             # resume: request only the missing bytes, and only if the
             # file on the server has not changed since our last write
             $ofs = -s _;
             warn "-s is ", $ofs;
             $hdr{"if-unmodified-since"} = AnyEvent::HTTP::format_date +(stat _)[9];
             $hdr{"range"} = "bytes=$ofs-";
          }

          http_get $url,
             headers   => \%hdr,
             on_header => sub {
                my ($hdr) = @_;

                if ($hdr->{Status} == 200 && $ofs) {
                   # resume failed
                   truncate $fh, $ofs = 0;
                }

                sysseek $fh, $ofs, 0;

                1
             },
             on_body   => sub {
                my ($data, $hdr) = @_;

                if ($hdr->{Status} =~ /^2/) {
                   length $data == syswrite $fh, $data
                      or return; # abort on write errors
                }

                1
             },
             sub {
                my (undef, $hdr) = @_;

                my $status = $hdr->{Status};

                if (my $time = AnyEvent::HTTP::parse_date $hdr->{"last-modified"}) {
                   # utime takes the access and modification time first
                   utime $time, $time, $fh;
                }

                if ($status == 200 || $status == 206 || $status == 416) {
                   # download ok || resume ok || file already fully downloaded
                   $cb->(1, $hdr);

                } elsif ($status == 412) {
                   # file has changed while resuming, delete and retry
                   unlink $file;
                   $cb->(0, $hdr);

                } elsif ($status == 500 or $status == 503 or $status =~ /^59/) {
                   # retry later
                   $cb->(0, $hdr);

                } else {
                   $cb->(undef, $hdr);
                }
             }
          ;
       }

       download "http://server/somelargefile", "/tmp/somelargefile", sub {
          if ($_[0]) {
             print "OK!\n";
          } elsif (defined $_[0]) {
             print "please retry later\n";
          } else {
             print "ERROR\n";
          }
       };

  SOCKS PROXIES
    Socks proxies are not directly supported by AnyEvent::HTTP. You can
    compile your perl to support socks, or use an external program such as
    socksify (dante) or tsocks to make your program use a socks proxy
    transparently.

    Alternatively, for AnyEvent::HTTP only, you can use your own
    "tcp_connect" function that does the proxy handshake - here is an
    example that works with socks4a proxies:

       use Errno;
       use AnyEvent::Util;
       use AnyEvent::Socket;
       use AnyEvent::Handle;

       # host, port and username of/for your socks4a proxy
       my $socks_host = "10.0.0.23";
       my $socks_port = 9050;
       my $socks_user = "";

       sub socks4a_connect {
          my ($host, $port, $connect_cb, $prepare_cb) = @_;

          my $hdl = new AnyEvent::Handle
             connect    => [$socks_host, $socks_port],
             on_prepare => sub { $prepare_cb->($_[0]{fh}) },
             on_error   => sub { $connect_cb->() },
          ;

          $hdl->push_write (pack "CCnNZ*Z*", 4, 1, $port, 1, $socks_user, $host);

          $hdl->push_read (chunk => 8, sub {
             my ($hdl, $chunk) = @_;
             my ($status, $port, $ipn) = unpack "xCna4", $chunk;

             if ($status == 0x5a) {
                $connect_cb->($hdl->{fh}, (format_address $ipn) . ":$port");
             } else {
                $! = Errno::ENXIO; $connect_cb->();
             }
          });

          $hdl
       }

    Use "socks4a_connect" instead of "tcp_connect" when doing
    "http_request"s, possibly after switching off other proxy types:

       AnyEvent::HTTP::set_proxy undef; # usually you do not want other proxies

       http_get 'http://www.google.com', tcp_connect => \&socks4a_connect, sub {
          my ($data, $headers) = @_;
          ...
       };

SEE ALSO
    AnyEvent.

AUTHOR
       Marc Lehmann <schmorp@schmorp.de>
       http://home.schmorp.de/

    With many thanks to Дмитрий Шалашов, who provided countless testcases
    and bugreports.