=head1 NAME

Gungho::Manual::FAQ - Gungho FAQ

=head1 Q. "Why Did You Call It Gungho"?

It rhymes with Xango, which is its predecessor.

=head1 Q. "I don't understand the notation of the config"

To keep the documentation concise, configuration keys are written in dotted
form, such as C<engine.module = POE>. Each dot-separated level is a key in a
nested hash, so that example translates to a config like

  my $config = {
    engine => {
      module => "POE"
    }
  };

Or, in YAML:

  engine:
    module: POE

=head1 Q. "My requests are being served slowly. What can I do?"

A number of things can affect fetch speed.

=head2 Is Gungho The Right Crawler For Your Data Set?

Gungho uses an asynchronous engine, and with POE::Component::Client::Keepalive
it reuses connections to the same host.

This kind of setup works well if you are accessing a lot of different hosts,
but it can easily jam up if you are accessing, for example, a single host.
For such datasets, Gungho will be no more effective than a simple script
that makes repeated calls to LWP::UserAgent-E<gt>get().

=head2 Choosing The Right loop_delay With Gungho::Engine::POE

C<engine.config.loop_delay> specifies the number of seconds to wait between
each "loop" iteration.

A single loop in Gungho basically (1) checks whether the provider has a
request, and (2) attempts to fetch that request.

Therefore, if you set loop_delay too low, Gungho will spend most of its time
asking the provider for requests instead of actually fetching them.

On the other hand, if you set it too high, pending requests will sit idle
until the next iteration, when Gungho finally notices them.

There is no generally "right" value for this parameter, because it largely
depends on the kind of dataset you are working with. As a rule of thumb,
start with something sane like 5 seconds, then check your logs to see
whether that value is acceptable.

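As an illustration, the rule-of-thumb value could be written in YAML as
follows. This is only a sketch: it assumes the key nests under engine.config
in the same way as the other examples in this document, and the value 5 is a
starting point to tune, not a recommendation for every dataset.

  engine:
    module: POE
    config:
      loop_delay: 5
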
=head2 Workaround For Jamming Up PoCo::Client::HTTP

When using Gungho::Engine::POE, POE::Component::Client::HTTP is used internally
to do the actual fetching. Its performance is usually great, but once you
reach a certain number of requests, it starts to jam up and all of a sudden
your requests are served very slowly.

This is a per-session limit, so Gungho tries to work around the problem by
spawning multiple sessions of POE::Component::Client::HTTP.

If you believe this is the cause of your problem, try setting
C<engine.config.spawn> to a higher value (the default is 2).

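For example, a configuration that raises the number of sessions might look
like this in YAML. The value 4 is an arbitrary illustration rather than a
recommendation, and the nesting assumes that the dotted key maps onto hash
levels as described earlier in this document.

  engine:
    module: POE
    config:
      spawn: 4
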
Do note, however, that enqueueing an excessive number of requests is going to
be a problem regardless. You should at least keep track of how many requests
you are sending to the POE queue, and throttle as necessary.

=head2 Considerations When Using A Proxy With Gungho::Engine::POE

Proxies are great and can be used in crawler applications, but by default
they don't play nicely with Gungho's POE engine.

The short version of the remedy is: set C<engine.config.keepalive.keep_alive>
to 0.

  engine:
    module: POE
    config:
      keepalive:
        keep_alive: 0

Now the long explanation. Gungho::Engine::POE, and the
POE::Component::Client::HTTP module it uses internally, rely on a module called
POE::Component::Client::Keepalive to manage connections and, where possible,
reuse an already established connection. However, when using a proxy, all
requests go through that proxy, so PoCo::Client::Keepalive will try to reuse
the same connection for all of the requests.

This is obviously a problem, because the entire request set ends up going
through the same connection, and you therefore lose all parallelism.

To work around this problem, you need to disable connection reuse in
POE::Component::Client::Keepalive, hence the configuration above.

=cut