Module grab.spider

class grab.spider.base.Spider(thread_number=None, network_try_limit=None, task_try_limit=None, request_pause=<object object>, priority_mode='random', meta=None, only_cache=False, config=None, args=None, taskq=None, network_result_queue=None, parser_result_queue=None, is_parser_idle=None, shutdown_event=None, mp_mode=False, parser_pool_size=None, parser_mode=False, parser_requests_per_process=10000, http_api_port=None, transport='multicurl', grab_transport='pycurl')[source]

Asynchronous scraping framework.
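
A typical usage pattern is to subclass Spider, list start URLs in initial_urls and define a handler method named task_<name> for each task name. A minimal sketch (the class name and URL are illustrative):

from grab.spider import Spider, Task

class ExampleSpider(Spider):
    initial_urls = ['http://example.com/']

    def task_initial(self, grab, task):
        # Handler for tasks named 'initial' (generated from
        # initial_urls); `grab` holds the response, `task` the task.
        print(task.url)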

add_task(task, raise_error=False)[source]

Add task to the task queue.
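
For example, a handler can enqueue a follow-up request with a new Task (a sketch; the task name and URL are illustrative):

self.add_task(Task('comments', url='http://example.com/comments'))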

check_task_limits(task)[source]

Check that the task's network and try counters do not exceed the limits.

Returns:
  • if success: (True, None)
  • if error: (False, reason)

is_valid_network_response_code(code, task)[source]

Decide whether the response can be handled by the usual task handler, or whether the task has failed and should be processed as an error.

load_proxylist(source, source_type=None, proxy_type='http', auto_init=True, auto_change=True)[source]

Load proxy list.

Parameters:
  • source – Proxy source. Accepts string (file path, url) or BaseProxySource instance.
  • source_type – The type of the specified source. Should be one of the following: ‘text_file’ or ‘url’.
  • proxy_type – Should be one of the following: ‘socks4’, ‘socks5’ or ‘http’.
  • auto_change – If set to True then automatic random proxy rotation will be used.
Proxy source format should be one of the following (for each line):
  • ip:port
  • ip:port:login:password
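
Example (a sketch, assuming a local file proxies.txt with one ip:port entry per line):

bot = ExampleSpider()
bot.load_proxylist('proxies.txt', 'text_file', proxy_type='http')
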
prepare()[source]

You can do additional spider customization here before it starts working. Simply redefine this method in your Spider class.
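
For instance, prepare() is a convenient place to open resources that task handlers will use later (a sketch; the file name is illustrative):

from grab.spider import Spider

class ExampleSpider(Spider):
    def prepare(self):
        # Called once before the spider starts working.
        self.result_file = open('result.txt', 'w')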

prepare_parser()[source]

You can do additional spider customization here before it starts working. Simply redefine this method in your Spider class.

This method is called only when the Spider works in parser mode, which, in turn, is spawned automatically by the main spider process working in multiprocess mode.

process_grab_proxy(task, grab)[source]

Assign a new proxy from the proxy list to the task.

process_handler_result(result, task)[source]

Process result received from the task handler.

The result could be one of the following:

  • None
  • Task instance
  • Data instance
  • dict: {type: “stat”, counters: [], collections: []}
  • ResponseNotValid-based exception
  • arbitrary exception
process_next_page(grab, task, xpath, resolve_base=False, **kwargs)[source]

Generate task for next page.

Parameters:
  • grab – Grab instance
  • task – Task object which should be assigned to the next page URL
  • xpath – XPath expression which selects the list of URLs
  • **kwargs – extra settings for the new task object

Example:

self.process_next_page(grab, task, '//div[@class="topic"]/a/@href')
run()[source]

Main method. All work is done here.
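
Typical entry point, reusing the ExampleSpider sketch above:

bot = ExampleSpider(thread_number=2)
bot.run()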

run_parser()[source]

Main work cycle of a spider process working in parser mode.

setup_cache(backend='mongo', database=None, use_compression=True, **kwargs)[source]

Setup cache.

Parameters:
  • backend – Backend name. Should be one of the following: ‘mongo’, ‘mysql’ or ‘postgresql’.
  • database – Database name.
  • kwargs – Additional credentials for backend.
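
Example (a sketch, assuming a MongoDB server running locally; the database name is illustrative):

bot.setup_cache(backend='mongo', database='example_cache')
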
setup_queue(backend='memory', **kwargs)[source]

Setup queue.

Parameters:
  • backend – Backend name. Should be one of the following: ‘memory’, ‘redis’ or ‘mongo’.
  • kwargs – Additional credentials for backend.
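
Example (a sketch; the in-memory backend needs no credentials, while the mongo backend's keyword arguments are assumed to match the underlying driver):

bot.setup_queue(backend='memory')
# or, assuming a local MongoDB server:
bot.setup_queue(backend='mongo', database='example_queue')
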
shutdown()[source]

You can override this method to do some final actions after parsing has been done.
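
For example, to release resources opened in prepare() (continuing the illustrative result_file sketch above):

from grab.spider import Spider

class ExampleSpider(Spider):
    def shutdown(self):
        # Called once after all parsing work is done.
        self.result_file.close()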

start_task_generators()[source]

Process the self.initial_urls list and the self.task_generator method.

stop()[source]

This method sets an internal flag which signals the spider to stop processing new tasks and shut down.

task_generator()[source]

You can override this method to load new tasks smoothly.

It is called each time the number of tasks in the task queue drops below the number of threads multiplied by 2. This allows you to avoid overloading memory when the total number of tasks is big.
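
A sketch of a generator that yields tasks lazily (the file name and task name are illustrative):

from grab.spider import Spider, Task

class ExampleSpider(Spider):
    def task_generator(self):
        # Consumed lazily: the spider asks for more tasks only when
        # the queue holds fewer than thread_number * 2 of them.
        with open('urls.txt') as inp:
            for line in inp:
                yield Task('page', url=line.strip())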

task_generator_thread_wrapper(task_generator)[source]

Load new tasks from self.task_generator_object and create new Task objects.

If the task queue size is less than a threshold value, new tasks are loaded from the task generator.

update_grab_instance(grab)[source]

Use this method to automatically update the config of any Grab instance created by the spider.
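
A sketch, assuming you want every request to share the same timeout and user agent (the option names follow Grab's setup() config keys):

from grab.spider import Spider

class ExampleSpider(Spider):
    def update_grab_instance(self, grab):
        # Applied to each Grab instance the spider creates.
        grab.setup(timeout=30, user_agent='example-bot')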