Configuration

Microwler projects can be written in a single Python module using a declarative approach. By default, the crawler visits every qualified link it can find and retrieves the corresponding page content. Of course, you can customize and extend this behaviour. Here's an example:

from microwler import Microwler, scrape, export

selectors = {
    # Using custom XPath expressions:
    'field': '//some/xpath',
    # Using built-in selectors:
    'title': scrape.title,
    # Using Parsel selectors:
    'complex': lambda dom: dom.css('img').xpath('./@src').getall()
}

settings = {
    'exporters': [export.JSONExporter],
    'max_depth': 5
}

def transformer(data: dict):
    # Do something here, e.g. modify or add fields defined in selectors
    return data

crawler = Microwler(
    'START_URL',
    select=selectors,
    transform=transformer,
    settings=settings
)
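
Once instantiated, the crawler can be started from the same module. Here's a minimal sketch, assuming the crawler exposes a run() method and that configured exporters are applied automatically after the crawl (the verbose flag is an assumption for illustration):

if __name__ == '__main__':
    # Start crawling from START_URL; the exporters configured
    # in settings are assumed to handle the results afterwards
    crawler.run(verbose=True)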

Create a new project based on this template using the CLI:

new <PROJECT_NAME> <START_URL>

You'll learn more about selectors and transformers in the next chapters. For now, let's focus on the settings of the crawler itself.

Settings

You can change your crawler's behaviour using the settings parameter. It holds configuration for the crawler itself as well as for export handling.

| Setting | Default | Description |
| --- | --- | --- |
| link_filter | //a/@href | XPath expression for link extraction, e.g. //a[contains(@href, 'blog')]/@href |
| max_depth | 10 | The depth limit at which the crawler stops following links |
| max_concurrency | 20 | Maximum number of concurrent requests |
| dns_providers | ['1.1.1.1', '8.8.8.8'] | DNS server addresses, e.g. Cloudflare or Google |
| language | 'en-us' | Value sent in the Accept-Language header |
| caching | False | Persist results using diskcache |
| delta_crawl | False | Drop URLs that have been seen in earlier runs |
| export_to | ${CWD}/projects | The folder in which exported data files are saved |
| exporters | [] | A list of export plugins inheriting from microwler.export.BaseExporter |
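
To make this concrete, here's a sketch of a customized settings dictionary combining several of the options above; all keys come from the table, but the values are illustrative:

from microwler import export

settings = {
    'link_filter': "//a[contains(@href, 'blog')]/@href",  # only follow blog links
    'max_depth': 3,                                       # stop crawling at depth 3
    'max_concurrency': 10,                                # at most 10 concurrent requests
    'language': 'de-de',                                  # sent in the Accept-Language header
    'caching': True,                                      # persist results using diskcache
    'delta_crawl': True,                                  # drop URLs seen in earlier runs
    'exporters': [export.JSONExporter]                    # export results as JSON
}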