Configuration

Microwler projects can be written in a single Python module using a declarative approach. By default, the crawler visits every qualified link it can find and retrieves the corresponding page content. Of course, you can customize and extend this behaviour. Here's an example:

from microwler import Microwler, scrape, export

selectors = {
    # Using custom XPath expressions:
    'field': '//some/xpath',
    # Using built-in selectors:
    'title': scrape.title,
    # Using Parsel selectors:
    'complex': lambda dom: dom.css('img').xpath('./@src').getall()
}

settings = {
    'exporters': [export.JSONExporter],
    'max_depth': 5
}

def transformer(data: dict):
    # Do something here, e.g. modify or add fields defined in selectors
    return data

crawler = Microwler(
    'START_URL',
    select=selectors,
    transform=transformer,
    settings=settings
)
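
Once instantiated, the crawler can be started from the same module. Here's a minimal sketch, assuming the crawler exposes a run() method and that configured exporters are applied automatically after the crawl (the verbose flag is an assumption for illustration):

if __name__ == '__main__':
    # Start crawling from START_URL; the exporters configured
    # in settings are assumed to handle the results afterwards
    crawler.run(verbose=True)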

Create a new project based on this template using the CLI:

new <PROJECT_NAME> <START_URL>

You'll learn more about selectors and transformers in the next chapters. For now, let's focus on the settings of the crawler itself.

Settings

You can change your crawler's behaviour using the settings parameter. It holds configuration for the crawler itself as well as for export handling.

| Setting | Default | Description |
| --- | --- | --- |
| link_filter | //a/@href | XPath expression for link extraction, e.g. //a[contains(@href, 'blog')]/@href |
| max_depth | 10 | The depth limit at which the crawler stops following links |
| max_concurrency | 20 | Maximum number of concurrent requests |
| dns_providers | ['1.1.1.1', '8.8.8.8'] | DNS server addresses, e.g. Cloudflare or Google |
| language | 'en-us' | Value sent in the Accept-Language header |
| caching | False | Persist results using diskcache |
| delta_crawl | False | Drop URLs that have been seen in earlier runs |
| export_to | ${CWD}/projects | The folder in which exported data files are saved |
| exporters | [] | A list of export plugins inheriting from microwler.export.BaseExporter |
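
To make this concrete, here's a sketch of a customized settings dictionary combining several of the options above; all keys come from the table, but the values are illustrative:

from microwler import export

settings = {
    'link_filter': "//a[contains(@href, 'blog')]/@href",  # only follow blog links
    'max_depth': 3,                                       # stop crawling at depth 3
    'max_concurrency': 10,                                # at most 10 concurrent requests
    'language': 'de-de',                                  # sent in the Accept-Language header
    'caching': True,                                      # persist results using diskcache
    'delta_crawl': True,                                  # drop URLs seen in earlier runs
    'exporters': [export.JSONExporter]                    # export results as JSON
}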