Spider tutorial

<spider> ( base -- spider )

To create a new spider, call the <spider> word with a link to the site you wish to spider.
"http://concatenative.org" <spider>

The max-depth is initialized to 0, which retrieves just the initial page. Let's initialize it to something more fun:
1 >>max-depth

Now the spider will retrieve the first page and all the pages it links to in the same domain.

But suppose the front page contains thousands of links. To avoid grabbing them all, we can set max-count to a reasonable limit.
10 >>max-count

A delay between requests keeps the spider from hitting the server too hard:
USE: calendar
1.5 seconds >>sleep

Since we happen to know that not all pages of a wiki are suitable for spidering, we will spider only the wiki view pages, not the edit or revisions pages. To do this, we add a filter through which new links are tested; links that pass the filter are added to the todo queue, while links that do not are discarded. You can add several filters to the filter array, but we'll just add a single one for now.
{ [ path>> "/wiki/view" head? ] } >>filters
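Each filter is a quotation that receives a link (a url object) and returns a boolean; a link must pass every filter to be queued. As a sketch of using more than one filter, here is a hypothetical second quotation that also rejects a specific page (the "Front-Page" name is made up for illustration):
{ [ path>> "/wiki/view" head? ] [ path>> "Front-Page" tail? not ] } >>filters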

Finally, to start the spider, call the run-spider word.
run-spider

The full code from the tutorial:
USING: spider calendar sequences accessors ;
: spider-concatenative ( -- spider )
    "http://concatenative.org" <spider>
        1 >>max-depth
        10 >>max-count
        1.5 seconds >>sleep
        { [ path>> "/wiki/view" head? ] } >>filters
    run-spider ;
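Since spider-concatenative has stack effect ( -- spider ), run-spider leaves the spider tuple on the stack, and you can print it in the listener to inspect what was fetched:
spider-concatenative .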