What is it?

The SiteCrawl API allows you to do most things you can do with your SiteCrawl account. The available methods are listed below. All responses are in JSON. The API is rate limited to 1 request per 3 seconds; if you need a faster rate, please contact me.

Access

In order to use the API, you must have a SiteCrawl account. You can then find your API key in the 'Account' area, under the 'API & Tools' section. Each request must send the key in a request header called API_KEY.

PHP example

$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array('API_KEY: your_key_here'));
curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/tasks");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //return the response rather than printing it
$response = curl_exec($ch);
curl_close($ch);

Add task

The url for adding a task is http://sitecrawl.net/api/add_task. To add a task, POST the following fields. This will return an id on success, and an error message on failure. The id can be used for task information and reporting.

Request parameters

  • name : Name of the task.
  • start_url : Url you want the crawl to start from. It won't go below this folder.
  • max_depth : How many links to crawl away from the start url.
  • max_links : (optional) Maximum number of links to crawl.
  • crawl_no_follow : (optional, defaults to true) Whether the crawler should crawl nofollow links (either 0 for false, or 1 for true).
  • speed : (optional, defaults to slow) Speed of the crawl. This only works if you are the owner of the domain (see the account area).

PHP examples

//Create a crawl, checking the whole site
$ch = curl_init();
$post_fields = array('name' => 'SiteCrawl', 'start_url' => 'http://sitecrawl.net/', 'max_depth' => 5);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('API_KEY: your_key_here'));
curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/add_task");
curl_exec($ch);
curl_close($ch);

//Create a crawl, checking a single page's links
$ch = curl_init();
$post_fields = array('name' => 'SiteCrawl', 'start_url' => 'http://sitecrawl.net/', 'max_depth' => 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('API_KEY: your_key_here'));
curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/add_task");
curl_exec($ch);
curl_close($ch);

//Create a crawl, 100 links only, don't crawl nofollow links
$ch = curl_init();
$post_fields = array('name' => 'SiteCrawl', 'start_url' => 'http://sitecrawl.net/', 'max_depth' => 5, 'max_links' => 100, 'crawl_no_follow' => 0);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('API_KEY: your_key_here'));
curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/add_task");
curl_exec($ch);
curl_close($ch);
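
The examples above print the raw response without inspecting it. A minimal sketch of handling the result — note the field names 'id' and 'error' in the decoded body are assumptions, not documented above, so check the actual response:

```php
//Hypothetical response handling: the field names 'id' and 'error' are assumptions
function parse_add_task_response($body)
{
    $json = json_decode($body, true);
    if (is_array($json) && isset($json['id'])) {
        return array('ok' => true, 'id' => (string) $json['id']);
    }
    $error = is_array($json) && isset($json['error']) ? $json['error'] : 'unknown';
    return array('ok' => false, 'error' => $error);
}

//Usage, replacing curl_exec($ch) in the examples above:
//curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//$result = parse_add_task_response(curl_exec($ch));
//if ($result['ok']) { echo "Task id: {$result['id']}\n"; }
```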

Getting tasks

You can access your tasks by calling the url http://sitecrawl.net/api/tasks. This will return up to 25 tasks; if you have more than this, simply specify an offset of 25 to get the next 25. The offset is appended to the end of the url like so: http://sitecrawl.net/api/tasks/25. See the code under Access for an example. This call returns an array of the crawls in the user's account.
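
A sketch of paging through every task using that offset. The helpers here are not part of the API, and the loop assumes each response decodes to a JSON array of tasks:

```php
//Build the paginated tasks url: offset 0 gives /api/tasks, 25 gives /api/tasks/25
function tasks_url($offset)
{
    return $offset > 0 ? "http://sitecrawl.net/api/tasks/{$offset}" : "http://sitecrawl.net/api/tasks";
}

//Fetch every task by walking the 25-per-page offsets
function fetch_all_tasks($api_key)
{
    $all = array();
    for ($offset = 0; ; $offset += 25) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_HTTPHEADER, array("API_KEY: {$api_key}"));
        curl_setopt($ch, CURLOPT_URL, tasks_url($offset));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $page = json_decode(curl_exec($ch), true);
        curl_close($ch);
        if (!is_array($page) || count($page) === 0) {
            break;
        }
        $all = array_merge($all, $page);
        if (count($page) < 25) {
            break; //last page
        }
        sleep(3); //respect the 1-request-per-3-seconds rate limit
    }
    return $all;
}
```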

Response parameters

  • id : The task id, this can be used with the task and task_query functions.
  • created : Unix timestamp of when the crawl was created
  • name : Name given by the user for the crawl
  • start_url : Start url given by the user
  • status : Current status of the crawl. Pending, InProgress, Failed or Complete
  • crawled_count : How many urls have been crawled so far
  • total_links : Unique links found to be crawled
  • response_ok : How many urls crawled which had a good response code
  • response_bad : How many urls crawled which had a bad response code
  • blocked_bots : How many urls were blocked by robots
  • redirects : How many urls which redirected to another page
  • nofollow : How many links contained a 'nofollow' attribute

Task information

This returns the same information as the /tasks/ call, however it's limited to just the task_id that you pass. This means it's faster to look up, and you don't need to iterate over the response. Simply call http://sitecrawl.net/api/task/12345, where the number on the end is your task id.
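
For example (the task id 12345 is a placeholder; the status values are those listed under the /tasks response parameters above):

```php
//Fetch a single task and decode it (assumes the response is a JSON object)
function get_task($api_key, $task_id)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HTTPHEADER, array("API_KEY: {$api_key}"));
    curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/task/{$task_id}");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $task = json_decode(curl_exec($ch), true);
    curl_close($ch);
    return $task;
}

//Usage: poll until the crawl finishes
//$task = get_task('your_key_here', 12345);
//if ($task['status'] === 'Complete') { /* fetch the report */ }
```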

Task search / report

The task search contains 3 pre-made reports, or you can create your own search based on a number of parameters. All of the calls start at http://sitecrawl.net/api/task_report/12345, where the number on the end is the task id. All of the requests will return an array of urls.

You can then append /broken-external, /broken-internal or /blocked-robots to get results for those predefined reports. If you don't specify one of those, then it will return all the urls, which you can filter by POSTing special variables to the url. The filters below only work if you haven't specified a pre-defined report.

Search filter parameters

  • domain : Value for the domain field
  • domain_mod : How to use the domain value (is, is_not, contains, doesnt_contain)
  • upath : Value for the path field
  • upath_mod : How to use the upath value (is, is_not, contains, doesnt_contain)
  • status_code : Value for the status code
  • status_code_mod : How to use the status_code value (is, is_not, greater_than, less_than)
  • response_time : Value for the response time of the url
  • response_time_mod : How to use the response_time value (is, is_not, greater_than, less_than)
  • anchor_text_to : Value for the anchor text to the url
  • anchor_text_to_mod : How to use the anchor text value (is, is_not, contains, doesnt_contain)
  • is_nofollow : Returns values which have a nofollow attribute pointing to them
  • is_nofollow_mod : How to use the is_nofollow value (0, 1)

Response fields

  • id : Internal use
  • url_id : id of the url, used for fetching more information about the links to/from
  • scheme : What scheme the url is, http/https/ftp etc
  • hostname : Domain portion of the url
  • path : Path of the url, this will include any query strings
  • status_code : Status code the page returned on crawl. Urls blocked by robots will be 1
  • request_time : How long in seconds it took to fetch the url. Includes dns lookup time
  • links_in : How many links have been found pointing to this page
  • links_out : How many urls were found on this page
  • url : scheme, hostname and path put together with / as needed

PHP examples

//Fetch all the broken internal links
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array('API_KEY: your_key_here'));
curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/task_report/task_id_here/broken-internal");
curl_exec($ch);
curl_close($ch);

//Fetch all links with anchor text of 'click here'
$ch = curl_init();
$post_fields = array('anchor_text_to' => 'click here', 'anchor_text_to_mod' => 'is');
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('API_KEY: your_key_here'));
curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/task_report/task_id_here");
curl_exec($ch);
curl_close($ch);

//Fetch all links which are 404 and are in the forum
$ch = curl_init();
$post_fields = array('status_code' => '404', 'status_code_mod' => 'is', 'upath' => '/forum', 'upath_mod' => 'contains');
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('API_KEY: your_key_here'));
curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/task_report/task_id_here");
curl_exec($ch);
curl_close($ch);

Linking report

With this call you can see which urls link to or from a page, along with the anchor text used in each link. The url is http://sitecrawl.net/api/linking_report/direction/12345/7890, where 12345 is the url id and 7890 is the task id. The direction can be either to or from. Results are returned in sets of 10, but you can specify an offset on the end of the url to get the next 10.

Response fields

  • url_id : id of the url that is linking to/from the requested url
  • scheme : What scheme the url is, http/https/ftp etc
  • hostname : Domain portion of the url
  • path : Path of the url, this will include any query strings
  • is_nofollow : If the url has a nofollow attribute. Either 0 (false) or 1 (true)
  • url : scheme, hostname and path put together with / as needed

PHP example

//Fetch all links which are being linked from a particular page
//Note: get the next 10 by changing the url to http://sitecrawl.net/api/linking_report/from/{$url_id}/{$task_id}/10
$ch = curl_init();
curl_setopt($ch, CURLOPT_HTTPHEADER, array('API_KEY: your_key_here'));
$task_id = 1234;
$url_id = 987;
curl_setopt($ch, CURLOPT_URL, "http://sitecrawl.net/api/linking_report/from/{$url_id}/{$task_id}");
curl_exec($ch);
curl_close($ch);
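
The url building above can be wrapped in a small helper (a sketch; this helper is not part of the API):

```php
//Build the linking report url; $direction is 'to' or 'from',
//$offset is appended only when non-zero (results come in sets of 10)
function linking_report_url($direction, $url_id, $task_id, $offset = 0)
{
    $url = "http://sitecrawl.net/api/linking_report/{$direction}/{$url_id}/{$task_id}";
    return $offset > 0 ? "{$url}/{$offset}" : $url;
}

//First page, then the next 10:
//curl_setopt($ch, CURLOPT_URL, linking_report_url('from', 987, 1234));
//curl_setopt($ch, CURLOPT_URL, linking_report_url('from', 987, 1234, 10));
```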