Tools: Crawl v1

The crawl API endpoint allows you to extract content from web pages. This is particularly useful when you need to gather information from websites for your AI applications.

The crawling functionality is powered by Spider.cloud.


Limitations

  • Maximum number of URLs per request: 100
  • Maximum crawl depth (pages): 50
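Since a single request accepts at most 100 URLs, larger lists need to be split client-side into multiple requests. A minimal sketch (the `batchUrls` helper is illustrative, not part of the SDK):

```typescript
// Split a URL list into batches that respect the 100-URL request limit.
const MAX_URLS_PER_REQUEST = 100;

function batchUrls(urls: string[], size = MAX_URLS_PER_REQUEST): string[][] {
	const batches: string[][] = [];
	for (let i = 0; i < urls.length; i += size) {
		batches.push(urls.slice(i, i + size));
	}
	return batches;
}

console.log(batchUrls(['a', 'b', 'c'], 2)); // two batches: ['a', 'b'] and ['c']
```

Each batch can then be passed as the `url` array of a separate crawl request.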

Prerequisites

  1. Langbase API Key: Generate your API key from the User/Org API key documentation.
  2. Spider.cloud API Key: Sign up at Spider.cloud to get your crawler API key.


POST /v1/tools/crawl

Crawl web pages

Extract content from web pages by sending URLs to the crawl API endpoint.

Headers

  • Content-Type (string, required)

    Request content type. Needs to be application/json.

  • Authorization (string, required)

    Replace <YOUR_API_KEY> with your user/org API key.

  • LB-CRAWL-KEY (string, required)

    Your Spider.cloud API key. Sign up at Spider.cloud to obtain one, then replace YOUR_SPIDER_CLOUD_API_KEY with your key.


Request Body

  • url (string[], required)

    An array of URLs to crawl. Each URL must be a valid web address. Maximum 100 URLs per request.

  • maxPages (number)

    The maximum number of pages to crawl. This limits the depth of the crawl operation. Maximum value: 50.
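For clients not using the SDK, the same call can be made over raw HTTP with the headers and body documented above. A minimal sketch with `fetch` — the base URL and Bearer authorization scheme are assumptions, not taken from this page:

```typescript
// Hypothetical raw-HTTP version of the crawl call.
// Assumptions: base URL https://api.langbase.com and Bearer auth.
const endpoint = 'https://api.langbase.com/v1/tools/crawl';

const requestInit = {
	method: 'POST',
	headers: {
		'Content-Type': 'application/json',
		Authorization: `Bearer ${process.env.LANGBASE_API_KEY}`,
		'LB-CRAWL-KEY': process.env.CRAWL_KEY ?? '',
	},
	// Request body fields documented above: url (string[]) and maxPages (number).
	body: JSON.stringify({
		url: ['https://example.com'],
		maxPages: 5,
	}),
};

// Only hit the network when both keys are configured.
if (process.env.LANGBASE_API_KEY && process.env.CRAWL_KEY) {
	fetch(endpoint, requestInit)
		.then(res => res.json())
		.then(data => console.log('Crawled content:', data));
}
```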

Usage example

Install the SDK

npm i langbase

Environment variables

.env file

LANGBASE_API_KEY="<USER/ORG-API-KEY>"
CRAWL_KEY="<SPIDER-CLOUD-API-KEY>"

Crawl web pages

POST /v1/tools/crawl
import { Langbase } from 'langbase';

const langbase = new Langbase({
	apiKey: process.env.LANGBASE_API_KEY!,
});

async function main() {
	const results = await langbase.tools.crawl({
		url: ['https://example.com'],
		apiKey: process.env.CRAWL_KEY!, // Spider.cloud API key
		maxPages: 5
	});

	console.log('Crawled content:', results);
}

main();

Response

  • ToolCrawlResponse[] (Array<object>)

    An array of objects containing the URL and the extracted content returned by the crawl operation.

    Crawl API Response

    interface ToolCrawlResponse {
      url: string;
      content: string;
    }
    
    type CrawlResponse = ToolCrawlResponse[];
    
    • url (string)

      The URL of the crawled page.

    • content (string)

      The extracted content from the crawled page.

API Response

[
  {
    "url": "https://example.com/page1",
    "content": "Extracted content from the webpage..."
  },
  {
    "url": "https://example.com/page2",
    "content": "More extracted content..."
  }
]
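Given the ToolCrawlResponse shape above, the returned array can be indexed by URL for quick lookup. A small sketch using sample data shaped like the response above:

```typescript
interface ToolCrawlResponse {
	url: string;
	content: string;
}

// Sample data shaped like the API response above.
const results: ToolCrawlResponse[] = [
	{ url: 'https://example.com/page1', content: 'Extracted content from the webpage...' },
	{ url: 'https://example.com/page2', content: 'More extracted content...' },
];

// Index crawled content by URL for direct lookup.
const contentByUrl = new Map(results.map(r => [r.url, r.content]));

console.log(contentByUrl.get('https://example.com/page1'));
// → Extracted content from the webpage...
```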