Tools: Crawl v1
The crawl API endpoint allows you to extract content from web pages. This is particularly useful when you need to gather information from websites for your AI applications.

The crawling functionality is powered by Spider.cloud.
Limitations
- Maximum number of URLs per request: 100
- Maximum pages per crawl (`maxPages`): 50
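Because a single request accepts at most 100 URLs, larger URL sets have to be split client-side before calling the API. A minimal batching sketch (`chunkUrls` is a hypothetical helper, not part of the Langbase SDK):

```typescript
// Split a URL list into batches that respect the 100-URLs-per-request limit.
// Hypothetical helper for illustration only; not part of the Langbase SDK.
function chunkUrls(urls: string[], limit = 100): string[][] {
	const batches: string[][] = [];
	for (let i = 0; i < urls.length; i += limit) {
		batches.push(urls.slice(i, i + limit));
	}
	return batches;
}
```

Each resulting batch can then be passed as the `url` array of a separate crawl request.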
Prerequisites
- Langbase API Key: Generate your API key from the User/Org API key documentation.
- Spider.cloud API Key: Sign up at Spider.cloud to get your crawler API key.
Crawl web pages
Extract content from web pages by sending URLs to the crawl API endpoint.
Headers
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| `Content-Type` | string | Required | Request content type. Needs to be `application/json`. |
| `Authorization` | string | Required | Replace `<YOUR_API_KEY>` with your user/org API key. |
| `LB-CRAWL-KEY` | string | Required | Your Spider.cloud API key, obtained from Spider.cloud. Replace `YOUR_SPIDER_CLOUD_API_KEY` with your key. |
Request Body
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string[] | Required | An array of URLs to crawl. Each URL should be a valid web address. Maximum 100 URLs per request. |
| `maxPages` | number | Optional | The maximum number of pages to crawl; limits the depth of the crawl operation. Maximum value: 50. |
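Putting the headers and body together, a raw HTTP request can be assembled as sketched below. The endpoint URL here is an assumption based on the SDK method name (`tools.crawl`); verify it against the Langbase API reference, or simply use the SDK as shown in the usage example.

```typescript
// Endpoint URL is an assumption inferred from the SDK naming; verify before use.
const CRAWL_ENDPOINT = 'https://api.langbase.com/v1/tools/crawl';

// Build the fetch() options for a crawl request with the documented
// headers and body fields. `buildCrawlRequest` is an illustrative helper.
function buildCrawlRequest(
	apiKey: string,
	crawlKey: string,
	urls: string[],
	maxPages?: number
) {
	return {
		method: 'POST',
		headers: {
			'Content-Type': 'application/json',
			Authorization: `Bearer ${apiKey}`,
			'LB-CRAWL-KEY': crawlKey,
		},
		body: JSON.stringify({
			url: urls,
			...(maxPages !== undefined ? { maxPages } : {}),
		}),
	};
}

// Usage: await fetch(CRAWL_ENDPOINT, buildCrawlRequest(apiKey, crawlKey, ['https://example.com'], 5));
```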
Usage example
Install the SDK:

```bash
npm i langbase
```
Environment variables

```bash
# .env file
LANGBASE_API_KEY="<USER/ORG-API-KEY>"
CRAWL_KEY="<SPIDER-CLOUD-API-KEY>"
```
Crawl web pages
```typescript
import { Langbase } from 'langbase';

const langbase = new Langbase({
	apiKey: process.env.LANGBASE_API_KEY!,
});

async function main() {
	const results = await langbase.tools.crawl({
		url: ['https://example.com'],
		apiKey: process.env.CRAWL_KEY!, // Spider.cloud API key
		maxPages: 5,
	});

	console.log('Crawled content:', results);
}

main();
```
Response
The crawl API returns `ToolCrawlResponse[]`: an array of objects containing the URL and the extracted content of each crawled page.

```typescript
interface ToolCrawlResponse {
	url: string;
	content: string;
}

type CrawlResponse = ToolCrawlResponse[];
```
| Field | Type | Description |
| --- | --- | --- |
| `url` | string | The URL of the crawled page. |
| `content` | string | The extracted content from the crawled page. |
API Response

```json
[
	{
		"url": "https://example.com/page1",
		"content": "Extracted content from the webpage..."
	},
	{
		"url": "https://example.com/page2",
		"content": "More extracted content..."
	}
]
```
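Once crawled, the response array can be fed into downstream steps, for example concatenating page content before chunking and embedding it in a RAG pipeline. A minimal sketch using the response shape above (`combinePages` is an illustrative helper, not part of the SDK):

```typescript
// Response shape returned by the crawl API, as documented above.
interface ToolCrawlResponse {
	url: string;
	content: string;
}

// Combine crawled pages into one labeled text blob, keeping the source URL
// next to each page's content so downstream chunks stay attributable.
function combinePages(results: ToolCrawlResponse[]): string {
	return results
		.map(r => `Source: ${r.url}\n${r.content}`)
		.join('\n\n');
}
```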