Content Authentication & Headers
Configure authentication and custom headers for extracting content from protected pages, internal systems, and sites requiring specific browser identification.
Authentication Options
HTTP Basic Authentication (httpAuth
)
- Default: None
- Description: Provides HTTP Basic Authentication credentials
- Format:
base64url(username:password)
- Example:
httpAuth=YWRtaW46c2VjcmV0MTIz
Custom User Agent (userAgent
)
- Default: Standard Chrome user agent
- Description: Override the browser's user agent string
- Format: Base64URL encoded string
- Example:
userAgent=TW96aWxsYS81LjAgKGN1c3RvbSBib3Qp
Usage Examples
Basic Authentication
https://cdn.capture.page/KEY/HASH/content?url=https://protected.docs.com&httpAuth=YWRtaW46cGFzc3dvcmQ
Custom User Agent
https://cdn.capture.page/KEY/HASH/content?url=https://api-docs.com&userAgent=TW96aWxsYS81LjAgKGJvdCkgQXBwbGVXZWJLaXQ
Combined Authentication
https://cdn.capture.page/KEY/HASH/content?url=https://internal.wiki.com&httpAuth=YWRtaW46c2VjcmV0&userAgent=Q29tcGFueUJvdA
HTTP Basic Authentication
Encoding Credentials
JavaScript
function encodeAuth(username, password) {
const credentials = `${username}:${password}`;
return btoa(credentials)
.replace(/\+/g, '-')
.replace(/\//g, '_')
.replace(/=/g, '');
}
// Example usage
const auth = encodeAuth('admin', 'secret123');
// Result: YWRtaW46c2VjcmV0MTIz
Python
import base64
def encode_auth(username, password):
credentials = f"{username}:{password}"
encoded = base64.urlsafe_b64encode(credentials.encode())
return encoded.decode().rstrip('=')
auth = encode_auth('admin', 'secret123')
Command Line
echo -n "admin:secret123" | base64 | tr '+/' '-_' | tr -d '='
Common Use Cases
Internal Documentation
// Company wiki content
&httpAuth=ZW1wOndpa2lhY2Nlc3M&url=https://wiki.company.internal
// API documentation
&httpAuth=ZGV2OmRvY3NyZWFk&url=https://docs.api.internal
// Technical guides
&httpAuth=dGVjaDpndWlkZXM&url=https://guides.company.com
CMS Content Extraction
// WordPress admin content
&httpAuth=YWRtaW46d3BhZG1pbg&url=https://cms.site.com/admin
// Drupal protected content
&httpAuth=ZWRpdG9yOmNtc2VkaXQ&url=https://drupal.site.com
// Custom CMS
&httpAuth=Y21zOnVzZXI&url=https://custom.cms.com
Knowledge Bases
// Confluence spaces
&httpAuth=dXNlcjpjb25mbHVlbmNl&url=https://company.atlassian.net
// Notion pages
&httpAuth=bm90aW9uOnRva2Vu&url=https://notion.site/private-page
// Internal knowledge systems
&httpAuth=a25vd2xlZGdlOnJlYWQ&url=https://kb.internal.com
User Agent Configuration
Why Customize User Agent
- Access Control: Some systems restrict based on user agent
- Content Variation: Different content served to different agents
- Bot Identification: Identify content extraction requests
- API Documentation: Specific requirements for documentation sites
Common User Agents
Documentation Crawlers
// Generic documentation bot
const docBot = 'DocumentationBot/1.0 (Content Extraction)';
// Company-specific crawler
const companyBot = 'CompanyName-ContentBot/2.0 (+https://company.com/bot)';
// API documentation crawler
const apiBot = 'API-DocumentationCrawler/1.0';
Standard Browsers
// Latest Chrome
const chrome = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36';
// Firefox
const firefox = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0';
// Mobile Safari
const mobileSafari = 'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1';
Encoding User Agents
function encodeUserAgent(userAgent) {
return btoa(userAgent)
.replace(/\+/g, '-')
.replace(/\//g, '_')
.replace(/=/g, '');
}
// Example
const encodedUA = encodeUserAgent('MyContentBot/1.0');
Content-Specific Authentication
API Documentation
// Swagger/OpenAPI docs
const swaggerAuth = encodeAuth('api', 'docs-reader');
const swaggerUA = encodeUserAgent('SwaggerBot/1.0');
// Postman documentation
const postmanAuth = encodeAuth('docs', 'public-reader');
Knowledge Management
// Confluence authentication
const confluenceAuth = encodeAuth('content-reader', 'confluence-pass');
const confluenceUA = encodeUserAgent('ConfluenceContentBot/1.0');
// SharePoint content
const sharepointAuth = encodeAuth('sp-user', 'sp-content-pass');
CMS Content
// WordPress content extraction
const wpAuth = encodeAuth('content-api', 'wp-rest-key');
const wpUA = encodeUserAgent('WordPressContentBot/1.0');
// Drupal API access
const drupalAuth = encodeAuth('api-user', 'drupal-api-key');
Advanced Authentication Scenarios
Role-Based Content Access
// Different content for different roles
const roleAuth = {
public: null, // No auth needed
member: encodeAuth('member', 'member-pass'),
premium: encodeAuth('premium', 'premium-pass'),
admin: encodeAuth('admin', 'admin-pass')
};
function getContentByRole(url, role) {
const auth = roleAuth[role];
return auth ? `${url}&httpAuth=${auth}` : url;
}
Multi-Domain Authentication
// Different credentials for different domains
const domainAuth = {
'docs.company.com': encodeAuth('docs', 'docs-pass'),
'wiki.company.com': encodeAuth('wiki', 'wiki-pass'),
'api.company.com': encodeAuth('api', 'api-pass')
};
function getAuthForDomain(url) {
const domain = new URL(url).hostname;
return domainAuth[domain];
}
Environment-Based Authentication
// Different auth per environment
const envAuth = {
development: encodeAuth('dev', process.env.DEV_CONTENT_PASS),
staging: encodeAuth('stage', process.env.STAGE_CONTENT_PASS),
production: encodeAuth('prod', process.env.PROD_CONTENT_PASS)
};
Security Best Practices
1. Credential Management
// Use environment variables
const username = process.env.CONTENT_AUTH_USER;
const password = process.env.CONTENT_AUTH_PASS;
const auth = encodeAuth(username, password);
// Rotate credentials regularly
async function rotateContentCredentials() {
const newPass = generateSecurePassword();
await updateContentPassword(username, newPass);
return encodeAuth(username, newPass);
}
2. Access Control
// Implement read-only access
const readOnlyAuth = encodeAuth('content-reader', 'readonly-pass');
// Scope access to specific content types
const scopedAuth = {
docs: encodeAuth('docs-reader', 'docs-pass'),
api: encodeAuth('api-reader', 'api-pass'),
wiki: encodeAuth('wiki-reader', 'wiki-pass')
};
3. Audit and Monitoring
// Log content access
function logContentAccess(url, auth, success) {
console.log({
timestamp: new Date().toISOString(),
url: url,
authenticated: !!auth,
success: success,
userAgent: 'ContentBot/1.0'
});
}
Integration Examples
Documentation Archival
async function archiveDocumentation() {
const docSites = [
{
url: 'https://docs.internal.com',
auth: encodeAuth('archive', 'docs-backup'),
name: 'internal-docs'
},
{
url: 'https://api.docs.com',
auth: encodeAuth('api-backup', 'api-archive'),
name: 'api-docs'
}
];
for (const site of docSites) {
const content = await extractContent(site.url, site.auth);
await saveContent(site.name, content);
}
}
Content Migration
async function migrateContent() {
const sources = [
{ url: 'https://old-wiki.com', auth: oldWikiAuth },
{ url: 'https://legacy-docs.com', auth: legacyAuth }
];
for (const source of sources) {
const content = await extractContent(source.url, source.auth);
await importToNewSystem(content);
}
}
Content Monitoring
async function monitorContentChanges() {
const monitoredSites = [
{ url: 'https://competitor-docs.com', auth: null },
{ url: 'https://industry-wiki.com', auth: industryAuth }
];
for (const site of monitoredSites) {
const currentContent = await extractContent(site.url, site.auth);
const hasChanged = await compareWithPrevious(currentContent);
if (hasChanged) {
await notifyContentChange(site.url);
}
}
}
Troubleshooting
Authentication Failures
401 Unauthorized
// Verify credential encoding
const testAuth = 'YWRtaW46c2VjcmV0';
const decoded = atob(testAuth.replace(/-/g, '+').replace(/_/g, '/'));
console.log('Credentials:', decoded); // Should show username:password
// Test credentials manually
// Use browser or curl to verify access
403 Forbidden
- Check user permissions
- Verify IP allowlisting
- Additional headers may be required
- Rate limiting may be active
Content Access Issues
Partial Content
// Wait for dynamic content
&httpAuth=YWRtaW46cGFzcw&delay=3&waitFor=.content-loaded
// Use specific selectors
&waitFor=.main-content[data-loaded="true"]
Missing Content
// Verify authentication scope
// Check if content requires JavaScript
// Ensure proper user agent for content type
Best Practices
1. Use Descriptive User Agents
// Good - Identifies purpose and contact
const userAgent = 'CompanyContentBot/1.0 (+https://company.com/bot)';
// Avoid - Generic or misleading
const userAgent = 'Mozilla/5.0'; // Pretends to be browser
2. Implement Proper Error Handling
async function extractContentSafely(url, auth) {
try {
return await extractContent(url, auth);
} catch (error) {
if (error.status === 401) {
await refreshCredentials();
return await extractContent(url, auth);
}
throw error;
}
}
3. Respect Rate Limits
// Add delays between requests
async function extractMultiplePages(urls, auth) {
const results = [];
for (const url of urls) {
const content = await extractContent(url, auth);
results.push(content);
// Respect server limits
await sleep(1000);
}
return results;
}
See Also
- Content Timing - Ensure content is loaded before extraction
- Screenshot Authentication - Similar auth for screenshots
- PDF Authentication - Authentication for PDF generation