Magpie HTML 🦅 (v0.1.3)

    Modern web scraping for when you need the good parts, not the markup soup. Extracts clean article content, parses feeds (RSS, Atom, JSON), and gathers metadata from any page. Handles broken encodings, malformed feeds, and the chaos of real-world HTML. TypeScript-native, works everywhere. Named after the bird known for collecting valuable things... you get the idea.

    • 🎯 Isomorphic - Works in Node.js and browsers
    • 📦 Modern ESM/CJS - Dual format support
    • 🔒 Type-safe - Full TypeScript support
    • 🧪 Well-tested - Built with Node.js native test runner
    • 🚀 Minimal dependencies - Lightweight and fast
    • 🔄 Multi-Format Feed Parser - Parse RSS 2.0, Atom 1.0, and JSON Feed
    • 🔗 Smart URL Resolution - Automatic normalization to absolute URLs
    • 🛡️ Error Resilient - Graceful handling of malformed data
    • 🦅 High-Level Convenience - One-line functions for common tasks
Install from npm:

npm install magpie-html
    
Quick start:

import { gatherWebsite, gatherArticle, gatherFeed } from "magpie-html";

    // Gather complete website metadata
    const site = await gatherWebsite("https://example.com");
    console.log(site.title); // Page title
    console.log(site.description); // Meta description
    console.log(site.image); // Featured image
    console.log(site.feeds); // Discovered feeds
    console.log(site.internalLinks); // Internal links

    // Gather article content + metadata
    const article = await gatherArticle("https://example.com/article");
    console.log(article.title); // Article title
    console.log(article.content); // Clean article text
    console.log(article.wordCount); // Word count
    console.log(article.readingTime); // Reading time in minutes

    // Gather feed data
    const feed = await gatherFeed("https://example.com/feed.xml");
    console.log(feed.title); // Feed title
    console.log(feed.items); // Feed items

    Extract comprehensive metadata from any webpage:

    import { gatherWebsite } from "magpie-html";

    const site = await gatherWebsite("https://example.com");

    // Basic metadata
    console.log(site.url); // Final URL (after redirects)
    console.log(site.title); // Best title (cleaned)
    console.log(site.description); // Meta description
    console.log(site.image); // Featured image URL
    console.log(site.icon); // Site favicon/icon

    // Language & region
    console.log(site.language); // ISO 639-1 code (e.g., 'en')
    console.log(site.region); // ISO 3166-1 alpha-2 (e.g., 'US')

    // Discovered content
    console.log(site.feeds); // Array of feed URLs
    console.log(site.internalLinks); // Internal links (same domain)
    console.log(site.externalLinks); // External links (other domains)

    // Raw content
    console.log(site.html); // Raw HTML
    console.log(site.text); // Plain text (full page)

    What it does:

    • Fetches the page with automatic redirect handling
    • Extracts metadata from multiple sources (OpenGraph, Schema.org, Twitter Card, etc.)
    • Picks the "best" value for each field (longest, highest priority, cleaned)
    • Discovers RSS/Atom/JSON feeds linked on the page
    • Categorizes internal vs external links
    • Returns normalized, absolute URLs
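The discovered feeds can be fed straight into gatherFeed(). A minimal sketch using only the fields documented above:

import { gatherWebsite, gatherFeed } from "magpie-html";

// Discover a site's feeds, then fetch the first one.
const site = await gatherWebsite("https://example.com");

if (site.feeds.length > 0) {
  const feed = await gatherFeed(site.feeds[0]);
  console.log(`${feed.title}: ${feed.items.length} items`);
}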

    Extract clean article content with metadata:

    import { gatherArticle } from "magpie-html";

    const article = await gatherArticle("https://example.com/article");

    // Core content
    console.log(article.url); // Final URL
    console.log(article.title); // Article title (Readability or metadata)
    console.log(article.content); // Clean article text (formatted)
    console.log(article.description); // Excerpt/summary

    // Metrics
    console.log(article.wordCount); // Word count
    console.log(article.readingTime); // Est. reading time (minutes)

    // Media & language
    console.log(article.image); // Article image
    console.log(article.language); // Language code
    console.log(article.region); // Region code

    // Links & raw content
    console.log(article.internalLinks); // Internal links
    console.log(article.externalLinks); // External links (citations)
    console.log(article.html); // Raw HTML
    console.log(article.text); // Plain text (full page)

    What it does:

    • Uses Mozilla Readability to extract clean article content
    • Falls back to metadata extraction if Readability fails
    • Converts cleaned HTML to well-formatted plain text
    • Calculates reading metrics (word count, reading time)
    • Provides both cleaned content and raw HTML
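Putting the metrics together, a small sketch that prints a one-line summary (only the documented fields are used):

import { gatherArticle } from "magpie-html";

const article = await gatherArticle("https://example.com/article");

// Word count and reading time come straight from the result.
console.log(`${article.title} (${article.wordCount} words, ~${article.readingTime} min read)`);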

    Parse any feed format with one function:

    import { gatherFeed } from "magpie-html";

    const feed = await gatherFeed("https://example.com/feed.xml");

    // Feed metadata
    console.log(feed.title); // Feed title
    console.log(feed.description); // Feed description
    console.log(feed.url); // Feed URL
    console.log(feed.siteUrl); // Website URL

    // Feed items
for (const item of feed.items) {
  console.log(item.title); // Item title
  console.log(item.url); // Item URL (absolute)
  console.log(item.description); // Item description
  console.log(item.publishedAt); // Publication date
  console.log(item.author); // Author
}

    // Format detection
    console.log(feed.format); // 'rss', 'atom', or 'json-feed'

    What it does:

    • Auto-detects feed format (RSS 2.0, Atom 1.0, JSON Feed)
    • Normalizes all formats to a unified interface
    • Resolves relative URLs to absolute
    • Handles malformed data gracefully
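A short sketch that lists the newest items first. It assumes item.publishedAt is a Date object; if your feed yields date strings, wrap them in new Date() first:

import { gatherFeed } from "magpie-html";

const feed = await gatherFeed("https://example.com/feed.xml");

// Sort newest-first (assumption: publishedAt is a Date).
const newest = [...feed.items].sort(
  (a, b) => b.publishedAt.getTime() - a.publishedAt.getTime(),
);

for (const item of newest.slice(0, 5)) {
  console.log(`${item.publishedAt.toISOString()} ${item.title}`);
}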

    For more control, use the lower-level modules directly:

    import { pluck, parseFeed } from "magpie-html";

    // Fetch feed content
    const response = await pluck("https://example.com/feed.xml");
    const feedContent = await response.textUtf8();

    // Parse with base URL for relative links
    const result = parseFeed(feedContent, response.finalUrl);

    console.log(result.feed.title);
    console.log(result.feed.items[0].title);
    console.log(result.feed.format); // 'rss', 'atom', or 'json-feed'
Extract article content and convert HTML to plain text:

import { parseHTML, extractContent, htmlToText } from "magpie-html";

    // Parse HTML once
    const doc = parseHTML(html);

    // Extract article with Readability
const result = extractContent(doc, {
  baseUrl: "https://example.com/article",
  cleanConditionally: true,
  keepClasses: false,
});

if (result.success) {
  console.log(result.title); // Article title
  console.log(result.excerpt); // Article excerpt
  console.log(result.content); // Clean HTML
  console.log(result.textContent); // Plain text
  console.log(result.wordCount); // Word count
  console.log(result.readingTime); // Reading time
}

    // Or convert any HTML to text
const plainText = htmlToText(html, {
  preserveWhitespace: false,
  includeLinks: true,
  wrapColumn: 80,
});
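The two can also be combined: run Readability first, then flatten the cleaned HTML to wrapped plain text. A sketch using only the calls shown above:

import { parseHTML, extractContent, htmlToText } from "magpie-html";

const doc = parseHTML(html);
const extracted = extractContent(doc, { baseUrl: "https://example.com/article" });

if (extracted.success) {
  // extracted.content is the cleaned article HTML; wrap it at 80 columns.
  const text = htmlToText(extracted.content, { wrapColumn: 80, includeLinks: true });
  console.log(text);
}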
Extract specific metadata types with the dedicated extractors:

import {
  parseHTML,
  extractOpenGraph,
  extractSchemaOrg,
  extractSEO,
} from "magpie-html";

    const doc = parseHTML(html);

    // Extract OpenGraph metadata
    const og = extractOpenGraph(doc);
    console.log(og.title);
    console.log(og.description);
    console.log(og.image);

    // Extract Schema.org data
    const schema = extractSchemaOrg(doc);
    console.log(schema.articles); // NewsArticle, etc.

    // Extract SEO metadata
    const seo = extractSEO(doc);
    console.log(seo.title);
    console.log(seo.description);
    console.log(seo.keywords);

    Available extractors:

    • extractSEO - SEO meta tags
    • extractOpenGraph - OpenGraph metadata
    • extractTwitterCard - Twitter Card metadata
    • extractSchemaOrg - Schema.org / JSON-LD
    • extractCanonical - Canonical URLs
    • extractLanguage - Language detection
    • extractIcons - Favicon and icons
    • extractAssets - All linked assets (images, scripts, fonts, etc.)
    • extractLinks - Navigation links (with internal/external split)
    • extractFeedDiscovery - Discover RSS/Atom/JSON feeds
    • ...and more
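Because every extractor takes the same parsed document, you can merge sources yourself. A sketch of a "best value" merge; the ?? fallback order here is an illustrative choice, not the library's built-in priority, and it assumes extractTwitterCard exposes title/description fields analogous to OpenGraph:

import { parseHTML, extractOpenGraph, extractSEO, extractTwitterCard } from "magpie-html";

const doc = parseHTML(html);
const og = extractOpenGraph(doc);
const twitter = extractTwitterCard(doc); // assumption: has title/description
const seo = extractSEO(doc);

// Prefer OpenGraph, then Twitter Card, then plain SEO tags.
const title = og.title ?? twitter.title ?? seo.title;
const description = og.description ?? twitter.description ?? seo.description;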

    Use pluck() for robust fetching with automatic encoding and redirect handling:

    import { pluck } from "magpie-html";

const response = await pluck("https://example.com", {
  timeout: 30000, // 30 second timeout
  maxRedirects: 10, // Follow up to 10 redirects
  maxSize: 10485760, // 10MB limit
  userAgent: "MyBot/1.0",
  throwOnHttpError: true,
  strictContentType: false,
});

    // Enhanced response properties
    console.log(response.finalUrl); // URL after redirects
    console.log(response.redirectChain); // All redirect URLs
    console.log(response.detectedEncoding); // Detected charset
    console.log(response.timing); // Request timing

    // Get UTF-8 decoded content
    const text = await response.textUtf8();

    Why pluck()?

    • Handles broken sites with wrong/missing encoding declarations
    • Follows redirect chains and tracks them
    • Enforces timeouts and size limits
    • Compatible with standard fetch() API
    • Named pluck() to avoid confusion (magpies pluck things! 🦅)
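A defensive wrapper is often useful. This sketch assumes pluck() rejects on timeouts, size-limit violations, and (with throwOnHttpError) HTTP errors; the fetchPage helper name is hypothetical:

import { pluck } from "magpie-html";

// Hypothetical helper: fetch a page, returning null instead of throwing.
async function fetchPage(url: string): Promise<string | null> {
  try {
    const response = await pluck(url, { timeout: 10000, throwOnHttpError: true });
    return await response.textUtf8();
  } catch (error) {
    // Assumption: timeouts, size limits, and HTTP errors all reject.
    console.warn(`Failed to fetch ${url}:`, error);
    return null;
  }
}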

    ⚠️ SECURITY WARNING — Remote Code Execution (RCE)

    swoop() executes remote, third‑party JavaScript inside your current Node.js process (via node:vm + browser shims). This is fundamentally insecure. Only use swoop() on fully trusted targets and treat inputs as hostile by default. For any professional/untrusted scraping, run this in a real sandbox (container/VM/locked-down OS user + seccomp/apparmor/firejail, etc.).

    Note: magpie-html does not use swoop() internally. It’s provided as an optional standalone utility for the few cases where you really need DOM-only client-side rendering.

    swoop() is an explicitly experimental helper that tries to execute client-side scripts against a DOM-only environment and then returns a best-effort HTML snapshot.

    Sometimes curl / fetch / pluck() isn’t enough because the page is a SPA and only renders content after client-side JavaScript runs. swoop() exists to quickly turn “CSR-only” pages into HTML so the rest of magpie-html can work with the result.

When it works, it can be comparatively light and fast because it avoids a full browser engine, using a custom node:vm-based execution environment with browser shims instead.

    For very complicated targets (heavy JS, complex navigation, strong anti-bot, layout-dependent rendering), you should use a real browser engine instead.

    swoop() is best seen as a building block—you still need to provide the real sandboxing around it.

    • A pragmatic “SPA snapshotter” for cases where a page renders content via client-side JavaScript.
    • No browser engine: no layout/paint/CSS correctness.
    • Not a headless browser replacement (no navigation lifecycle, no reliable layout APIs).
Basic usage:

import { swoop } from "magpie-html";

const result = await swoop("https://example.com/spa", {
  waitStrategy: "networkidle",
  timeout: 3000,
});

    console.log(result.html);
    console.log(result.errors);
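The snapshot can then go through the normal extraction pipeline. A sketch (mind the RCE warning above: trusted targets only):

import { swoop, parseHTML, extractContent } from "magpie-html";

const result = await swoop("https://example.com/spa", {
  waitStrategy: "networkidle",
  timeout: 3000,
});

// Treat the snapshot like any other fetched HTML.
const doc = parseHTML(result.html);
const content = extractContent(doc, { baseUrl: "https://example.com/spa" });
if (content.success) {
  console.log(content.textContent);
}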

    Best Practice: Parse HTML once and reuse the document:

import {
  parseHTML,
  extractSEO,
  extractOpenGraph,
  extractContent,
} from "magpie-html";

    const doc = parseHTML(html);

    // Reuse the same document for multiple extractions
    const seo = extractSEO(doc); // Fast: <5ms
    const og = extractOpenGraph(doc); // Fast: <5ms
    const content = extractContent(doc); // ~100-500ms

    // Total: One parse + all extractions
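To sanity-check those timings on your own pages, a small sketch (assumes an environment where performance.now() is available, i.e. modern Node.js or a browser):

import { parseHTML, extractSEO, extractOpenGraph } from "magpie-html";

const t0 = performance.now();
const doc = parseHTML(html);
const seo = extractSEO(doc);
const og = extractOpenGraph(doc);
console.log(`One parse + two extractions: ${(performance.now() - t0).toFixed(1)}ms`);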
To work on the package locally, install dependencies:

npm install
    
Run the test suite:

npm test
    

    The test suite includes both unit tests (*.test.ts) and integration tests using real-world HTML/feed files from cache/.

Run tests in watch mode:

npm run test:watch
    
Build the package:

npm run build
    
    # Check for issues
    npm run lint

    # Auto-fix issues
    npm run lint:fix

    # Format code
    npm run format

    # Run all checks (typecheck + lint)
    npm run check
# Type-check only
npm run typecheck
    

    Generate API documentation:

    npm run docs
    npm run docs:serve

    The cache/ directory contains real-world HTML and feed samples for integration testing. This enables testing against actual production data without network calls.

Publish a new release:

npm publish
    

    The prepublishOnly script automatically builds the package before publishing.


    If this package helps your project, consider sponsoring its maintenance:

    GitHub Sponsors


Anonyfox · MIT License