decruft - A Rust Readability Tool

I read a lot on the web. I save a lot of articles to read later. And I’ve spent the last few years watching the web get harder to actually read. Newsletter popups. Cookie banners. Subscribe-to-read modals. Right rails that scroll faster than you do. Authors hidden behind eight tags, four share buttons, and an estimated read time. The actual words I came for, somewhere underneath all of it.

I’ve been a happy user of Readability.js for years, and more recently Steph Ango’s defuddle, which is a TypeScript library that does a particularly good job of stripping the noise. defuddle is a bit more opinionated than Readability and friends, which I like; it makes stronger calls about what is and isn’t part of an article. defuddle has a CLI but it’s a Node tool, and I wanted something I could install with cargo install, drop into shell pipelines, and call from Rust services without dragging Node along for the ride. So I wrote one. It’s called decruft.

It does what it says on the tin. Point it at a URL or some HTML, and it gives you back the article body and the metadata, with the cruft removed.

What you get

A URL on the way in:

$ decruft https://www.kartar.net/posts/how-will-engineers-learn | jq '{title, author, word_count}'
{
"title": "How Will New Engineers Learn to Code?",
"author": "James Turnbull",
"word_count": 1410
}

URLs are auto-detected and fetched. You can also pass a local HTML file or pipe HTML in on stdin. Output is JSON by default, or pick -f html for clean HTML, -f text for plain text, or -f markdown for a markdown article.

The metadata fields it tries to extract are the usual suspects: title, author, published date, description, image, language, site name, favicon, canonical URL, word count. Each one is pulled through a fallback chain (Open Graph, Twitter cards, schema.org JSON-LD, then DOM heuristics), so that if one source is missing you still get something useful. Fields it can’t find are absent from the output rather than empty strings, which matters more than it sounds. Empty strings have a habit of becoming bug reports.
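
To make the fallback idea concrete, here is a minimal sketch of how one field could be resolved. It is not decruft's actual code: the `RawSources` struct, its field names, and the `resolve_field` function are hypothetical, and the priority order simply follows the one listed above. Blank values are treated as missing, so the result is `None` rather than an empty string.

```rust
/// Hypothetical bag of raw values for one field, one slot per source.
/// These names are illustrative, not decruft's internal types.
#[derive(Default)]
struct RawSources {
    open_graph: Option<String>,
    twitter_card: Option<String>,
    json_ld: Option<String>,
    dom_heuristic: Option<String>,
}

/// Treat blank or whitespace-only values as missing.
fn non_empty(value: &Option<String>) -> Option<String> {
    value
        .as_deref()
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(str::to_owned)
}

/// Walk the sources in priority order and return the first usable value.
fn resolve_field(sources: &RawSources) -> Option<String> {
    non_empty(&sources.open_graph)
        .or_else(|| non_empty(&sources.twitter_card))
        .or_else(|| non_empty(&sources.json_ld))
        .or_else(|| non_empty(&sources.dom_heuristic))
}
```

Returning `Option<String>` means a serialiser can skip the field entirely when nothing was found, which is the "absent rather than empty" behaviour described above.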

As a library it looks like this:

use decruft::{parse, DecruftOptions};

// `html` holds the page source you already have, e.g. read from a file.
let mut options = DecruftOptions::default();
options.url = Some("https://example.com/article".into());
let result = parse(html, &options);
println!("Title: {:?}", result.title);
println!("Author: {:?}", result.author);
println!("Content: {}", result.content);

There’s a fetch_page helper too, with browser-like defaults, in case you want to skip the curl dance.

How the cruft gets removed

The pipeline is the interesting bit. Roughly, it goes:

  1. Parse the HTML and pull out any schema.org JSON-LD.
  2. Extract the metadata using those fallback chains.
  3. Try site-specific extractors. There are dedicated paths for GitHub issues and PRs, Hacker News, Reddit, Stack Overflow, Lobsters, Substack, C2 Wiki, and X/Twitter. Some of them have API fallbacks, with varying levels of reliability, for when the page you fetched is really just a JavaScript shell.
  4. Score the rest of the DOM and pick the best content root.
  5. Standardise the awkward stuff (footnotes, callouts, code blocks, math) into a canonical shape.
  6. Remove ads, navigation, sidebars, share buttons, comments, related posts, newsletter signups, cookie banners, and other noise using CSS selectors and partial class and id patterns.
  7. Score what’s left and prune the link-dense or boilerplate-looking blocks.
  8. Normalise the output: clean attributes, resolve relative URLs, deduplicate images.
  9. If the result is too short, relax the filters and try again.
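
Steps 6 and 7 are easier to picture with a little code. The sketch below is a simplified illustration, not decruft's implementation: the pattern list, the `Block` struct, and the 0.5 threshold used in the note afterwards are all assumptions made for the example. It shows partial matching on class and id values, and a link-density check that flags blocks made mostly of link text.

```rust
/// Illustrative noise patterns; decruft's real list is not reproduced here.
const NOISE_PATTERNS: &[&str] = &["newsletter", "cookie", "share", "sidebar", "related"];

/// True if a class or id value contains any noise pattern as a substring.
fn looks_like_noise(class_or_id: &str) -> bool {
    let lowered = class_or_id.to_ascii_lowercase();
    NOISE_PATTERNS.iter().any(|p| lowered.contains(*p))
}

/// Simplified view of a block: all of its text, and the part inside <a> tags.
struct Block<'a> {
    text: &'a str,
    link_text: &'a str,
}

/// Fraction of a block's characters that sit inside links (0.0 if empty).
fn link_density(block: &Block) -> f64 {
    let total = block.text.chars().count();
    if total == 0 {
        return 0.0;
    }
    block.link_text.chars().count() as f64 / total as f64
}

/// Keep a block only if its link density is at or below `max_density`.
fn keep_block(block: &Block, max_density: f64) -> bool {
    link_density(block) <= max_density
}
```

With an assumed threshold of 0.5, a navigation strip whose text is entirely links gets pruned, while a paragraph containing a single short link survives.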

I’ve found that content extraction is a balancing act between cutting too much (you lose paragraphs from short posts) and cutting too little (the page comes back with a newsletter widget still bolted onto it). The retry-with-relaxed-filters approach is borrowed from defuddle and works well-ish in practice.
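
The retry loop can be sketched roughly as follows. This is an assumption-laden illustration rather than what decruft or defuddle actually do: the `Filters` struct, the choice of which filter is relaxed first, and the `min_words` threshold are all hypothetical. The extractor is passed in as a closure so the retry logic stands on its own.

```rust
/// Hypothetical knobs for one extraction pass.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Filters {
    remove_by_pattern: bool,
    prune_low_scoring: bool,
}

/// Count whitespace-separated words.
fn word_count(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Run `extract` with strict filters first, then with progressively relaxed
/// ones. Return the first result that reaches `min_words`; if none does,
/// return the longest attempt.
fn extract_with_retry<F>(extract: F, min_words: usize) -> String
where
    F: Fn(Filters) -> String,
{
    let attempts = [
        Filters { remove_by_pattern: true, prune_low_scoring: true },
        Filters { remove_by_pattern: true, prune_low_scoring: false },
        Filters { remove_by_pattern: false, prune_low_scoring: false },
    ];
    let mut best = String::new();
    for filters in attempts {
        let candidate = extract(filters);
        if word_count(&candidate) >= min_words {
            return candidate;
        }
        if word_count(&candidate) > word_count(&best) {
            best = candidate;
        }
    }
    best
}
```

Falling back to the longest attempt, rather than returning nothing, matches the spirit of the trade-off above: a slightly noisy article is usually more useful than an empty one.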

Where it falls down

decruft works on the HTML you give it. It is not a browser engine.

That means fully client-rendered sites are the main weak spot: pages that ship a mostly empty <body> and build the article with JavaScript after load. For those, you need to render the page first with something like Playwright or Puppeteer, then pass the rendered HTML to decruft.

Hard paywalls are also out of scope. decruft does not try to bypass access controls. If the server does not send the article body in the HTML, there is nothing for it to extract.

And content extraction is still heuristic. Some sites have unusual markup, aggressive overlays, malformed HTML, or article layouts that look like boilerplate. I’m collecting failing pages as test fixtures, so if decruft mangles a page, issues with an HTML fixture and expected output are useful.

Try it

cargo install decruft
decruft https://your-favourite-bloated-news-site.example/article -f markdown

The source is on GitHub, the crate is on crates.io, and the API docs are on docs.rs. MIT licensed. It is very new, and built mostly to solve a problem for me, so feedback is welcome, especially sites where extraction fails or produces bad output. Issues and pull requests are best if they come with an HTML fixture and a “this is what it should look like” expected output.