JT's Weblog

Web Scraping Helper

published: November 12, 2025 | estimate: 4 min read

Here comes the tasty web scraping helper!

I briefly mentioned this in my previous article, Weekly Review W45, in the section “About The Web Scraping Helper”.

The motivation is to provide a simple, clean way to scrape data from a page without the heavy lifting. No need to install packages or run automation tools like Playwright or Selenium. Just open your devtools and run the code snippet or, even faster, click your bookmarklet.
Then select two visually consecutive elements, and voilà! Your desired data is now in the clipboard.

This simple JavaScript solution grew out of my pursuit of a better browsing experience, where I compose simple snippets and store them as bookmarklets or Tampermonkey scripts. (It’s also the product of my efforts to figure out what makes XPath better than CSS selectors.)

Enough talking. Let’s try it out!
Drag it to your bookmarks bar —> The Bookmarklet

Usage

  1. Click the bookmarklet to initialize all the functions
  2. Press Ctrl + Shift and move your cursor around
    • Hovered elements will be highlighted
    • Release Ctrl + Shift to disable the hover highlight
  3. Click on two visually consecutive elements
    • Clicked items are automatically stored in global variables $0 and $1 ($0 is the most recent one)
    • A simple algorithm runs to create a generalized XPath pattern that matches both elements and their pattern siblings
    • The text from all matching elements is then extracted and copied to the clipboard
  4. Double-click anywhere on the page to reset to the initial state
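
Step 3 keeps the two most recent clicks in $0 and $1. Here is a minimal sketch of that bookkeeping, assuming a small history object. The name createSelectionHistory and the event wiring in the comments are hypothetical, not taken from the bookmarklet's source.

```javascript
// Hypothetical helper: remembers the two most recent selections, newest first,
// mirroring the console's $0 / $1 convention.
function createSelectionHistory() {
  const recent = [];
  return {
    push(element) {
      recent.unshift(element); // newest goes to the front
      recent.length = Math.min(recent.length, 2); // drop anything older than $1
    },
    get $0() { return recent[0]; },
    get $1() { return recent[1]; },
    reset() { recent.length = 0; },
  };
}

// In a page, one possible wiring (an assumption, not the bookmarklet's actual code):
//   const picks = createSelectionHistory();
//   document.addEventListener("click", (e) => picks.push(e.target), true);
//   document.addEventListener("dblclick", () => picks.reset(), true);
```

Keeping the history in its own object means the reset in step 4 only has to clear one array.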

Technical Discussion

One of the challenges is porting console utilities from “console context” to “page context.”

If you run JavaScript from anywhere other than the console, you don’t have access to functions or variables such as $x, copy, $0, and $1.

Thanks to LLMs, those functions are not too difficult to port.
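
As a rough illustration, page-context stand-ins for $x and copy could look like the sketch below. It relies on the standard document.evaluate and navigator.clipboard.writeText APIs. The names queryXPath and copyText are hypothetical, and the bookmarklet's real ports may differ.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, spelled out so the
// helper does not depend on the XPathResult global being in scope.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Stand-in for the console's $x(): returns the matching nodes as a plain array.
function queryXPath(expression, doc = document) {
  const snapshot = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}

// Stand-in for the console's copy(): writes text to the clipboard and returns
// the promise from writeText. The Clipboard API requires a secure context
// (HTTPS) and, in most browsers, a recent user gesture such as a click.
function copyText(text, clipboard = navigator.clipboard) {
  return clipboard.writeText(String(text));
}

// Possible usage in a page:
//   const lines = queryXPath("//li").map((node) => node.textContent.trim());
//   copyText(lines.join("\n"));
```

The optional second parameters exist only so the helpers can be exercised outside a browser. In a page they fall back to the real document and clipboard.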

The second challenge is the dynamic nature of the DOM—it’s possible that two visually consecutive elements don’t share the same XPath.
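
For context, the XPaths being compared are positional paths from the document root. Here is a minimal sketch of how such a path could be built for an element, assuming an index is only added when the element has same-tag siblings. The name absoluteXPath is hypothetical, and the bookmarklet may build its paths differently.

```javascript
// 1-based position of the element among preceding siblings with the same tag.
function positionAmongSameTag(element) {
  let index = 1;
  for (let sib = element.previousElementSibling; sib; sib = sib.previousElementSibling) {
    if (sib.tagName === element.tagName) index++;
  }
  return index;
}

// True when the parent holds more than one child with this element's tag.
function hasSameTagSibling(element) {
  const parent = element.parentElement;
  if (!parent) return false;
  return Array.from(parent.children).filter((c) => c.tagName === element.tagName).length > 1;
}

// Walks up to the root and builds a path such as /html/body/ul/li[2].
function absoluteXPath(element) {
  const segments = [];
  for (let node = element; node; node = node.parentElement) {
    const tag = node.tagName.toLowerCase();
    segments.unshift(hasSameTagSibling(node) ? `${tag}[${positionAmongSameTag(node)}]` : tag);
  }
  return "/" + segments.join("/");
}
```

Two list items that look adjacent on screen can end up with paths of different depth when one of them is wrapped in extra markup.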

In short, if the two XPaths have the same number of segments, the algorithm replaces the differing segment with a wildcard, producing something like /lv1/*/lv3, then uses that XPath to query all pattern siblings. (Both elements share the segments lv1 and lv3; the only difference is in the middle, e.g., lv2[3] for the first element and lv2[4] for the other.)

If the two XPaths have different numbers of segments, the algorithm scans from both ends, finds the common prefix and the common suffix, then joins them with //, which matches descendants at any depth. For example, given element 1 with XPath /lv1/lv2[1]/lv3/lv4 and element 2 with XPath /lv1/lv2[3]/xxx/zzz/lv3/lv4, the algorithm queries all pattern siblings with /lv1/lv2//lv3/lv4. This works as long as the two elements share an ancestor container with some common prefix and suffix structure.
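
The two cases can be sketched as a pure string function. This illustrates the described behavior and is not the bookmarklet's actual source. The name generalizeXPath and its helpers are hypothetical. Keeping a bare tag name when only the index differs (lv2[1] and lv2[3] becoming lv2) is inferred from the example, as is reserving the last segment for the suffix.

```javascript
// "/a/b[2]/c" -> ["a", "b[2]", "c"]
function splitXPath(xpath) {
  return xpath.split("/").filter(Boolean);
}

// "b[2]" -> "b"; segments without an index are returned unchanged.
function tagOf(segment) {
  return segment.replace(/\[\d+\]$/, "");
}

// Identical segments are kept; same tag with a different index keeps the bare
// tag; anything else cannot be merged.
function mergeSegment(a, b) {
  if (a === b) return a;
  if (tagOf(a) === tagOf(b)) return tagOf(a);
  return null;
}

function generalizeXPath(xpathA, xpathB) {
  const a = splitXPath(xpathA);
  const b = splitXPath(xpathB);

  // Case 1: same number of segments, so wildcard every position that differs.
  if (a.length === b.length) {
    return "/" + a.map((seg, i) => (seg === b[i] ? seg : "*")).join("/");
  }

  // Case 2: different depths, so collect the common prefix and suffix.
  const shorter = Math.min(a.length, b.length);
  const prefix = [];
  // Stop one short of the shorter path so the target segment can land in the suffix.
  while (prefix.length < shorter - 1) {
    const merged = mergeSegment(a[prefix.length], b[prefix.length]);
    if (merged === null) break;
    prefix.push(merged);
  }
  const suffix = [];
  while (prefix.length + suffix.length < shorter) {
    const offset = suffix.length + 1;
    const merged = mergeSegment(a[a.length - offset], b[b.length - offset]);
    if (merged === null) break;
    suffix.unshift(merged);
  }

  const head = prefix.length ? "/" + prefix.join("/") : "";
  const tail = suffix.length ? suffix.join("/") : "*";
  return head + "//" + tail;
}

console.log(generalizeXPath("/lv1/lv2[3]/lv3", "/lv1/lv2[4]/lv3"));
// -> /lv1/*/lv3
console.log(generalizeXPath("/lv1/lv2[1]/lv3/lv4", "/lv1/lv2[3]/xxx/zzz/lv3/lv4"));
// -> /lv1/lv2//lv3/lv4
```

Because the function only manipulates strings, it can be tried in any JavaScript console before being wired to real elements.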

However, the tool struggles when pattern siblings are split across different containers. In these cases, the algorithm casts too wide a net, finding a higher-level ancestor that captures unwanted elements along with your target data. A proper solution would require the algorithm to recognize each container and parse its pattern siblings separately. Until then, you’ll need multiple passes or heavier-duty scraping tools.

Final Thought

The idea of scraping data like a ninja has been on my mind for a while.

It’s quite fulfilling to bring this idea to life.

The journey doesn’t end here. It’s possible to fine-tune the algorithm to catch more use cases, publish the tool as a browser extension to skip the initialization step, or compare this method with the trending MCP/Playwright approach to see which one wins the ninja title.

All these tasks are doable; however, I’ll leave them to my future self.

For now, I’m planning to kick off a game project featuring Vim keys with ninja spirit (ninja FTW! 🤣). This project will also introduce a few more topics, including Blazor, DDD, and Azure, which are new to this blog.

Hope you find the bookmarklet useful, and stay tuned for more upcoming .NET content! 😎


