Online Searcher

Written by Nexcerpt on May 1st, 2013 in Patterns & News.

This text rendering of News URLs Tell Their Own Story is reprinted from the May – June 2013 Online Searcher (Volume 37, Number 03). A PDF of the print layout, as published, is available at Nexcerpt.

I’ve taken the liberty of adding several meme-based images I had planned to make part of the original article. The editors of Online Searcher wisely questioned whether such memes may be copyright protected. After some research into meme copyright, I concluded that no creature has a lifespan long enough to sort out such a convoluted jumble of claims and counterclaims. So, the images did not appear in the print edition of Online Searcher. I’ve included them here, though, because (after processing over 200 million URLs) I find them funny, and hey… it’s just a blog ;-)

News URLs Tell Their Own Story
Questions about validity, reliability, and authority of news items continue to plague information professionals, particularly as the news cycle becomes ever faster. I’ve discovered that it’s increasingly feasible to assess the relative value of an online news item from characteristics of its URL. Many URLs display value hints; this trend is growing. Certain hints, especially by their absence, suggest the URL (or even its source) has lower value for professional audiences.

URLs often provide first impressions of a news item or website. URL elements (such as formats, dates, sequences, or appended arguments) may produce expectations about the source. This trend in URL messaging matters to those who seek online news and to those who produce it.

Since 2001, Nexcerpt (nexcerpt.com) has scanned billions of URLs, and analyzed hundreds of millions, across thousands of news, media, corporate, and government domains. The system was designed to derive rules for evaluating the relative value of news items based primarily on the source’s URLs—rules surprisingly applicable to novel sources.

This took considerable effort to discover. Let me save you some time.

One does not simply locate a resource

IF URLS COULD TALK
URLs communicate more than you realize. In fact, they can talk to you about quality.

Here’s one very recognizable URL pattern:

   /2001/09/11/

While recognizable, that particular URL pattern was not commonly used in 2001. The rise of blogging platforms, which often embed “/YYYY/MM/DD/” in URLs, may be what made that mercifully logical (and sortable) format so popular today. (It now appears in 15% of URLs scanned by Nexcerpt.)

These preferred URL patterns arise naturally, by silent agreement. Early, methodical naming schemes were inevitable. Resource managers and programmers reused good schemes; others copied success. URLs containing “/1995/” led to content from that year. If the URL included “?id=555” then “?id=556” would appear next. All this seemed obvious.

Less obvious, though, was that the rigor of URL patterns, and the thinking it reflected, became increasingly correlated with the value (including authority and quality) of associated content. Solid content sources hire solid information professionals who produce solid schemes—quality attracts quality. But even in the best systems, implementation details offer insight into the character and priorities of the sponsor.

Especially over time, URL patterns reflect the standards of an organization, its offline resources, and business practices. Many URLs provide hints of goals or missions; some reveal disarray in a technical realm, or an entire organization.

NO DIGITS
Digit-free URLs suggest lower value content. Some URL assignment schemes focus on search optimization, using strings of keywords containing few (or no) dates or numbers. Nexcerpt has noted apparent correlations between such digit-free URLs and content of lower value, at least to our professional clients.

Nexcerpt’s client list is business-oriented, though it also contains some non-profits. Most focus on strategic matters in law, research, technology, marketing, or public policy. Their Nexcerpt accounts provide up to ten Boolean queries of any length. (Our longest query is over 1,300 characters; the most complex contains over 130 terms.) Given that degree of control, excerpts that match queries, and reach accounts, are likely to be of high value.

When the URL contains no digits, the associated content is less likely to match client queries. That is, an absence of URL digits is a predictor of low value content. Why should this be, for “keyword optimized” URLs?

We considered whether sources using digit-free URLs (rather than URLs themselves) might be culpable. Some are magazines familiar from your local grocery store checkout counter; others promote a (mostly conservative) political agenda; many report only local news. Yes, faddish, political, or hyperlocal sources offer fewer items of value to strategic planners. We scan some anyway, to broaden reporting on social trends—one of which is that they date or number less content!

However, URLs containing digits from those sources (when they provide such URLs) are excerpted at higher rates. URLs containing digits tend to point to articles of higher value (by client keyword relevance), even within a single domain.

We’re left wondering why. Do systems pursue a (dubious) theory that digits in URLs harm SEO? Do writers on political topics prefer not to date their work? Or are fluff pieces given fluff URLs?
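The digit test discussed in this section is easy to sketch in Python. This is only an illustration under stated assumptions: the function name `has_digits` and the example.com URLs are hypothetical, and the choice to ignore digits in the host name is mine, not a documented Nexcerpt rule.

```python
import re
from urllib.parse import urlsplit

ASCII_DIGIT = re.compile(r"[0-9]")

def has_digits(url: str) -> bool:
    """Return True if the path or query of `url` contains any digit.

    The host name is ignored on the assumption that digits there
    describe the site, not the individual news item.
    """
    parts = urlsplit(url)
    return ASCII_DIGIT.search(parts.path + "?" + parts.query) is not None

# Hypothetical examples; a scheme is required so the host is parsed separately.
print(has_digits("http://example.com/2013/03/01/budget-vote"))  # True
print(has_digits("http://example.com/celebrity-diet-secrets"))  # False
```

A test this coarse is at best one signal among many; the correlation reported above came from matching content against client queries, not from the URL alone.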

CONVEYING MEANING
Digits and associated arguments convey meaning. Common patterns notwithstanding, online news URLs remain as diverse as the domains they inhabit. Many government entities, corporate groups, publishing families, broadcast venues, and news outlets tweak their content naming scheme to match the organization’s style—or may reveal it unintentionally.

Here are a few numeric identifiers (but no pure date stamps), excluding English elements from each URL. These identifiers don’t represent the same content—they merely demonstrate diversity among numeric schemes, how a source may seek to be (or accept being) perceived, or coherence between technical and editorial missions.

   ?id=18617786 (abcnews.go.com) ABC News

   -62220678.htm (asia.cnet.com) CNET Asia

   /2125248/ (ask.slashdot.org/) Ask Slashdot

   /16270325/ (au.news.yahoo.com) Yahoo News AU

Large values suggest high volume content around the clock. Now consider these numeric elements of URLs:

   /488084 (alhayat.com) Al Hayat

   /440702/ (au.ibtimes.com) International Business Times AU

   ?story=44127 (alibi.com) Alibi

   ?articleid=1656537 (archinte.jamanetwork.com) Archives of Internal Medicine

   ?nxd_id=641582 (arkansasmatters.com) Arkansas Matters

Al Hayat and IBT are numerically matter of fact. Alibi views each item as a “story,” while JAMA calls each an “article.” By comparison, “nxd_id” at Arkansas Matters seems arbitrarily dry. The next group of URLs incorporates forms of a date:

   /201303011053.html (allafrica.com) All Africa

   /201331145711873289.html (aljazeera.com) Al Jazeera

   /AJ201303010078 (ajw.asahi.com) Asahi Shimbun

All Africa appends a sequence to publication date; Al Jazeera adds a timestamp, unhelpfully dropping zeroes from month and day. Asahi Shimbun is more readable, but a (superfluous) “AJ” tags items from “ajw” (Asia Japan Watch). As though to offset that redundancy, the “.html” is removed. And then there’s Arizona:

   /article_a58a9f67-1a06-5ef6-8944-f6b24ca5677f (azstarnet.com) AZ Starnet

This system suggests that content management was bid out. Perhaps AZ Starnet prefers to focus on reporting, and doesn’t expect humans to interact with URLs.
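To make the date-bearing identifiers quoted earlier (All Africa, Al Jazeera, Asahi Shimbun) more concrete, here is a rough Python sketch that splits such an identifier into a prefix, a publication date, and a sequence. The layout it assumes (optional capital letters, then YYYYMMDD, then a numeric sequence) is inferred only from the single samples above; the name `split_dated_id` is hypothetical.

```python
import re
from datetime import date

# Assumed layout: optional letter prefix, eight date digits, then a sequence.
DATED_ID = re.compile(r"^(?P<prefix>[A-Z]*)(?P<ymd>[0-9]{8})(?P<seq>[0-9]+)$")

def split_dated_id(identifier: str):
    """Return (prefix, publication date, sequence), or None if no valid date."""
    m = DATED_ID.match(identifier)
    if not m:
        return None
    ymd = m.group("ymd")
    try:
        published = date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))
    except ValueError:
        # The eight digits do not form a real calendar date.
        return None
    return m.group("prefix"), published, m.group("seq")

print(split_dated_id("201303011053"))        # All Africa sample
print(split_dated_id("AJ201303010078"))      # Asahi Shimbun sample
print(split_dated_id("201331145711873289"))  # Al Jazeera sample
```

Under these assumptions the All Africa and Asahi Shimbun samples both yield March 1, 2013, while the Al Jazeera sample returns None: with the zeroes dropped, the first eight digits no longer line up as a fixed-width date.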

My job [see Sidebar on Nexcerpt] includes curating over 6,000 such active sources, representing hundreds of distinct schemes. It’s fascinating to consider what such diverse URL patterns reveal.

MEETING AND CONFOUNDING EXPECTATIONS
URLs may meet or confound expectations. URL schemes often contradict natural expectations. For example, while scanning for current awareness, it may seem reasonable to ignore URLs self-identifying as “archive”—but that’s not wise.

Today, 5% of Nexcerpt sources use “archive” in new item URLs. In 2001, the usage represented less than half of one percent.

Walmart (news.walmart.com), Stanford Securities Class Action Clearinghouse (securities.stanford.edu), and Fuel Cell Today (fuelcelltoday.com), among others, embed “/news-archive/” in new item URLs, along with date of publication.

New URLs contain “/archive/” across such diverse sources as Sacramento Bee (blogs.sacbee.com), Netcraft News (news.netcraft.com), Reason (reason.com), The Stranger (slog.thestranger.com), and Talking Points Memo (talkingpointsmemo.com), to name only a few. Again, these URLs also embed publication date. But, when new items exist only in “archive,” the common meaning appears lost. It’s puzzling to add that patina to “news.”

On the other hand, some naming schemes are unambiguous. Sports reporting produces massive volume. From 15 to 30% of news search results are game previews, results rundowns, or player interviews. With hundreds of venue sponsors and team mascots, any query focused on cities, brands, or animal species tends to return a clutter of (irrelevant) sports news.

We’ve developed several rules to reduce that noise. The simplest: ignore URLs containing “/sport” (followed by optional “s” or punctuation) then any of (calendar|headlines|hq|info|news|podcast|roundup|scores).

That single rule helps us avoid over 8,000 “news” items daily. Since important, business-related sports items echo elsewhere under “/sport”-free URLs, we reduce scanning with no loss of meaningful coverage.
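The sports rule described above translates fairly directly into a regular expression. The Python sketch below is one reading of that description, not Nexcerpt's production rule: it treats “optional ‘s’ or punctuation” as an optional “s” followed by any run of non-alphanumeric characters, and the names `SPORT_NOISE` and `is_sports_noise` are hypothetical.

```python
import re

# "/sport", optional "s", optional punctuation, then one of the listed words.
SPORT_NOISE = re.compile(
    r"/sports?[\W_]*"
    r"(?:calendar|headlines|hq|info|news|podcast|roundup|scores)",
    re.IGNORECASE,
)

def is_sports_noise(url: str) -> bool:
    """Return True if `url` matches the sports-noise pattern."""
    return SPORT_NOISE.search(url) is not None

# Hypothetical URLs for illustration.
print(is_sports_noise("http://example.com/sports/news/2013/03/01/game"))   # True
print(is_sports_noise("http://example.com/sport-scores/"))                 # True
print(is_sports_noise("http://example.com/business/2013/03/01/stadium"))   # False
```

Because the listed words are matched as prefixes, a path such as “/sports/information” would also be ignored; whether that is desirable depends on the source.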

URLS - Y U NO UNIFORM?


REDUNDANT AND UGLY URLS
On the noise-control front, 15% of our rules transform URLs to reduce duplicated content. It’s astonishing how many “professional” news sites unintentionally offer multiple URL variations, assigning several non-parallel URLs to one item. We canonicalize URLs to correct such errors and avoid tasking servers with redundant requests.
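As a minimal sketch of what canonicalizing might involve, the Python below collapses a few common URL variations into one form. The article does not list Nexcerpt's actual transformations, so every step here is an assumption: lowercasing the host, dropping the fragment, removing a hypothetical set of tracking parameters, sorting what remains, and trimming a trailing slash (which is not safe for every server).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters assumed not to change which item is served.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}

def canonicalize(url: str) -> str:
    """Return one normalized form for several variants of the same URL."""
    parts = urlsplit(url)
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        path,
        urlencode(query),
        "",  # drop the fragment
    ))

a = canonicalize("HTTP://Example.com/news/2013/03/01/story/?utm_source=rss&id=42#comments")
b = canonicalize("http://example.com/news/2013/03/01/story?id=42")
print(a == b)  # True
```

Real rules would likely be source-specific, since a parameter that is noise on one site can select the article on another.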

“The Ugly URL” is another article, but if your system produces duplicate, missing, or absurd URLs, people notice!

On that note, I’ll congratulate The Horse (thehorse.com) for a recent upgrade. It offers valuable reports, for example, about healthcare (thehorse.com/free-reports/30922/winter-health-concerns).

Until recently though (February 27, 2013 to be precise) that URL—most of their URLs—carried a cookie [Ed: Line breaks added for... readability?]:

   thehorse.com/(F(e8wy2YfpF2HH40-VfKXLByc4iMY1dXIMhUV5J_
   MSLubqyazazQMSwjBiUyLoE47eaKVPKAwapOxN6jU6uQL2LG_
   xCNh2Ou4lw96V2hMdc9FOmWlqcEU5JNorj7QnJdSwrKSPdtzIoBMiiH8fzOT3CJE_
   TBwrE2DT6_ksXFSyOckGx9Ky3m-_SAGox8bKL131GsmlgQ2))
   /free-reports/30922/winter-health-concerns

Please try not to do that. Trust me. It’s for your own good.
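For anyone who must consume URLs like that one, the embedded token can be removed before comparison. The Python sketch below assumes the token always appears as its own path segment in the shape “(X(...))”, as in the sample above; that shape is inferred from one example, and the short token in the demonstration is a made-up placeholder.

```python
import re

# A path segment of the form "(F(...))": one letter, then a parenthesized token.
EMBEDDED_TOKEN = re.compile(r"/\([A-Za-z]\([^()/]+\)\)(?=/|$)")

def strip_embedded_token(url: str) -> str:
    """Remove any "(X(...))" path segment, leaving the rest of the URL intact."""
    return EMBEDDED_TOKEN.sub("", url)

# Placeholder token, much shorter than the real one shown above.
messy = "thehorse.com/(F(abc123_-XYZ))/free-reports/30922/winter-health-concerns"
print(strip_embedded_token(messy))
# thehorse.com/free-reports/30922/winter-health-concerns
```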

IT’S A DATE!
Perceptive people are already assessing your website and content, at least in part, by reading your URLs. Some become very adept at assessment. We all do this to some extent, though often unconsciously.

Consider again that recognizable date (digits in ‘/NNNN/NN/NN/’ format), how we interpret it, and what it has replaced.

If we observe a sensible year (especially 1990 to 2013), month (01 to 12), and day (01 to 31), we presume that’s the publication date, particularly if the first four digits unambiguously match a recent year: “/2013/” is more convincing than “/13/”. Some sources “reprint” very old editions, back even beyond “/1900/”. However, most “dated” URLs are from the last decade or two.

We also are more likely to assume a date if the month is padded with a zero. That is, “/2013/01/” is clearer than “/2013/1/” (online, in fact, the latter more likely means “First Quarter or Volume of 2013”). Zero padding also makes the day more recognizable (and URLs sort chronologically).
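The informal reading process described above (a sensible year, a zero-padded month and day) can be approximated in a few lines of Python. This is a sketch, not a rule taken from Nexcerpt: the function name `publication_date` is hypothetical, and the default year range simply reuses the “1990 to 2013” window mentioned in the text.

```python
import re
from datetime import date

# Four-digit year, two-digit month, two-digit day, each as a path segment.
DATE_PATH = re.compile(r"/([0-9]{4})/([0-9]{2})/([0-9]{2})/")

def publication_date(url: str, earliest: int = 1990, latest: int = 2013):
    """Return the first plausible /YYYY/MM/DD/ date in `url`, or None."""
    for match in DATE_PATH.finditer(url):
        year, month, day = (int(group) for group in match.groups())
        if not earliest <= year <= latest:
            continue
        try:
            return date(year, month, day)
        except ValueError:
            # Digits in the right shape, but not a real calendar date.
            continue
    return None

# Hypothetical URLs for illustration.
print(publication_date("http://example.com/2013/01/15/story"))  # 2013-01-15
print(publication_date("http://example.com/2013/1/story"))      # None
```

The unpadded “/2013/1/” is rejected here, matching the observation that it more often signals a quarter or volume than a month.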

As noted above, blogging popularized “directory” slashes above other delimiters. Although dashes (YYYY-MM-DD) were relatively common ten years ago, perhaps reflecting ISO 8601, from 1988 (http://en.wikipedia.org/wiki/ISO_8601), few online media still use them; underscores were, and remain, rarer.

In the early 2000s, more media (especially in the EU) used “European” dates (where “/09/11/” means “November 09”), but that has waned. Some platforms use no delimiters (e.g., “20130415”), a practice also fading.
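Both formats mentioned in the last paragraph can be illustrated briefly in Python. The sketch below parses an undelimited stamp and lists every valid reading of an ambiguous number pair; the helper names `parse_compact` and `possible_dates` are hypothetical, and the year 2001 in the example is borrowed from the URL pattern shown at the start of the article.

```python
from datetime import date, datetime

def parse_compact(stamp: str) -> date:
    """Parse an undelimited YYYYMMDD stamp such as "20130415"."""
    return datetime.strptime(stamp, "%Y%m%d").date()

def possible_dates(year: int, first: int, second: int):
    """Return every valid reading of two numbers as month/day or day/month."""
    readings = set()
    for month, day in ((first, second), (second, first)):
        try:
            readings.add(date(year, month, day))
        except ValueError:
            pass  # this ordering is not a real date
    return sorted(readings)

print(parse_compact("20130415"))     # 2013-04-15
print(possible_dates(2001, 9, 11))   # two readings: September 11 and November 9
print(possible_dates(2013, 25, 12))  # one reading: December 25
```

When both numbers are 12 or less, nothing in the URL itself settles the question; knowing the source's convention is the only reliable tie-breaker.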

UNIX, JULIAN, AND OTHER DATE FORMATS
Personally, I like Unix (aka POSIX) timestamps. They’re precise and unambiguous—being the tally of seconds since 1970/01/01/00:00 UTC—but they’re awkward for humans, as they now contain ten digits (~1360000000).
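For readers who find ten-digit values opaque, converting one is a single call in Python. The sketch below also looks for a standalone ten-digit run in a URL; that part is an assumption on my side of the screen, since a ten-digit number may just as easily be a plain sequence ID. The names `from_unix` and `timestamp_in_url` and the example URL are hypothetical.

```python
import re
from datetime import datetime, timezone

# Exactly ten digits, not part of a longer run of digits.
TEN_DIGITS = re.compile(r"(?<![0-9])([0-9]{10})(?![0-9])")

def from_unix(seconds: int) -> datetime:
    """Convert seconds since 1970-01-01 00:00 UTC to an aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def timestamp_in_url(url: str):
    """Return the datetime for the first ten-digit run in `url`, or None."""
    match = TEN_DIGITS.search(url)
    return from_unix(int(match.group(1))) if match else None

print(from_unix(1360000000))  # 2013-02-04 17:46:40+00:00
print(timestamp_in_url("http://example.com/story/1360000000.html"))
```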

When I began building rules in 2001, 10% of Nexcerpt sources (then among only 2,000) employed Unix timestamps in their URLs. Today, among 6,000 sources, use has fallen to below half of one percent. One Unix devotee is American Lawyer (americanlawyer.com), which may value the precision.

Unix also flickers among ABC broadcast affiliates, and Fairfax Media members such as Canberra Times (canberratimes.com.au) and Sydney Morning Herald (smh.com.au), as legacy elements.

Julian dates in URLs appear to be a thing of the past. Among our sources, major newspapers in Boston and Manila finally abandoned them, nearly simultaneously, several years ago. I think I speak for all sane people when I say, “Thanks for no longer rendering the things which were Caesar’s.”

Myriad other date formats also are increasingly rare in URLs—and rightfully so. (I find it stunning that defaults other than “/YYYY/MM/DD/” still appear in some blog platforms.) The bottom line is that some formats win, while some lose—and people recognize a winner.

WE’RE ALL NEWS PRODUCERS
Now we’re all news producers. Eventually, you’ll likely be involved in creating some new website or online repository. Please do not neglect or reinvent common practice. Your novel URL structure may be cute (not helping your audience), clever (actually confusing them), or fascinating (perhaps to Spock, or sentient androids, but not to humans). Your scheme may support log analysis (already being done) or SEO (ditto). You may even sense that your URLs are “obscured” (I used to work at NSA, so I’m laughing right now) from search engines who wish to understand your naming conventions.

Your giddiness over such novelty carries a price. Increasingly, if your URLs don’t reflect recognizable structure, perceptions of your content are tainted. I may be one of few persons studying URLs at this volume and detail, but I am not the only such person. We’re earnestly seeking better ways to assess the quality of online content. Our focus is on selection rules and ranking algorithms. If your URLs ignore common practice, or appear random, how do you think we’ll score that?

I don't always locate a resource - but when I do, it's uniform

LESSONS FROM THE FRONTIERS OF BIZARRE
To close, here are some bizarre schemes we’ve seen in production. Sources persisting in such behaviors shall remain nameless.

During 2010, one international source numbered URLs in reverse sequential order, like a countdown. (That one I’ll name: Asahi Shimbun, which had the good sense to stop in 2011, several months before zero.)

One major technology source uses URLs that encode a sequential article number with the (increasing) number of days since publication. Thus, the URL for an article changes completely every day—made more dramatic by its converting the string to base 36! (It’s easy, but not obvious, to derive the unique article number by reducing to base 10, and applying a simple modulo test.)

A significant number of medium-market newspapers use a scheme that constantly changes the date in every URL. Each URL contains a sequential (22-character hexadecimal) document ID. That ID persists, but the date in the URL matches whatever date you retrieve the article. In other words, you perceive “today” as “publication date,” no matter when you look.

Those make no sense to me—and I find sense in such things for a living.

To review your own URLs, consider at least:

1. Your URL structure may be the first impression a reader has of your content.

2. As a rule, URLs with recognizable dates or digits point to valuable content.

3. Any website “design” should include schemes for efficient and readable URLs.

4. As a rule, noisy “random-looking” strings in URLs ignore points 1 through 3.

Readers of Online Searcher likely have a professional obligation to understand URLs. All consumers and producers of news will be wise to consider what URLs are communicating—whether we realize it, or not.

# # #

SIDEBAR1

Universal Uniformity University

Reading URLs is the new literacy. It’s an essential job skill, which we have little excuse for lacking: URLs haven’t changed much in twenty years.

As early as 1991, Tim Berners-Lee at CERN described the URL system we still use. For a sense of how stable it has long been, see his 1994 “Universal Resource Identifiers” (w3.org/Addressing/rfc1630.txt). I find it hilarious that in the earliest descriptions, Berners-Lee uniformly used the word “Universal,” while “Uniform” is now used universally!

To understand URLs fully, it’s helpful to compare URIs—not “Locators” but “Identifiers.” Incomprehensibly, the W3 link to 2005’s “Uniform Resource Identifier (URI): Generic Syntax” is broken! A copy is available from IETF (tools.ietf.org/html/rfc3986).

# # #

SIDEBAR2

After 200 million URLs, I’ve seen it all

Not long after Nexcerpt’s launch, Barbara Quint reviewed our service in the April 2003 issue of Searcher (infotoday.com/searcher/apr03/voice.shtml). Her “killer product” quote is still linked from our site (nexcerpt.com). We’ve provided custom excerpts to private clients every day since. For examples of our Exfacto! brand of automated public feeds, see Anti-Phishing Working Group (apwg.org/apwg-news-center/news-feed-of-the-week/) or Crowdlanding (crowdlanding.com/crowdfunding_news.php).

Over the years, we’ve grown from observing about 20,000 new articles per day to a current rate of some 80,000 new articles per day (from over 6,000 online sources). In February 2013, Nexcerpt monitored its 200 millionth article!

Nexcerpt invokes source-specific rules (as regular expressions) to assess the likelihood that URLs provide valuable content. Daily reports detail the text volume, keyword tally, and other performance data seen for each URL. Where Nexcerpt’s rules are solid, article data reflect it. Otherwise, article data reveal a rule shortcoming, which we address.
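To give a feel for what source-specific rules expressed as regular expressions could look like, here is a deliberately tiny Python sketch. It is not Nexcerpt's rule set: the three patterns are guesses built from URL fragments quoted in the main article, the table name `RULES` and function `looks_like_article` are hypothetical, and so are the full example URLs.

```python
import re
from urllib.parse import urlsplit

# Hypothetical per-source rules, keyed by host name.
RULES = {
    "abcnews.go.com": re.compile(r"[?&]id=[0-9]+"),
    "alibi.com": re.compile(r"[?&]story=[0-9]+"),
    "allafrica.com": re.compile(r"/[0-9]{12}\.html$"),
}

def looks_like_article(url: str) -> bool:
    """Return True if `url` matches the rule registered for its host."""
    host = urlsplit(url).netloc.lower()
    rule = RULES.get(host)
    return bool(rule and rule.search(url))

print(looks_like_article("http://abcnews.go.com/US/story?id=18617786"))  # True
print(looks_like_article("http://abcnews.go.com/about"))                 # False
```

A real system would need many rules per source, plus the daily performance data described above to reveal where a rule falls short.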

These rules have become very effective in assessing article value—based primarily upon the internal structure of URLs. Across our 6,000 hand-curated sources, our processes observe some 4 million URLs per day. Based upon URL structure alone (and the accrued knowledge within rules) we assuredly discard 98% of those URLs; over half as not current, the rest as of limited value.
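As a quick check of the scale involved, the figures in this sidebar can be worked through in a few lines of Python. The inputs are the rounded numbers quoted above, so the result is only approximate.

```python
urls_per_day = 4_000_000   # URLs observed daily, as stated above
discard_percent = 98       # share discarded on URL structure alone

# Integer arithmetic avoids floating-point rounding.
discarded = urls_per_day * discard_percent // 100
kept = urls_per_day - discarded

print(discarded)  # 3920000
print(kept)       # 80000
```

The remainder of roughly 80,000 URLs appears consistent with the rate of some 80,000 new articles per day mentioned earlier in this sidebar.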

It’s likely that you already make similar judgments. If you don’t, you may want to start reading URLs more closely!

After leading design and development of Nexcerpt, my focus turned to quality control for these URL assessment rules. Some rules are created or validated by automated processes. However, it’s hard to beat the human eye for noticing new patterns, commonalities, and anomalies across such a mass of data.

This has led me to scan all 200 million URLs by eye. (Part of me recoils at that admission, or at least my mouse wrist does!) Yes, over 10 years, I’ve personally viewed all those URLs and their associated performance data. I’m so accustomed to it that a visual scan of 40,000 URLs each morning, and another each evening, typically consumes less than an hour of my day, though I often spend longer tweaking rules accordingly. And, yes, I still enjoy it!

That’s how I accrued uncommon expertise on the ways media sites form (and malform) URLs. Nexcerpt captures the full range of behaviors (desirable and otherwise) across domains, platforms, owners, and time. My task is to notice URL behaviors and structures, and tease out rules to assess their value.

# # #

Gary Stock (gstock@nexcerpt.com) is CEO of Nexcerpt, a custom news clipping and briefing service. He has created and managed online monitoring and search tools since 1996. Previously, he pioneered computer validation protocols for The Upjohn Company, and codeword-classified analytical methods for the National Security Agency’s Special Projects group. He aptly discovers patterns invisible to others and converts them to valuable assets. Over a varied career, Gary has presented to international gatherings including Infonortics Search Engine Conference, Information Today’s National Online Meeting, Association for Global Strategic Intelligence, and International Society for Pharmaceutical Engineering. Media interviews have twice consumed weeks of his life, over wildly viral sites he created: egosurf.com (1998) and googlewhack.com (2002). He is an accomplished jazz pianist, and his free hours focus on habitat stewardship for rare native flora and fauna, wetland conservation policy, and land use planning.

   
