Helping science succeed

A brief rant about Google’s anti-science search algorithm

The internet is a big place. How big? Well, it depends on what you’re counting and when (because the numbers change so quickly), but by some measures the internet contains over a trillion gigabytes of information, 500 million websites, and 15 trillion web pages.

That’s a lot of information. And for us mere humans, it’s impossible to accumulate, let alone sort through, this kind of bulk without some very cleverly crafted computer assistance. Specifically, without internet search engines we would be totally, utterly lost on the web.

The good news, of course, is that computer scientists licked this problem years and years ago and have continued to refine the capability of internet search engines ever since.

But somewhere along the way this task of searching the web became trickier. People figured out how to scam the system and increase their search rankings. So search engines had to fight back. In order to maintain some control over the chaos, it was important to recognize the tricks being used to increase visibility and make sure these tricks weren’t creating an unfair advantage—that search results weren’t being biased in favor of flash over substance.

For a long while, flash had the upper hand. Then, Google unleashed a scorched earth campaign.

Say what now? When did all this happen? Isn’t Google a benevolent company that only acts in the public good? Doesn’t everyone love Google? Well, as it turns out, it depends on whom you ask, and no, definitely not.

Since early 2011 Google has been rolling out a major change to the way its search engine operates, one that has been throttling the way we experience the internet, deliberately burying huge amounts of information that Google has deemed essentially worthless. Google search doesn’t capture everything to begin with, but that’s not the issue here. Of the 15 trillion pages on the internet it has indexed “only” around 50 billion (which is still far more than other search engines). The issue is that of these 50 billion pages, Google—like any search engine—needs to decide what’s worth showing and what’s not. It wouldn’t help us at all to type in the word “car” and have our first 100 results be the sites that use the word “car” the most. There has to be some refined method to the madness here, figuring out what we need and combining this with what would be most helpful for us to see.

The way Google pulls this off—the way every search engine does this—is to create an algorithm that begins with certain rules about what to include in search results. This algorithm then learns on the job (with help from humans) and refines its accuracy over time—at least its accuracy at achieving whatever objectives have been set forth by human programmers.
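
To make that abstract description concrete, here is a toy sketch of what “rules plus learning on the job” could look like: a handful of ranking signals combined by weights, with the weights nudged over time in response to human feedback. Everything in it (the features, the weights, the update rule) is invented for illustration; none of it reflects Google’s actual signals or code.

```python
# Toy illustration only: hand-picked ranking signals combined by weights,
# with the weights nudged over time from rater feedback. The features,
# weights, and update rule are invented; they are not Google's.

def score(page, query, weights):
    """Combine a few crude signals into a single ranking score."""
    features = {
        "keyword_hits": page["text"].lower().count(query.lower()),
        "inbound_links": page["links_in"],
        "is_duplicate": 1.0 if page["duplicate"] else 0.0,
    }
    return sum(weights[name] * value for name, value in features.items())

def learn(weights, feature, direction, step=0.1):
    """'Learning on the job': nudge one signal's weight up (+1) or
    down (-1) in response to human quality-rater feedback."""
    weights[feature] += step * direction

weights = {"keyword_hits": 1.0, "inbound_links": 0.5, "is_duplicate": -5.0}
page = {"text": "car car car", "links_in": 12, "duplicate": False}

print(score(page, "car", weights))   # 9.0
learn(weights, "keyword_hits", -1)   # raters flag keyword stuffing
print(score(page, "car", weights))   # 8.7 -- same page now ranks lower
```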

So in early 2011 the search engine programmers at Google decided that the internet was too crowded with low-quality content—spam, plagiarized content and sites that just packed a lot of stuff into one space for no apparent reason other than to attract visitors and boost ad revenue—and decided to create an algorithm to filter this kind of content from searches. Or at least that’s the theory. The algorithm was named “Panda,” but Google’s true motives for creating it and the exact nature of how it operates are all trade secrets. The search engine optimization industry has been working overtime over the last four years trying to figure out exactly what Google is up to, why (explanations range from heroic efforts to combat spam, to sinister profit-oriented motives to boost big brands and paid search), and how to cope with the upheaval this move has created.

And upheaval is a mild way of stating that Panda has not only changed what we’ve been able to find on the internet since 2011 but has even redefined what types of businesses and information models have been allowed to succeed on the internet since then. According to critics, Panda’s implementation has destroyed many small businesses by making them invisible on the internet, and it is also destroying many information models by deeming them undesirable, even unworkable.

How can all this be possible? Search rankings tell the whole story. When a website falls lower than the first few pages of results—when you type in “cars” and your car business doesn’t show up until page 55—the “organic search” capacity of the internet is not helping your business, unless people search for you by your specific business name. You can also pay Google to place you higher on the page as a “paid search” listing.

This much is fine, though. This is just business on the internet.

The problem comes with what else is being eliminated. The main criticism of Panda is that whole websites with duplicate content appear to be penalized—as in pushed far down in search results—in many (but mysteriously not all) instances, along with sites that have “low-quality” content that is not unique, well-written, or informative, and that isn’t strongly linked to the keywords for that page. So if a car dealership thinks they can just add a blog post to their website about helping out with a recent Girl Scout cookie sale in order to boost their search visibility, sorry—that’s a great gesture but it doesn’t matter to Google because cookies and cars (the keyword) don’t mix.
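
To see why the cookie post doesn’t help, consider relevance measured as cosine similarity between bag-of-words vectors, a textbook information-retrieval yardstick (what Panda actually computes is a trade secret). In this hypothetical sketch, a post full of cookies contributes exactly nothing to a page’s relevance for the keyword “cars”:

```python
# Hypothetical sketch: relevance as cosine similarity between bag-of-words
# vectors, a standard information-retrieval measure (Panda's real scoring
# is not public). The point: a cookie post adds nothing to a page's
# relevance for the keyword "cars".
import math
from collections import Counter

def cosine(text: str, query: str) -> float:
    a, b = Counter(text.lower().split()), Counter(query.lower().split())
    dot = sum(a[w] * b[w] for w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

dealership_page = "new and used cars trucks financing test drive our cars today"
cookie_post = "we helped the girl scouts with their cookie sale thin mints for all"

print(cosine(dealership_page, "cars"))  # > 0: on-topic for the keyword
print(cosine(cookie_post, "cars"))      # 0.0: great gesture, zero relevance
```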

Sites with “thin” content are also being penalized. If you operate a website that contains bits and pieces about a wide variety of topics you’re out of luck—you’re demoted—and sites that mostly just offer up contact information and a few other “trust signals” (like reviews, ratings, and social media links) but little content are also judged unworthy of your attention and are buried so deep in search results you will never find them.

There are exceptions to these rules, of course. Google examines its algorithm and surveys real people to find out if the websites it’s excluding are worthy of re-inclusion. It intervenes with human decisions about what is quality content and what is not—which in itself may not be so onerous, except for the part about how you don’t get to make these decisions yourself.

Among the many victims of this new model for what’s allowable on the internet have been travel websites, medical information sites, ecommerce partner sites (such as Amazon affiliates), college websites without “rich” content experiences, directories, government websites without rich content, news portal sites, and more.

Sites that curate content have also been hard hit, judged by Panda’s programmers to be no better than sites that steal content. So trade groups and special interest groups that post published headlines or news releases relevant to their memberships have essentially disappeared from Google’s organic search because Panda’s programmers decided that the re-posting of headlines is not valuable in today’s world. And nonprofits that curate information about cancer treatment, for instance, have also disappeared from Google because in the eyes of this search algorithm, this information can and should be found as separate articles on the internet and not as a list of vetted and organized article titles and extracts posted on one site.

This decision by Google about what is and is not worthy content carries real consequences for many businesses. That same nonprofit that thought it worthwhile to post headlines for its members may also rely on web traffic in order to survive—in order to attract ad sponsors, new members and donations. But when it falls from page three to page 30 on a Google search, its business is doomed, along with this model of information dissemination and community service.

Many government portals have also been hit hard. Today when you search for “cancer research” on Google you get a few meaningful sites on page one and then a whole lot of secondary source material for the next 25 pages—news articles, blog posts, testimonials and so on. Good luck finding cancer.gov—the website of the National Cancer Institute, the agency at the center of the world’s fight against cancer. NCI is effectively invisible on Google—not even on the first 25 pages of search results. The NCI site is an incredibly useful starting point with links to research, definitions, care resources, clinical trials and so on, but it is “content poor” when it comes to cancer research in the estimation of Google and therefore not worthy of being noticed. The same goes for cancer research organizations like the Fred Hutchinson Cancer Research Center—one of the nation’s largest cancer research institutions and a global leader in the fight against cancer, yet invisible on Google if you’re searching for cancer research. Fred Hutch is preceded on Google by dozens of newspaper articles on cancer, books, and blog posts, if that’s what you really want.

The same applies to other searches. Google “science education” and you get an eclectic amalgam of secondary information (topped by Wikipedia, which appears near the top of most Google searches now because it contains “unique content”—never mind that most people in science education never use Wikipedia as a primary source of authoritative content), but not organizations like the National Science Teachers Association.

There is certainly enough anecdotal evidence of harm here to warrant a closer look. Perhaps a study should be funded to measure the real business and societal impact of the Panda rollout in more detail. Google has often stated that this is a work in progress and that instances of legitimate harm to businesses or content should be reported.

But maybe more voices of protest are needed, too. Panda’s impact on website traffic has been very widespread and very clear to affected businesses and industry observers for years now, and it may even be impacting the open information framework in general—the visibility of open access (OA) and Creative Commons (CC) licensed content that was supposed to flourish in a wide-open web environment. No one knows for sure whether this has happened—it simply hasn’t been measured yet—but it wouldn’t be unreasonable to assume that if illegally duplicated content is being penalized by Panda, then sites posting legally copied OA and/or CC content (ironically, content that was created with the express hope that broad sharing and re-posting would increase its visibility) are also being penalized, because there may be no recognizable difference between the two kinds of copies to Google’s web crawlers.
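
The plausibility of that assumption is easy to demonstrate. A standard way to flag near-duplicate text is shingling: break each page into overlapping word n-grams and compare the resulting sets with Jaccard similarity. The sketch below is generic (Google’s own method is not public), but it shows why legality is invisible at this level; a lawful CC re-post and a plagiarized copy of the same article score identically:

```python
# Generic near-duplicate detection via word shingles and Jaccard
# similarity, the kind of text-level comparison crawlers are widely
# understood to use (Google's own method is not public). Legality is
# invisible at this level: a lawful CC re-post and a plagiarized copy
# of the same article score identically.

def shingles(text: str, k: int = 3) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

article = "new trial data suggest the therapy improves survival for patients"
cc_repost = article      # re-posted legally under a CC license
stolen_copy = article    # plagiarized outright

print(jaccard(article, cc_repost))    # 1.0
print(jaccard(article, stolen_copy))  # 1.0 -- the crawler can't tell them apart
```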

So if harm is in fact being done to OA and CC, and if this Google situation persists, what does it mean for the future of open knowledge? What does it mean for the future of open access, which science publishing has been relying upon to improve access to research information?

To be sure, some of the worst spam sites that existed on the internet prior to 2011 have also been hit hard by Panda, but the innocent victims of this massive information reorg have far outnumbered the spammers who have been curtailed—and legitimate business losses aside (which may be a huge aside), we’ve all been hit by being less able to find the information we need.

This isn’t an isolated situation—it’s been happening across the business spectrum and around the world for the past several years, and ongoing refinements to Panda continue to tighten the noose and change the search rules. According to one internet research firm, after the September 2014 Panda update some brands saw upward of a 90 percent loss in their organic search footprint.

So why not just use a different search engine? After all, Baidu, Bing and Yahoo all do a more accurate job than Google when it comes to finding government portals of research information like cancer.gov. Why does any of this matter at all? It matters because Google wields an extraordinary amount of power in determining what we can and cannot find (at least by searching) on the internet. Somewhere between 60 and 70 percent of all the world’s searches are conducted on Google—billions of search queries every day. That’s more than all the other search engines combined. In any other industry this power would be characterized as a monopoly and subject to government intervention.

So should Google search be classified as a monopoly and subjected to government regulation? We’ve already taken a step in this direction by classifying the internet as a public good. But how about the main tool we use to search the internet? Google is in the unenviable position right now of being a monopoly sitting astride a public good, and history has shown, from railroads to electricity, that the government isn’t bashful about making sure the power wielded by monopolies in this situation is equitably distributed. And right now, allowing one company to so totally define how we search for information and what kind of information we find is not only a bad idea but one that is inviting intervention.

Maybe Google can be proactive here and introduce a variety of search engines to suit the needs and preferences of a variety of audiences? Or rethink the impact of its war on content? Or maybe Microsoft can do more with Bing to show how it is superior to Google?

Whatever the solution, we’ve become complacent with our information discovery expectations. We’ve grown to believe that Google is an adequate tool for discovering what’s on the internet, and is also a benevolent monopoly acting in an objective, scientific manner to ensure that we all experience the internet in the most unbiased, unvarnished way possible. Google has certainly done important work to help chart a path through the internet as it has evolved. And Google Scholar has done wonders to improve the discoverability of research information.

But Panda is neither benevolent nor objective. One might even make the case that it has been harming the free flow of information on the internet. And yet we seem to remain secure in our belief that we’re seeing all the information we need to see—nothing less—and that Google should be the arbiter of how we create, organize and share our information.

Should it be? Google epitomizes the same free-wheeling, hard-charging spirit of equality and self-empowerment that is at the core of the internet experience itself. But perhaps Panda is the wrong tool for the job. For now, it has made Google’s victory over “undesirable” content Pyrrhic at best.

The original version of this article was posted at 4:14 PST on March 30, 2015. An edited version was posted at 8:45 p.m. on the same day. The above edited version was posted on October 7, 2019.