Going Beyond Google

How to find information like a research pro, including the use of research databases and Boolean logic

A Brief Introduction to Information Science

Whenever you do an online search for information, you are searching a database. Information science is the study of how databases work--more specifically, how they store and retrieve information.

Knowing some key concepts from information science will help you understand the databases you're using and improve your ability to find exactly what you're looking for. So, before we dig into the tools, tips, and tricks, let's introduce a few basic information science concepts.

Boolean Logic

In addition to the Boolean operators AND, OR, and NOT, most databases have other operators to help you search more effectively.

  • Parentheses (): Used to group operations as in mathematical equations. The search engine will do the operations grouped in parentheses first. This is especially useful when we're using synonyms. For example, if we want to find information about cattle and methane, we could use (cattle OR cows OR livestock) AND methane.
  • Quotes " ": Used to specify that the quoted words must be searched together as a phrase, not as individual words.
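The operators above map neatly onto set operations, which a short Python sketch can illustrate. Everything here is hypothetical: the five records are invented, and the helpers `matching` and `phrase` are simplified stand-ins for what a real database does. The ungrouped version assumes AND is evaluated before OR, which is common but varies from database to database.

```python
# Hypothetical toy database: record ID -> text (invented for illustration).
records = {
    1: "methane emissions from cattle feedlots",
    2: "dairy cows and methane output",
    3: "livestock grazing and soil health",
    4: "methane leaks from natural gas pipelines",
    5: "a history of cattle ranching",
}

def matching(term):
    """Return IDs of records that contain the term as a whole word."""
    return {rid for rid, text in records.items() if term in text.split()}

def phrase(words):
    """Return IDs of records that contain the exact phrase (crude substring match)."""
    return {rid for rid, text in records.items() if words in text}

# (cattle OR cows OR livestock) AND methane
# OR is a set union (|); AND is a set intersection (&).
grouped = (matching("cattle") | matching("cows") | matching("livestock")) & matching("methane")
print(sorted(grouped))  # [1, 2]

# cattle OR cows OR livestock AND methane, assuming AND is evaluated first.
ungrouped = matching("cattle") | matching("cows") | (matching("livestock") & matching("methane"))
print(sorted(ungrouped))  # [1, 2, 5]

# "natural gas" as a quoted phrase
print(sorted(phrase("natural gas")))  # [4]
```

Without the parentheses, record 5 slips in even though it never mentions methane, which is why grouping synonyms matters.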

Google has its own unique syntax for some Boolean operators:

  • AND is the default. You don't need to specify AND.
    • planes trains automobiles is interpreted by Google as planes AND trains AND automobiles
  • NOT is replaced by the minus sign/hyphen "-"
    • steel -stainless is equivalent to steel NOT stainless
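As a small illustration of the two rules above, the following Python sketch builds a Google-style query string from a list of required terms and a list of excluded terms. The function `to_google_query` is a hypothetical helper written for this guide; it is not part of any Google tool or API.

```python
def to_google_query(required, excluded=()):
    """Join required terms with spaces (implicit AND) and prefix exclusions with '-'."""
    parts = list(required) + ["-" + term for term in excluded]
    return " ".join(parts)

print(to_google_query(["planes", "trains", "automobiles"]))  # planes trains automobiles
print(to_google_query(["steel"], ["stainless"]))             # steel -stainless
```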

Precision and Recall

As you can see, the relationship between precision and recall can be complicated and confusing. Precision is the share of the retrieved items that we actually wanted; recall is the share of the wanted items that were actually retrieved. In an early example in the video, only one marker out of three was found, but because we didn’t get any unwanted pens, the precision was 100 percent even though the recall was only 33 percent. On the other hand, in the last example, all the markers were found along with all the unwanted pens; the recall in this case was 100 percent, but the precision was only 50 percent. Striking a balance between precision and recall is a difficult problem; increasing one often decreases the other, so most search tools optimize one at the expense of the other. It's impossible to calculate the exact precision or recall of a Google search because of the size of the database, but both are low--estimates rarely exceed 30 percent and are sometimes as low as 5 percent.
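The two examples from the video can be worked through in a few lines of Python. This is a minimal sketch: the item names are invented, the function `precision_recall` is a hypothetical helper, and the count of three pens is an assumption chosen because it matches the 50 percent precision figure.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved items; recall = hits / relevant items."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

markers = {"marker1", "marker2", "marker3"}  # the items we want
pens = {"pen1", "pen2", "pen3"}              # unwanted items (count is an assumption)

# Early example: one marker found, no pens.
early = precision_recall({"marker1"}, markers)
print(f"precision {early[0]:.0%}, recall {early[1]:.0%}")  # precision 100%, recall 33%

# Last example: every marker found, along with every pen.
last = precision_recall(markers | pens, markers)
print(f"precision {last[0]:.0%}, recall {last[1]:.0%}")    # precision 50%, recall 100%
```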

You can use Boolean operators to improve precision and/or recall, but they can only go so far. For example, at the end of the video, we got all the markers plus all the unwanted pens. Using the Boolean expression marker NOT pen should give us 100 percent precision and 100 percent recall. But if one of the markers is identified in our database as a “marker pen,” it would now be excluded, reducing recall to 67 percent. That’s because NOT ignores context. In this case it excludes every record containing the word “pen,” even if it’s part of the phrase “marker pen.” As another example, let’s say we only found one out of the three markers early in the video because two of them were identified in our database as “highlighters.” Using the Boolean expression marker OR highlighter should find them all. But what if there are other alternate names for the markers? Even worse, what if there are typos in our database like “msrker” or “marlkeer”? Trying to string together every possibility using Boolean operators is simply not practical. Something more is needed.
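The "marker pen" problem is easy to reproduce. In the Python sketch below, the catalog and its labels are hypothetical, invented to mirror the example in the text, and NOT is modeled as set difference over whole-word matches.

```python
# Hypothetical catalog: item ID -> label as recorded in the database.
catalog = {
    "m1": "marker",
    "m2": "marker",
    "m3": "marker pen",  # a marker, but its label also contains the word "pen"
    "p1": "pen",
    "p2": "pen",
    "p3": "pen",
}
relevant = {"m1", "m2", "m3"}  # the three markers we actually want

def has(term):
    """Return IDs of items whose label contains the term as a whole word."""
    return {item for item, label in catalog.items() if term in label.split()}

# marker NOT pen: NOT is a set difference, and it ignores context.
retrieved = has("marker") - has("pen")
hits = retrieved & relevant

print(sorted(retrieved))                               # ['m1', 'm2']
print(f"precision {len(hits) / len(retrieved):.0%}")   # precision 100%
print(f"recall {len(hits) / len(relevant):.0%}")       # recall 67%
```

The item labeled "marker pen" is dropped because it matches the excluded word, so precision stays perfect while recall falls to two out of three.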

Controlled Vocabulary

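One of the most powerful ways of improving both precision and recall is the use of what's called controlled vocabulary: a standardized list of terms, with one preferred term for each concept, that is assigned to every record no matter what words the record itself uses.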

In theory, once you know the controlled vocabulary term for the topic you're searching, you will be able to search with both 100 percent precision and 100 percent recall. The main drawback with implementing a controlled vocabulary is that human intervention is needed both to create the vocabulary terms and to apply them to every item in the database. While it's quite possible to apply a controlled vocabulary to a database having hundreds or thousands of records, it's not possible (yet) to do so for the trillions of web pages indexed by Google or other web search engines.
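To make the idea concrete, here is a minimal Python sketch of a controlled vocabulary lookup. Everything in it is hypothetical: the subject terms "Markers" and "Pens", the thesaurus, the records, and the function `search_by_subject` are invented for illustration, and the subject assigned to each record stands in for the human indexing step described above.

```python
# Hypothetical thesaurus: words a searcher might type -> one preferred subject term.
PREFERRED_TERM = {
    "marker": "Markers",
    "marker pen": "Markers",
    "highlighter": "Markers",
    "pen": "Pens",
}

# Hypothetical records: a free-text label (typos included) plus a subject term
# that a human indexer assigned from the controlled vocabulary.
records = [
    {"label": "marker", "subject": "Markers"},
    {"label": "highlighter", "subject": "Markers"},
    {"label": "msrker", "subject": "Markers"},
    {"label": "pen", "subject": "Pens"},
]

def search_by_subject(query):
    """Translate the query to its preferred term, then match on the subject field only."""
    subject = PREFERRED_TERM.get(query.lower())
    if subject is None:
        return []
    return [r["label"] for r in records if r["subject"] == subject]

print(search_by_subject("marker"))       # ['marker', 'highlighter', 'msrker']
print(search_by_subject("highlighter"))  # ['marker', 'highlighter', 'msrker']
print(search_by_subject("pen"))          # ['pen']
```

Because the search matches the assigned subject rather than the label text, alternate names and typos in the labels no longer matter. The cost is visible too: someone had to fill in every `subject` value by hand.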