Whenever you do an online search for information, you are searching a database. Information science is the study of how databases work--more specifically, how they store and retrieve information.
Knowing some key concepts from information science will help you understand the databases you're using and improve your ability to find exactly what you're looking for. So, before we dig into the tools, tips, and tricks, let's introduce a few basic information science concepts.
In addition to the Boolean operators AND, OR, and NOT, most databases have other operators to help you search more effectively. Parentheses, for example, group terms so that they are evaluated together:
(cattle OR cows OR livestock) AND methane
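To see how the parentheses group the operators, here is a minimal Python sketch that evaluates the expression above against a tiny set of records. The record texts and the helper name `matches` are made up for illustration, and the sketch matches whole words only; real databases use indexes rather than scanning every record.

```python
# Hypothetical records (id -> text), invented for illustration.
records = {
    1: "methane emissions from cattle feedlots",
    2: "dairy cows and milk production",
    3: "livestock methane reduction strategies",
    4: "methane leaks from natural gas pipelines",
}

def matches(term):
    """Return the set of record ids whose text contains the word `term`."""
    return {rid for rid, text in records.items() if term in text.split()}

# (cattle OR cows OR livestock) AND methane
# OR behaves like set union (|); AND behaves like set intersection (&).
animals = matches("cattle") | matches("cows") | matches("livestock")
result = animals & matches("methane")

print(sorted(result))  # [1, 3]
```

Record 2 mentions cows but not methane, and record 4 mentions methane but none of the animal terms, so both are left out. If the parentheses were dropped and a database happened to evaluate AND before OR, record 2 would be returned as well.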
Google has its own syntax for some Boolean operators:
planes trains automobiles
is interpreted by Google as planes AND trains AND automobiles
steel -stainless
is equivalent to steel NOT stainless
As you can see, the relationship between precision and recall can be complicated and confusing. In an early example in the video, only one marker out of three was found, but because we didn’t get any unwanted pens, the precision was 100 percent even though the recall was only 33 percent. On the other hand, in the last example, all the markers were found along with all the unwanted pens; the recall in this case was 100 percent, but the precision was only 50 percent. Striking a balance between precision and recall is a difficult problem; increasing one often decreases the other, so most search tools optimize one at the expense of the other. It's impossible to calculate the exact precision or recall of Google due to the size of the database, but both are thought to be low: estimates rarely exceed 30 percent and are sometimes as low as 5 percent.
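The two percentages in the marker examples can be reproduced with a short Python sketch. It assumes the collection holds three markers and three pens (which is what the 50 percent figure implies); the item names and the function names `precision` and `recall` are illustrative, and returning 0.0 for an empty set is simply a convention chosen here.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved items that are relevant."""
    if not retrieved:
        return 0.0  # convention chosen for this sketch
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the relevant items that were retrieved."""
    if not relevant:
        return 0.0  # convention chosen for this sketch
    return len(retrieved & relevant) / len(relevant)

# Assumed collection: three markers (wanted) and three pens (unwanted).
markers = {"marker1", "marker2", "marker3"}
pens = {"pen1", "pen2", "pen3"}

# Early example: one marker found, no pens.
early = {"marker1"}
print(f"{precision(early, markers):.0%} {recall(early, markers):.0%}")  # 100% 33%

# Last example: all the markers plus all the pens.
last = markers | pens
print(f"{precision(last, markers):.0%} {recall(last, markers):.0%}")  # 50% 100%
```

Precision is measured against what the search returned, while recall is measured against what the database actually contains, which is why the two numbers can move in opposite directions.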
You can use Boolean operators to improve precision, recall, or both, but they can only go so far. For example, at the end of the video, we got all the markers plus all the unwanted pens. Using the Boolean expression
marker NOT pen
should give us 100 percent precision and 100 percent recall. But if one of the markers is identified in our database as a “marker pen,” it would now be excluded, reducing recall to 67 percent. That’s because NOT ignores context: it excludes every record containing the word “pen,” even if it’s part of the phrase “marker pen.” As another example, let’s say we only found one out of the three markers early in the video because two of them were identified in our database as “highlighters.” Using the Boolean expression
marker OR highlighter
should find them all. But what if there are other alternate names for the markers? Even worse, what if there are typos in our database like “msrker” or “marlkeer”? Trying to string together every possibility using Boolean operators is simply not practical. Something more is needed.
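The "marker pen" problem can be sketched in a few lines of Python. The record labels below are hypothetical, chosen to match the scenario just described (three markers, one of them labeled "marker pen," plus three pens), and `has_word` is an illustrative helper rather than any real database's API.

```python
# Hypothetical database: each record is described by a short label.
records = {
    1: "marker",
    2: "marker",
    3: "marker pen",  # a marker, but its label contains the word "pen"
    4: "pen",
    5: "pen",
    6: "pen",
}
relevant = {1, 2, 3}  # the three markers we actually want

def has_word(word):
    """Ids of records whose label contains `word` as a whole word."""
    return {rid for rid, label in records.items() if word in label.split()}

# marker NOT pen: NOT behaves like set difference (-).
retrieved = has_word("marker") - has_word("pen")

precision = len(retrieved & relevant) / len(retrieved)
recall = len(retrieved & relevant) / len(relevant)
print(sorted(retrieved), f"{precision:.0%}", f"{recall:.0%}")  # [1, 2] 100% 67%
```

Record 3 is dropped because the NOT clause only sees the word "pen" in its label, not the fact that the item is a marker. A label with a typo such as "msrker" would be missed in the same mechanical way, since it never matches the word "marker" in the first place.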
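One of the most powerful ways of improving both precision and recall is the use of what's called a controlled vocabulary: a standardized list of terms, one per concept, that is assigned to the items in a database regardless of the words each item itself uses.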
In theory, once you know the controlled vocabulary term for the topic you're searching, you will be able to search with both 100 percent precision and 100 percent recall. The main drawback with implementing a controlled vocabulary is that human intervention is needed both to create the vocabulary terms and to apply them to every item in the database. While it's quite possible to apply a controlled vocabulary to a database having hundreds or thousands of records, it's not possible (yet) to do so for the trillions of web pages indexed by Google or other web search engines.
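As a rough illustration of why a controlled vocabulary helps, the Python sketch below compares a keyword search with a search on an assigned subject term. The records, the subject terms "Markers" and "Pens," and both function names are assumptions made for this example; the `subject` field stands in for the term a person would assign to each item.

```python
# Hypothetical records: a free-text name plus a subject term assigned by a person.
records = [
    {"name": "marker", "subject": "Markers"},
    {"name": "highlighter", "subject": "Markers"},
    {"name": "msrker", "subject": "Markers"},  # typo in the name
    {"name": "marker pen", "subject": "Markers"},
    {"name": "pen", "subject": "Pens"},
    {"name": "ballpoint", "subject": "Pens"},
]

def search_by_keyword(word):
    """Names of records whose free-text name contains `word` as a whole word."""
    return [r["name"] for r in records if word in r["name"].split()]

def search_by_subject(term):
    """Names of records tagged with the controlled vocabulary term `term`."""
    return [r["name"] for r in records if r["subject"] == term]

print(search_by_keyword("marker"))
# ['marker', 'marker pen']
print(search_by_subject("Markers"))
# ['marker', 'highlighter', 'msrker', 'marker pen']
```

The subject search finds every marker and no pens, whatever each record happens to be called. The cost is visible in the data itself: someone had to decide on the terms and fill in the `subject` field for every record.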
This guide is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.