Can You Trust Your Data?
Why Keywords and Search Alone are Not Enough

by Catherine van Zuylen, Vice President of Product Marketing, Attensity Group
Thursday, February 24, 2011

For years, information that was relevant to an enterprise was numeric and almost entirely found within the enterprise. Sales figures, inventory levels and gross margin numbers were trusted data that was used to drive business decisions.

Managers know that to run a business successfully, you must also take into account a large amount of “unstructured” information, such as customer relationship management (CRM) and call center notes, analyst reports and emails. This need led to the rise of text mining products that take this “unstructured” information and classify it for use in downstream data analysis.

Recently, more than just the nature of the data has changed. People are turning to social media to communicate their experiences and issues with products and companies, and this information is being added to traditional sources of “voice of the customer” information like surveys, emails, and CRM notes. The Internet has grown into a treasure trove of customer, product, competitive, research and market information.

The Rise Of Search

In response to the rising volumes of data inside companies and the growth of the Internet in the late 1990s, search became a ubiquitous technology in enterprises and for consumers. Search can be a very useful tool for locating a particular document of interest containing a search term.

However, search is inadequate for revealing trends or performing complex operations. For example, if a company is interested in monitoring social media for its customers’ top 10 product issues, it will generally create queries for the company name, their products, and maybe even include keywords of known issues. Then, personnel must manually read through the search results and tally up the issues by hand. Depending on the topic and the brand, the data returned can be enormous and the risk of the inclusion of a lot of “noise” very high.

Manual search and recording isn’t that difficult if you are talking about processing, say, 100 survey comments a month. But it’s impossible to get a near real-time snapshot of what people think about you in social media by using only this method. It’s also difficult to get a uniform manual coding structure to which multiple people can adhere to create an effective, reliable database of information.

The Emergence Of Keyword-Based Text Mining

Using search becomes even more complicated when it is difficult to define a small set of keywords that can get you the answers you seek. Other actions difficult to do with the search engine approach include having an engine automatically route to product management various product suggestions, or having customer service be quickly and automatically alerted to messages indicating an intent-to-leave.

A similar situation used to exist in the world of physical goods. When there was a limited variety of goods sold at the corner shop, it was not a big deal to manually inventory, sort, and count goods. But as the volume and variety of goods increased, and as just-in-time delivery of those goods became more expected, the process needed to become automated. So the barcode was invented and adopted as a standard way of expressing the “aboutness” of a product.

Likewise, with the rise of unstructured data volumes and the need for just-in-time delivery of information, a new system was needed that would go beyond keyword indexing. This new system was text analysis – a way of “barcoding text” that could express information about that text (what it is, where it is located, how it relates to the words around it) and allow it to be mined for information and moved to the appropriate place based on its “aboutness.”

The first text analysis systems were essentially extensions of the search indexing principle. “Categories” were defined by creating sets of keywords that could be used to define a specific characteristic. This sort of keyword-based system finds sentences like “my room smelled bad” or “this room is really stinky” and classifies those sentences into a category of “smelly bedroom” on which a report can then be generated.
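A keyword-defined category of this kind can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the keyword sets, the `SMELLY_BEDROOM` name and the rule that every keyword group must be hit are assumptions made for the example.

```python
import re

# Hypothetical category definition: a sentence matches when it mentions
# at least one "place" keyword and at least one "smell" keyword.
SMELLY_BEDROOM = {
    "place": {"room", "bedroom"},
    "smell": {"smelled", "smelly", "stinky", "stank", "odor"},
}

def tokenize(sentence):
    """Lowercase the sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def matches_category(sentence, category):
    """Return True when every keyword group has at least one hit."""
    tokens = tokenize(sentence)
    return all(tokens & keywords for keywords in category.values())

comments = [
    "my room smelled bad",
    "this room is really stinky",
    "the pool was closed",
]
flagged = [c for c in comments if matches_category(c, SMELLY_BEDROOM)]
print(flagged)
```

The two room comments are flagged and the pool comment is not. Everything depends on the keyword sets having been anticipated in advance.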

Challenges Inherent to Keyword-Based Classification

Language is a tricky thing. While a keyword-based system will accurately locate and mark these examples, it begins to have accuracy issues when dealing with more complex text. These accuracy issues involve both

  • precision (the number of items correctly labeled as belonging to a class [true positives] divided by the total number of items labeled as belonging to that class [true positives plus false positives]); and
  • recall (the number of true positives divided by the total number of items that actually belong in the class [true positives plus false negatives, the items that belong but were not marked]).
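As a worked illustration, both measures can be computed from three counts: true positives, false positives and false negatives. The counts below are invented for the example and do not come from any real data set.

```python
def precision(true_pos, false_pos):
    """Share of items labeled as in the class that really belong to it."""
    labeled = true_pos + false_pos
    return true_pos / labeled if labeled else 0.0

def recall(true_pos, false_neg):
    """Share of items that really belong to the class that were labeled."""
    actual = true_pos + false_neg
    return true_pos / actual if actual else 0.0

# Hypothetical counts: 50 comments were labeled "smelly bedroom",
# 40 of them correctly, and 40 genuine complaints were missed.
tp, fp, fn = 40, 10, 40
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 0.5
```

A system can therefore look precise (most of what it flags is right) while still missing half of the relevant comments.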

Another problem arises with the complexity of keyword-only-based category definitions. For example, a retail store might like to track electronics department “neatness” from survey and other comments. So it laboriously constructs a category that looks something like the following:

electronics department, electronics dept, TV department, electronics area


bedraggled, begrimed, black, contaminated, cruddy, crummy, defiled, dirty; disarrayed, dishabille, disheveled, dreggy, dungy, dusty, filthy, foul, fouled, greasy, grimy, grubby, grungy, horrible, icky, lousy, messy, mucky, muddy, mung, murky, nasty, pigpen, polluted, raunchy, scummy, scuzzy, slatternly, slimy, sloppy, slovenly, smudged, smutty, sooty, spattered, spotted, squalid, stained, straggly, sullied, undusted, unhygienic, unkempt, unlaundered, unsanitary, unsightly, unswept, untidy, unwashed, yucky, uncluttered, clean, cleansed, clear, delicate, dirtless, elegant, faultless, flawless, fresh, graceful, hygienic, immaculate, laundered, neat, neat as a button, neat as a pin, orderly, pure, sanitary, shining, simple, snowy, sparkling, speckless, spic and span, spotless, squeaky, stainless, taintless, tidy, trim, unblemished, unpolluted, unsmudged, unsoiled, unspotted, unstained, unsullied, untarnished, washed, well-kept

Not only does this approach require a good deal of imagination but it is also quite brittle. Even using this extensive list of keywords to define a category, there are still many false positives and misses:

Actual sentence: “Electronics department could be cleaner”
Keyword extraction: Not detected because “cleaner” is not defined in the category set. This is a recall issue.

Actual sentence: “The electronics zone was dirty”
Keyword extraction: “Dirty” is extracted, but because “electronics zone” isn’t a keyword, this instance is missed. This is also a recall issue.

Actual sentence: “electronics department clerk was rude; she had filthy hair and tattoos.”
Keyword extraction: Here, “filthy” could be erroneously associated with the electronics department as opposed to the electronics department clerk. This is a precision issue.

Actual sentence: “Allison helped me in the electronics department and she was fantastic, while the housewares department was HORRIBLE”
Keyword extraction: “Horrible” and “electronics department” both match, so the sentence is falsely categorized as an electronics department neatness issue. This is a precision issue for this task.
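The four outcomes above can be reproduced with a small sketch of a two-list keyword category. The place terms are the ones listed earlier. The condition terms are a shortened subset of the long list. The whole-word matching rule is an assumption for illustration.

```python
import re

# Place terms from the category definition above.
PLACE_TERMS = ["electronics department", "electronics dept",
               "TV department", "electronics area"]
# A shortened, illustrative subset of the condition keywords.
CONDITION_TERMS = ["dirty", "filthy", "horrible", "messy", "untidy",
                   "clean", "neat", "spotless", "tidy"]

def contains_any(text, terms):
    """True when any term appears in the text as a whole word or phrase."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(term.lower()) + r"\b", lowered)
               for term in terms)

def is_neatness_comment(text):
    """Keyword rule: one place term and one condition term must both occur."""
    return contains_any(text, PLACE_TERMS) and contains_any(text, CONDITION_TERMS)

sentences = [
    "Electronics department could be cleaner",
    "The electronics zone was dirty",
    "electronics department clerk was rude; she had filthy hair and tattoos.",
    "Allison helped me in the electronics department and she was fantastic, "
    "while the housewares department was HORRIBLE",
]
results = [is_neatness_comment(s) for s in sentences]
print(results)
```

The first two sentences are missed (recall) and the last two are flagged even though neither is about the neatness of the electronics department (precision).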

In addition, by using only keyword-based classification, you miss chances for discovery of issues. For example, if you have many people complaining about tattooed employees, but you haven’t previously thought to create a “tattooed employee” class, you might never know this was a problem.

Unfortunately, these types of deficiencies are not always readily apparent in a demonstration of a keyword-based system. Many times, a keyword-based software manufacturer will take sample data provided by the customer and then perfectly fit the keyword categories to the task at hand. Only after a company has run real-world data through the system unaided by the manufacturer do these types of issues begin to surface.

A New Type Of Text Analytics: Exhaustive Extraction

In answer to this brittleness, a new type of text analytics system arose that takes into account not only keywords but also the context of those keywords.

This type of “exhaustive extraction” is the ability to automatically extract people, places and things and their roles and relationships in text. It looks at words and their surroundings, diagramming sentences and phrases in much the same way as the human mind interprets them. It parses this content to extract facts, relationships and sentiment from this data.

For example, let’s look at the sentence “It was even a smoking room, but I could not smell anything.” An exhaustive extraction uses the linguistic structure of the sentence and automatically extracts both entities (such as the location “a smoking room” and the person “I”) and relationships or events, as interpreted in what is called a triple: I:smell[not]:anything.

[Figure: hospitality flowchart diagram]

This indicates a non-smelly room event that can be compared with other events that are found and analyzed much like structured data.

Likewise, “the room was clean, on a smoking floor, and smelled fresh” is properly coded as the room:smell:fresh. Again, this is correctly interpreted as a positive event.

Finally, in the example “the room was clean, but the hallway did smell of smoke”, the correct extractions are again made: the clean room and the smoky hallway are recorded as separate events.

By using this type of advanced linguistic understanding, which is done automatically and with no reliance on keywords, we identify and extract the true meaning of a customer’s comments.
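Once facts are extracted as triples, they can be stored and queried like structured data. The sketch below shows one possible representation. The `Triple` class, its field names and the `notation` method are hypothetical and are not part of any product's API. The two triples are the ones given in the examples above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Triple:
    """One extracted fact: subject, relation, optional object, modifiers."""
    subject: str
    relation: str
    obj: Optional[str] = None
    voices: Tuple[str, ...] = ()

    def notation(self) -> str:
        """Render the fact in the subject:relation[tag]:object style."""
        tags = "".join(f"[{v}]" for v in self.voices)
        parts = [self.subject, self.relation + tags]
        if self.obj is not None:
            parts.append(self.obj)
        return ":".join(parts)

events = [
    Triple("I", "smell", "anything", voices=("not",)),
    Triple("the room", "smell", "fresh"),
]

# Query the facts like rows in a table, e.g. every negated event.
negated = [t.notation() for t in events if "not" in t.voices]
print(negated)
```

Because the negation is stored as part of the fact, a report on smell complaints can exclude “I could not smell anything” without any keyword tuning.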

Voice Tags

Voice tags refer to additional information about an extraction that can change its meaning. The change can be subtle or extreme, and it provides customers with crucial insights into their data. There are seven different voice tags:

  • Question [?] voice indicates that the sentence from which the fact was extracted was in the form of a question.
  • How can I get free shipping with future orders?

free shipping : get [?]

  • Condition [if/then] voice indicates that the extracted fact depends on a condition. It can be used to find valuable customer service opportunities to mitigate circumstances and to persuade customers to stay loyal to the company.
  • I would shop much more frequently if you offered free shipping.

free shipping : offer [if/then]

  • The function of the intent [intent] voice is to depict people’s intentions or desires. This voice gives heightened insight into voice of the customer as it reveals what a person wants, threatens, or tries to do.
  • I plan to shop here often.

I : shop [again] [intent]

  • The function of the Negation [not] voice is literally to negate the meaning of the verb:
  • I didn’t find what I was looking for.

what I was looking for : find [not]

  • The Augment [more] voice detects emphasis and helps differentiate between degrees of sentiment.
  • Your selection is extremely limited

selection : limited [more]

  • The function of the recurrence [again] voice is to indicate that the action in the sentence has happened before, or is happening in an ongoing, recurring fashion.
  • Your prices are still high

price : high [again]

  • The indefinite [maybe] voice can be used to represent suggestions or requests.
  • I wish you would offer incentives like coupons.

incentive : offer [maybe]
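The tagged extractions above follow a regular `subject : predicate [tag]` layout, so they can be parsed into records and grouped by voice. The sketch below assumes that layout. The `parse_extraction` function and its dictionary keys are hypothetical names chosen for the example.

```python
import re
from collections import defaultdict

TAG_PATTERN = re.compile(r"\[([^\]]+)\]")

def parse_extraction(text):
    """Split 'subject : predicate [tag] [tag]' into its parts."""
    voices = TAG_PATTERN.findall(text)          # e.g. ['again', 'intent']
    core = TAG_PATTERN.sub("", text)            # drop the bracketed tags
    subject, _, predicate = core.partition(":")
    return {"subject": subject.strip(),
            "predicate": predicate.strip(),
            "voices": voices}

# The seven example extractions from the list above.
extractions = [
    "free shipping : get [?]",
    "free shipping : offer [if/then]",
    "I : shop [again] [intent]",
    "what I was looking for : find [not]",
    "selection: limited [more]",
    "price : high [again]",
    "incentive : offer [maybe]",
]

# Group subjects by voice tag so each group can be handled differently.
by_voice = defaultdict(list)
for record in map(parse_extraction, extractions):
    for voice in record["voices"]:
        by_voice[voice].append(record["subject"])

print(by_voice["again"])
print(by_voice["maybe"])
```

Grouping by voice is what would let a downstream system send, for example, `[maybe]` suggestions to product management and `[if/then]` statements to customer service.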

Returning to our previous retail example, the combination of technologies changes the results to look like this:

Actual sentence: “Electronics department could be cleaner”
Keyword extraction: Not detected because “cleaner” is not defined in the category set. This is a recall issue.
Text analytics (exhaustive extraction + voice tags): department (electronics): clean [maybe]

Actual sentence: “The electronics zone was dirty”
Keyword extraction: “Dirty” is extracted, but because “electronics zone” isn’t a keyword, this instance is missed. This is also a recall issue.
Text analytics (exhaustive extraction + voice tags): zone (electronics): dirty

Actual sentence: “electronics department clerk was rude; she had filthy hair and tattoos.”
Keyword extraction: Here, “filthy” could be erroneously associated with the electronics department as opposed to the electronics department clerk. This is a precision issue.
Text analytics (exhaustive extraction + voice tags):
    electronics department clerk:have: filthy hair
    electronics department clerk:rude
    electronics department clerk:have: tattoos

Actual sentence: “Allison helped me in the electronics department and she was fantastic, while the housewares department was HORRIBLE”
Keyword extraction: “Horrible” and “electronics department” both match, so the sentence is falsely categorized as an electronics department neatness issue. This is a precision issue for this task.
Text analytics (exhaustive extraction + voice tags):
    department (housewares) : horrible
    Allison : help
    Allison : fantastic

As you can see, a true exhaustive extraction system reveals a wealth of information contained within every customer communication or relevant social media interaction, without requiring the anticipation of every permutation and combination of words you want to analyze.

For example, a keyword-based system might tell you that there are a lot of customers talking about cupholders, but it isn’t sophisticated enough to also reveal, unaided, whether those customers are saying they want bigger cupholders, more cupholders, or that their cupholders are breaking.
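The difference can be illustrated with a handful of invented triples. A keyword count only reports how often “cupholder” is mentioned, while grouping the extracted facts shows what is being said. The data below is hypothetical and exists only to show the aggregation.

```python
from collections import Counter

# Hypothetical extracted facts as (subject, relation, object) triples.
facts = [
    ("cupholder", "want", "bigger"),
    ("cupholder", "want", "more"),
    ("cupholder", "break", None),
    ("cupholder", "want", "bigger"),
    ("cupholder", "break", None),
    ("cupholder", "break", None),
]

# What a keyword count can say: how often the topic comes up.
mention_count = sum(1 for subject, _, _ in facts if subject == "cupholder")

# What the triples add: a breakdown of what customers actually say.
breakdown = Counter((relation, obj)
                    for subject, relation, obj in facts
                    if subject == "cupholder")

print(mention_count)
print(breakdown.most_common())
```

In this made-up sample the six mentions split into three breakage reports, two requests for bigger cupholders and one request for more of them.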

Advanced text analysis software also employs sophisticated linguistic capabilities like anaphora resolution, which enables the system to automatically resolve a pronoun back to the noun it refers to. For example, given the sentence “The salesperson was really great; she helped me to understand the product,” it will correctly resolve that it was the salesperson who helped me to understand the product, and that I was the one who was having difficulty, not the salesperson.

These newer business intelligence systems should also be able to interpret colloquial language, emoticons and other “social speak” so prevalent in online discussions, with no additional effort on the part of the user — something that simple keyword-based systems cannot do.


The keyword-only data-mining model for classification can be flawed and time-consuming for true business decision-making. For any enterprise that needs to act on insights found in unstructured data, for example in customer service, it’s critical to use deep, automatic analytics to parse, process and classify text. Only the ability to understand the true meaning of, and relationships between, people, places, things and issues provides output accurate enough for companies to act on.

About the author

Catherine van Zuylen serves as the vice president of product marketing for Attensity, which provides text analytics solutions for Customer Experience Management. Attensity helps the world’s leading brands leverage multi-channel customer conversations as a business asset. Catherine has more than 15 years of experience in defining and implementing new positioning, product, and branding strategies. Catherine can be reached at cvanzuylen@attensity.com. She blogs at http://blog.attensity.com/ and tweets under @catevz.

Dashboard Insight © 2018