For nearly two years, I’ve been researching alternative approaches to analyzing social media “chatter.” Frankly, as the social media movement started gaining momentum more than five years ago, many of us knew the challenge ahead. And I’ll be honest when I say that I’m still not sure there’s a good answer out there.
Discourse analysis, sentiment analysis, text analytics: whatever you call it, you’re looking at a huge volume of information. Unstructured information.
When CRM started in the 1990s, the challenge was to link customers to behaviors and make recommendations that would keep them happy and buying more. Think of the Amazon “recommendation” feature: “people who looked at this, bought this” or “if you liked this, may we recommend…” Amazon was able to “mine” purchase data to make better shopper suggestions.
Text mining is far more complicated. The longer a blog post, Facebook comment, or product review, the more difficult it is to categorize. Twitter posts and other microblogs are, in theory, the easiest to categorize because they’re so short.
How It’s Done. Comments are filtered based on keywords (e.g., brand name). Next, as I understand it, using proprietary algorithms (depending on the software), comments are grouped into three buckets: positive, negative, and neutral.
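To make the two steps above concrete, here is a minimal sketch in Python. It is not how any commercial package works (those algorithms are proprietary); it assumes a naive word-list approach, and the word lists, the function names, and the “Acme” brand are all made up for illustration.

```python
import re

# Hypothetical word lists, invented for this sketch.
# Real products use proprietary and far richer models.
POSITIVE = {"love", "great", "good", "awesome", "happy"}
NEGATIVE = {"hate", "bad", "awful", "terrible", "broken"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def mentions_brand(comment, keyword):
    """Step 1: keep only comments that contain the keyword as a word."""
    return keyword.lower() in tokenize(comment)


def classify(comment):
    """Step 2: count positive and negative words and compare the totals."""
    words = tokenize(comment)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def bucket_comments(comments, keyword):
    """Filter by keyword, then sort the survivors into three buckets."""
    buckets = {"positive": [], "negative": [], "neutral": []}
    for comment in comments:
        if mentions_brand(comment, keyword):
            buckets[classify(comment)].append(comment)
    return buckets


comments = [
    "I love my Acme phone",
    "Acme support is terrible",
    "Acme released a phone, which I love, but the battery is awful",
    "Nothing to do with the brand",
]
buckets = bucket_comments(comments, "Acme")
```

Even this toy version hints at the problems: a comment with one positive and one negative word cancels out to neutral, and a sarcastic “Oh great, my Acme phone died again” would be scored as positive.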
The longer the post, the more likely the comment is to be classified as neutral. This isn’t surprising. Think of the complexity in nuance, sarcasm, slang, humor and differences in word usage across different cultures speaking the same language. (Anyone watch the Emmys Sunday night and see Ricky Gervais? He was laughing at Bucky Gunts’ name and it took me 10 minutes to figure out why. It’s so bad I can’t repeat it here!)
I’ve seen statistics showing that between 65% and 80% of all comments are tagged as neutral. The number is somewhat lower among the microblogs, maybe only 40%. Nevertheless, it’s obvious that huge biases can be introduced using this approach.
Elsewhere I’ve read that humans aren’t very accurate either. Using Mechanical Turk (which isn’t something I’d personally recommend), humans agree only 79% of the time on how items should be coded.
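For anyone wondering what a figure like that measures, the simplest version is percent agreement: the share of items that two coders labeled the same way. The sketch below assumes that simple definition (I don’t know exactly how the 79% was calculated), and the labels in it are invented.

```python
def percent_agreement(coder_a, coder_b):
    """Return the share of items on which two coders gave the same label."""
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must label the same number of items")
    if not coder_a:
        raise ValueError("Need at least one coded item")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


# Invented labels for five comments, one list per human coder.
coder_a = ["positive", "neutral", "negative", "neutral", "positive"]
coder_b = ["positive", "neutral", "negative", "positive", "positive"]

agreement = percent_agreement(coder_a, coder_b)  # the coders match on 4 of 5 items
```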
We should be moving forward in this discipline. Tom Anderson, of Anderson Analytics, agrees. In correspondence with him several months back, Tom said that he was using three different software packages and gained confidence based on the intersection of their joint findings. I’m not sure if more data is successfully coded or if this just provides more confidence in what he reports as findings to his clients. I think this is very responsible, given the accuracy of single-solution approaches today.
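One way to read “intersection” is to keep only the comments that every package labels the same way. The sketch below assumes that reading; it is my own illustration, not a description of Tom’s actual process, and the tool outputs and comment IDs are hypothetical.

```python
def consensus_labels(*tool_outputs):
    """Keep only comments that every tool coded, and coded identically.

    Each argument is a dict mapping a comment ID to that tool's label.
    """
    if not tool_outputs:
        return {}
    # Comment IDs that appear in every tool's output.
    shared_ids = set(tool_outputs[0]).intersection(*tool_outputs[1:])
    agreed = {}
    for comment_id in shared_ids:
        labels = {output[comment_id] for output in tool_outputs}
        if len(labels) == 1:  # all tools gave the same label
            agreed[comment_id] = labels.pop()
    return agreed


# Hypothetical output from three tools, keyed by comment ID.
tool_a = {1: "positive", 2: "neutral", 3: "negative"}
tool_b = {1: "positive", 2: "negative", 3: "negative"}
tool_c = {1: "positive", 2: "neutral", 3: "negative", 4: "positive"}

agreed = consensus_labels(tool_a, tool_b, tool_c)
```

Under this reading, the consensus set can only be as large as the smallest tool’s output, so the gain would be in confidence rather than in the amount of data coded.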
Currently, I’m leaning in the direction of relying more on Market Research Online Communities (or Insight Communities), which straddle the space between pure discourse analysis and more structured qualitative or quantitative marketing research. I view them as a hybrid, where I can get out of the way and follow the discussion. With only 200-500 members, keeping up and staying accurate is more manageable. Further, when custom research is needed, the recruit is a snap.

And sometimes we don’t need to build communities. One may already exist because the product or person built their following online. In this case, the fan group may be the “go to” community because it’s larger, more representative and more cost effective than creating a new community. Think Justin Bieber, who was “discovered” through his YouTube videos. Those subscribing to his stream, particularly in the early days, would have been a great insight-mining resource.
I’ll be keeping a close eye on both these topics as they appeal to both my quant and qual sides. And if you have any thoughts, I’d love to hear and learn!
Some reading:
The best of the articles I read, BrandSavant: http://brandsavant.com/the-hidden-bias-of-social-media-sentiment-analysis/
The August 2010 issue of Quirk’s has several articles on social media research
Mashable: http://mashable.com/2010/04/19/sentiment-analysis/
