How Data Mining Visualizes Story Lines in the Twittersphere

Source: discovermagazine.com

One curious side-effect of the work to digitize books and historical texts is the ability to search these databases for words and to see when they first appeared and how their frequency of use has changed over time.

The Google Books n-gram corpus is a good example (an n-gram is a sequence of n words). Enter a word or phrase and it’ll show you its relative usage frequency since 1800. For example, the word “Frankenstein” first appeared in the late 1810s and has grown in popularity ever since.
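To make the definition of an n-gram concrete, here is a minimal sketch in Python. It assumes the simplest possible tokenization (lowercasing and splitting on whitespace); the function name `ngrams` and the sample sentences are illustrative only and do not reflect how Google Books actually processes text.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive words in text, as tuples.

    Tokenization is an assumption: lowercase, split on whitespace.
    """
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


# 1-grams, 2-grams and 3-grams of a short illustrative sentence
sentence = "the monster that Frankenstein made"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

# Counting how often each 2-gram occurs in a text
bigram_counts = Counter(ngrams("harry potter and harry potter again", 2))
```

A five-word sentence yields five 1-grams, four 2-grams and three 3-grams, which is why a corpus of tweets contains far more n-grams than messages.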

By contrast, the phrase “Harry Potter” appeared in the late 1990s and gained quickly in popularity but never overtook Frankenstein, or Dracula for that matter. That may be something of a surprise given the unprecedented global popularity of J.K. Rowling’s teenage wizard.

And therein lies the problem with a database founded on an old-fashioned, paper-based technology. The Google Books corpus records “Harry Potter” once for each novel, article and text in which it appears, not for the millions of times it is printed and sold. There is no way to account for this level of fame or how it leaves others in the shade.

Today that changes, thanks to the work of Thayer Alshaabi at the Computational Story Lab at the University of Vermont and a number of colleagues. This team has created a searchable database of over 100 billion tweets in more than 150 languages containing over a trillion 1-grams, 2-grams and 3-grams. That’s about 10 per cent of all Twitter messages since September 2008.

Data Visualization

The team has also developed a data visualization tool called Storywrangler that reveals the popularity of any words or phrases based on the number of times they have been tweeted and retweeted. The database shows how this popularity waxes and wanes over time.
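The idea of measuring popularity by how often a phrase is tweeted and retweeted can be sketched as follows. This is an illustration under stated assumptions, not Storywrangler's actual pipeline: the record layout `(date, text, retweets)`, the sample messages, and the helper names `daily_counts` and `rank_of` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical records: (date, tweet text, number of retweets).
# The values are invented for illustration only.
tweets = [
    ("2011-07-15", "harry potter premiere tonight", 3),
    ("2011-07-15", "watching dracula", 0),
    ("2011-07-16", "harry potter was great", 1),
]


def daily_counts(records, n=1):
    """Count n-grams per day, weighting each tweet by 1 + its retweets."""
    counts = defaultdict(Counter)
    for date, text, retweets in records:
        words = text.lower().split()
        weight = 1 + retweets  # the original tweet plus every retweet
        for i in range(len(words) - n + 1):
            counts[date][" ".join(words[i:i + n])] += weight
    return counts


def rank_of(counts_for_day, term):
    """Rank of term on one day (1 = most used); None if it never appears.

    Ties share a rank: the rank is 1 plus the number of terms that
    were used strictly more often.
    """
    if term not in counts_for_day:
        return None
    own = counts_for_day[term]
    return 1 + sum(1 for c in counts_for_day.values() if c > own)


bigram_counts = daily_counts(tweets, n=2)
potter_rank = rank_of(bigram_counts["2011-07-15"], "harry potter")
```

Computing a rank for each day and plotting it against the date gives the kind of time series that shows a phrase's popularity waxing and waning.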

“In building Storywrangler, our primary goal has been to curate and share a rich, language-based ecology of interconnected n-gram time series derived from Twitter,” say Alshaabi and co.

Storywrangler immediately reveals the “story” associated with a wide range of events, individuals and phenomena. For example, it shows the annual popularity of words associated with religious festivals such as Christmas and Easter. It tells how phrases associated with new films burst into the Twittersphere and then fade away, while TV series tend to live on, at least throughout the series’ lifetime. And it reveals the emergence of politico-social movements such as Brexit, Occupy, #MeToo and Black Lives Matter.

The storylines can also be compared with other databases to provide more fine-grained insight and analysis. For example, the popularity of film titles on Twitter can be compared with the film’s takings at the box office; the emergence of words associated with disease can be compared with the number of infections recorded by official sources; and words associated with political unrest can be compared with incidents of civil disobedience.
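One simple way to make such a comparison is a correlation coefficient between two time series of equal length. The sketch below computes a Pearson correlation by hand; the choice of Pearson correlation is an assumption made for illustration, not a method attributed to the Storywrangler team, and the numbers are made up rather than real Twitter or box-office data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Invented weekly figures, for illustration only
title_mentions = [120, 900, 640, 310, 150]    # tweets naming a film title
box_office = [0.0, 42.0, 30.5, 14.0, 6.5]     # takings, in arbitrary units

r = pearson(title_mentions, box_office)
```

A value of `r` near 1 would mean the two series rise and fall together, a value near -1 that they move in opposite directions, and a value near 0 that there is no linear relationship. Correlation alone says nothing about which series drives the other.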

That’s useful because this kind of analysis provides a new way to study society, potentially with predictive results. Indeed, computer scientists have long suggested that social media can be used to predict the future.

Cultural Significance

These storylines have social and cultural significance too. “Our collective memory lies in our recordings — in our written texts, artworks, photographs, audio and video — and in our retellings and reinterpretations of that which becomes history,” say Alshaabi and colleagues.

Now anyone can study it with Storywrangler. Try it; it’s interesting.

As for Harry Potter, Frankenstein and Dracula, the tale that Storywrangler tells is different from the one told by the Google Books n-gram corpus. On Twitter, Harry Potter is significantly more popular than his grim-faced predecessors and always has been. In 2011, Harry Potter was the 44th most popular term on Twitter, while Dracula has never risen higher than 2653rd. Frankenstein’s best rank is 3560th.
