The data
Twitter was chosen as the data source. It is one of the world’s major social media platforms, with 199 million active users in April 20214, and it is also a common source of text for sentiment analyses23,24,25.
To collect distance learning-related tweets, we used TrackMyHashtag https://www.trackmyhashtag.com/, a tracking tool that monitors hashtags in real time. Unlike the Twitter API, which does not provide tweets older than three weeks, TrackMyHashtag also provides historical data and allows filtering by language and geolocation.
For our study, we chose the Italian words for ‘distance learning’ as the search term and selected March 3, 2020 through November 23, 2021 as the period of interest. Finally, we restricted the selection to Italian-language tweets. A total of 25,100 tweets were collected for this study.
Data preprocessing
To clean the data and prepare it for sentiment analysis, we applied the following preprocessing steps, implemented in Python with standard NLP techniques (a minimal sketch follows the list):
1. removed mentions, URLs, and hashtags,
2. replaced HTML character entities with their Unicode equivalents (such as replacing ‘&amp;’ with ‘&’),
3. removed HTML tags (such as \(<div>\), \(<p>\), etc.),
4. removed unnecessary line breaks,
5. removed special characters and punctuation,
6. removed words that are numbers,
7. converted the Italian tweets’ text into English using the ‘googletrans’ tool.
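The sketch below illustrates these steps; the exact regular expressions and the `preprocess_tweet` helper are illustrative assumptions rather than the exact pipeline used:

```python
import re
from googletrans import Translator  # pip install googletrans==4.0.0rc1

translator = Translator()

def preprocess_tweet(text: str) -> str:
    """Steps 1-7: clean a raw Italian tweet and translate it to English."""
    text = re.sub(r'@\w+|https?://\S+|#\w+', ' ', text)  # 1. mentions, URLs, hashtags
    text = text.replace('&amp;', '&')                    # 2. HTML entities (one example)
    text = re.sub(r'<[^>]+>', ' ', text)                 # 3. HTML tags
    text = text.replace('\n', ' ')                       # 4. line breaks
    text = re.sub(r'[^\w\s]', ' ', text)                 # 5. special characters, punctuation
    text = re.sub(r'\b\d+\b', ' ', text)                 # 6. purely numeric tokens
    text = re.sub(r'\s+', ' ', text).strip()
    return translator.translate(text, src='it', dest='en').text  # 7. translate
```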
For the second part of the analysis, a higher-quality dataset is required for the topic model. Duplicate tweets were removed, and only unique tweets were retained. Beyond these general data-cleaning methods, tokenization and lemmatization enable the model to achieve better performance, since different inflected forms of the same word can cause misclassification. Consequently, the WordNet library of NLTK26 was used to accomplish lemmatization. Stemming algorithms, which aggressively reduce words to a common base even when those words have different meanings, were not considered here. Finally, we lowercased all of the text to ensure that every word appeared in a consistent format and pruned the vocabulary, removing stop words and terms unrelated to the topic, such as ‘as’, ‘from’, and ‘would’.
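A minimal sketch of this stage with NLTK follows; the `cleaned_tweets` list and the extra stop-word set are hypothetical placeholders:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ('punkt', 'wordnet', 'stopwords'):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english')) | {'would'}  # extend with corpus-specific terms

def tokenize_and_lemmatize(text: str) -> list:
    tokens = word_tokenize(text.lower())                 # tokenize the lowercased text
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]      # drop stop words and non-words

unique_tweets = list(dict.fromkeys(cleaned_tweets))      # remove duplicates, keep order
docs = [tokenize_and_lemmatize(t) for t in unique_tweets]
```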
Sentiment and emotion analysis
Among the major algorithms used for text mining, and specifically for sentiment analysis, we applied the Valence Aware Dictionary for Sentiment Reasoning (VADER) proposed by Hutto et al.27 to determine the polarity and intensity of the tweets. VADER is a sentiment lexicon and rule-based sentiment analysis tool built through a wisdom-of-the-crowd approach. Thanks to this extensive human validation, the tool performs sentiment analysis of social media text quickly, with accuracy close to that of human raters. We used VADER to obtain sentiment scores for each tweet’s preprocessed text and, following the classification method recommended by its authors, mapped the compound score into three categories: positive, negative, and neutral (Fig. 1, step 1).
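Under the thresholds recommended by VADER’s authors (compound score ≥ 0.05 → positive, ≤ −0.05 → negative, neutral otherwise), this mapping can be sketched as follows:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # pip install vaderSentiment

analyzer = SentimentIntensityAnalyzer()

def vader_category(text: str) -> str:
    """Map VADER's compound score to positive/negative/neutral."""
    compound = analyzer.polarity_scores(text)['compound']
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

print(vader_category('Distance learning works surprisingly well!'))  # positive
```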
Then, to discover the emotions underlying these categories, we applied the nrc28 algorithm, one of the methods included in the R package syuzhet29 for emotion analysis. In particular, the nrc algorithm applies an emotion dictionary to score each tweet on two sentiments (positive or negative) and eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust). Emotion recognition aims to identify the emotions that a tweet carries: if a tweet is associated with a particular emotion or sentiment, it scores points reflecting its degree of valence with respect to that category; otherwise, it has no score for that category. For example, if a tweet contains two words from the ‘joy’ word list, its score for the joy category is 2.
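The underlying counting scheme amounts to a bag-of-words lookup against the lexicon. The Python sketch below illustrates the scheme (the actual analysis used the R syuzhet package); the toy `EMOTION_LEXICON` is a hypothetical stand-in for the full NRC lexicon:

```python
# Bag-of-words emotion scoring in the style of the NRC lexicon.
# Each word maps to zero or more emotion/sentiment categories.
EMOTION_LEXICON = {
    'happy': ('joy', 'positive'),
    'love': ('joy', 'trust', 'positive'),
    'afraid': ('fear', 'negative'),
}

def nrc_style_scores(tokens):
    scores = {}
    for token in tokens:
        for category in EMOTION_LEXICON.get(token, ()):
            scores[category] = scores.get(category, 0) + 1
    return scores

print(nrc_style_scores(['happy', 'love', 'school']))
# {'joy': 2, 'positive': 2, 'trust': 1} -- two 'joy' words give a joy score of 2
```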
When using the nrc lexicon, each tweet obtains a score for each emotion category, rather than a single algebraic score from positive and negative words. However, this algorithm does not properly account for negators. Additionally, it adopts a bag-of-words approach, in which sentiment is based on the individual words occurring in the text, neglecting the role of syntax and grammar. Therefore, the VADER and nrc methods are not comparable in terms of the number of tweets and polarity categories. Hence, the idea is to use VADER for sentiment analysis and subsequently to apply nrc only to discover the emotions in positive and negative tweets. The flow chart in Fig. 1 represents this two-step sentiment analysis. VADER’s neutral tweets are useful for the classification but not of interest for the emotion analysis; we therefore focused on tweets with positive and negative sentiment. VADER performs particularly well on social media text: based on its complete rule set, it accounts for various lexical features, including punctuation, capitalization, degree modifiers, the contrastive conjunction ‘but’, and negation-flipping tri-grams.
The topic model
The topic model is an unsupervised machine learning method; that is, it is a text mining procedure with which the topics or themes of documents can be identified from a large document corpus30. The latent Dirichlet allocation (LDA) model is one of the most popular topic modeling methods; it is a probabilistic model for expressing a corpus based on a three-level hierarchical Bayesian model. The basic idea of LDA is that each document is a mixture of topics, and each topic is a distribution over words31. In particular, documents within a corpus are generated by the following process (a numerical sketch of the process follows the list):
1. A mixture of k topics, \(\theta\), is sampled from a Dirichlet prior, which is parameterized by \(\alpha\);
2. A topic \(z_n\) is sampled from the multinomial distribution \(p(\theta \mid \alpha)\), the document-topic distribution, which models \(p(z_n=i \mid \theta)\);
3. Fixing the number of topics \(k=1,\ldots,K\), the distribution of words for topic \(k\) is denoted by \(\phi\), which is also a multinomial distribution whose hyper-parameter \(\beta\) follows the Dirichlet distribution;
4. Given the topic \(z_n\), a word, \(w_n\), is then sampled via the multinomial distribution \(p(w \mid z_n;\beta)\).
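As a concrete illustration, this generative process can be simulated directly; the sketch below uses numpy, with illustrative values for the number of topics, vocabulary size, document length, and hyper-parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 1000, 12        # topics, vocabulary size, words in one document (illustrative)
alpha, beta = 0.1, 0.01      # Dirichlet hyper-parameters (illustrative)

phi = rng.dirichlet(np.full(V, beta), size=K)   # step 3: per-topic word distributions
theta = rng.dirichlet(np.full(K, alpha))        # step 1: topic mixture for one document
z = rng.choice(K, size=N, p=theta)              # step 2: a topic for each word position
w = [rng.choice(V, p=phi[z_n]) for z_n in z]    # step 4: word ids drawn from the chosen topics
```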
Overall, the probability of a document (or tweet, in our case) \(\mathbf{w}\) containing \(N\) words can be described as:
$$\begin{aligned} p(\mathbf{w})=\int_{\theta} p(\theta \mid \alpha )\left( \prod \limits _{n=1}^{N} \sum \limits _{z_n=1}^{k} p(w_n \mid z_n ;\beta )\, p(z_n \mid \theta ) \right) \mathrm{d}\theta \end{aligned}$$
(1)
Finally, the probability of the corpus of M documents \(D=\{\mathbf{w}_1,\ldots,\mathbf{w}_M\}\) can be expressed as the product of the marginal probabilities of each single document \(D_m\), as shown in (2).
$$\begin{aligned} p(D) = \prod \limits _{m=1}^{M} \int_{\theta} p(\theta_m \mid \alpha )\left( \prod \limits _{n=1}^{N_m} \sum \limits _{z_n=1}^{k} p(w_{m,n} \mid z_{m,n} ;\beta )\, p(z_{m,n} \mid \theta_m ) \right) \mathrm{d}\theta_m \end{aligned}$$
(2)
Our analysis includes tweets spanning almost two years, and the tweet content changes over time; the topic content is therefore not a static corpus. We thus adopt the dynamic LDA model (DLDA), in which topics are aggregated in time epochs and a state-space model handles the transitions of topics from one epoch to another. A Gaussian probabilistic model is added as an additional dimension to obtain the posterior probabilities of the evolving topics along the timeline.
Figure 2 shows a graphical representation of the dynamic topic model (DTM)32. As a member of the probabilistic topic model class, the dynamic model can explain how the various tweet themes evolve. The tweet dataset used here (March 3, 2020 to November 23, 2021) covers 630 days, which is exactly seven quarters of a year. The dynamic topic model is accordingly applied with seven time steps corresponding to the seven quarters of the dataset. These time slices are passed to the model implementation provided by gensim33.
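In gensim this corresponds to the `LdaSeqModel` class. A minimal sketch follows, reusing the `docs` token lists from the preprocessing stage; the slice sizes and topic count are illustrative assumptions:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel

dictionary = Dictionary(docs)                        # docs: chronologically ordered token lists
corpus = [dictionary.doc2bow(doc) for doc in docs]

# One slice per quarter; the counts are illustrative and must sum to len(docs).
time_slice = [3000, 3000, 3000, 3000, 3000, 3000, 2900]

dtm = LdaSeqModel(corpus=corpus, id2word=dictionary,
                  time_slice=time_slice, num_topics=8)  # num_topics is an assumption
print(dtm.print_topics(time=0))                      # top words per topic in the first quarter
```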
An essential challenge in DLDA (as in LDA) is to determine an appropriate number of topics. Röder et al. proposed coherence scores to evaluate the quality of each topic model. In particular, topic coherence is the measure used to evaluate the coherence between topics inferred by a model. As coherence measures, we used \(C_v\) and \(C_{umass}\). The first is a measure based on a sliding window that uses normalized pointwise mutual information (NPMI) and cosine similarity. \(C_{umass}\), instead, is based on document co-occurrence counts, a one-preceding segmentation, and a logarithmic conditional probability as the confirmation measure. These values aim to emulate the relative score that a human is likely to assign to a topic and indicate how much the topic words ‘make sense’; they capture the cohesiveness of the ‘top’ words within a given topic. We also considered the distribution of topics in the two-dimensional space produced by principal component analysis (PCA), which can visualize the topic models in a word spatial distribution; a uniform distribution is preferred, as it gives a high degree of independence to each topic. A good model is therefore judged by a higher coherence and an even spread of topics in the PCA projection displayed by pyLDAvis34.
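A sketch of how such a comparison might be run with gensim’s `CoherenceModel` follows; the candidate range of topic numbers is an assumption, and in practice the static LDA models used for selection would be fit on the full corpus or per time span:

```python
from gensim.models import CoherenceModel, LdaModel

# Fit standard LDA models over a range of topic numbers and compare coherence scores.
for k in range(4, 13):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=0)
    c_v = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                         coherence='c_v').get_coherence()
    u_mass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                            coherence='u_mass').get_coherence()
    print(f'k={k}  C_v={c_v:.3f}  C_umass={u_mass:.3f}')

# pyLDAvis.gensim_models.prepare(lda, corpus, dictionary) then renders the
# intertopic distance map used to check how evenly the topics are spread.
```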