Computer Science And Technology - Research Publications
Permanent URI for this collection: https://kr.cup.edu.in/handle/32116/82
4 results
Search Results
Item Hate Speech and Offensive Language Detection in Twitter Data Using Machine Learning Classifiers (Springer Science and Business Media Deutschland GmbH, 2023-05-03) Shah, Seyed Muzaffar Ahmad; Singh, Satwinder
Social media is rapidly growing in popularity and has its advantages and disadvantages. Users posting their daily updates and opinions on social media may inadvertently hurt the feelings of others. Detecting hate speech and harmful information on social media is critical these days, lest it lead to calamity. In this research, machine learning classifiers such as Naïve Bayes, support vector machines, and logistic regression, along with the pre-trained models BERT and RoBERTa, developed by Google and Facebook, respectively, are used to detect hate speech and offensive content on a newly created dataset that included tweets and articles/blogs. The sentiments were obtained using the VADER sentiment analyzer. The results showed that the pre-trained classifiers outperformed the machine learning classifiers used in this study. BERT and RoBERTa achieved accuracy scores of 96% and 93%, respectively, on the tweet dataset, and 97% and 98%, respectively, on the dataset of articles/blogs, outperforming the other classifiers used in this work. Further, the results indicate that neutral content is shared more in articles/blogs, hate content is shared about equally in tweets and articles/blogs, and offensive content is shared more in tweets than in articles/blogs. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Item Comparison of Public and Critics Opinion About the Taliban Government Over Afghanistan Through Sentiment Analysis (Springer Science and Business Media Deutschland GmbH, 2023-05-03) Reza, Md Majid; Singh, Satwinder; Kundra, Harish; Reza, Md Rashid
The usage of social media has increased exponentially these days.
People worldwide are sharing their opinions on platforms such as Twitter, Facebook, and personal blogs. Twitter has grown in popularity as a platform for people to express their thoughts and opinions on many different topics. In this research work, Twitter data about the Taliban has been examined, and various machine learning algorithms have been applied, including SVM, LR, and random forest. Text sentiments have been captured via TextBlob. Among the machine learning models applied, SVM outperformed all other models with an accuracy score of around 94% on the tweet dataset, and logistic regression outperformed the other models with an accuracy score of 83% on the news article dataset. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Item Semi-supervised labeling: a proposed methodology for labeling the twitter datasets (Springer, 2022-01-28) Jan, Tabassum Gull; Khurana, Surinder Singh; Kumar, Munish
Twitter has nowadays become a trending microblogging and social media platform for news and discussions. The dramatic growth of the platform has also set off a dramatic increase in spam on it. Supervised machine learning always requires a labeled Twitter dataset, so it is desirable to design a semi-supervised labeling technique for labeling newly prepared recent datasets. Preparing a labeled dataset requires a lot of human effort. This issue has motivated us to propose an efficient approach for preparing labeled datasets so that time can be saved and human errors can be avoided. Our proposed approach relies on features that are readily available in real time, for better performance and wider applicability. This work aims at collecting the most recent tweets of a user using Twitter streaming and preparing a recent Twitter dataset.
Finally, a semi-supervised machine learning algorithm based on the self-training technique was designed for labeling the tweets. Semi-supervised support vector machine and semi-supervised decision tree classifiers were used as base classifiers in the self-training technique. Further, the authors have applied the K-means clustering algorithm to the tweets based on the tweet content. The proposed approach is an ensemble of semi-supervised and unsupervised learning, in which the semi-supervised algorithms were found to be more accurate in prediction than the unsupervised ones. To assign labels to the tweets effectively, the authors have implemented voting in this approach: the label predicted by the majority voting classifier is the label assigned to each tweet. A maximum accuracy of 99.0% has been reported in this paper using a majority voting classifier for spam labeling. © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Item Clustering of tweets: A novel approach to label the unlabelled tweets (Springer, 2020) Jan T.G.
Twitter is one of the fastest growing microblogging and online social networking sites, enabling users to send and receive messages in the form of tweets. Twitter is the trend of today for news analysis and discussions. That is why Twitter has become a main target of attackers and cybercriminals. These attackers not only hamper the security of Twitter but also destroy the trust people have in it, polluting the platform through misuse. Misuse can take the form of hurtful gossip, cyberbullying, cyber harassment, spam, pornographic content, identity theft, and common Web attacks such as phishing and malware downloading. The Twitter world is growing fast and is hence prone to spam, so there is a need for spam detection on Twitter. Spam detection using supervised algorithms depends entirely on a labelled Twitter dataset.
Labelling datasets manually is costly, time-consuming, and challenging. Also, older labelled datasets are no longer available because of Twitter's data publishing policies. So, there is a need to design an approach to label tweets as spam and non-spam in order to overcome the effect of spam drift. In this paper, we downloaded a recent Twitter dataset and prepared an unlabelled dataset of tweets from it. Later on, we applied the cluster-then-label approach to label the tweets as spam and non-spam. This labelled dataset can then be used for spam detection on Twitter and for categorizing different types of spam.
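
The cluster-then-label idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the paper's implementation: the two numeric features per tweet (URL count and hashtag count), the tiny hand-labelled seed set, the choice of k-means with k=2, and all function names are assumptions made only for illustration. The sketch clusters the unlabelled points, then gives every point in a cluster the majority label of the seed points that fell into that cluster.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; initial centroids are evenly spaced points (deterministic)."""
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments


def cluster_then_label(points, seed_labels, k=2):
    """Label every point with the majority seed label of its cluster.

    seed_labels maps a point index to a hand-assigned label (assumed input).
    Clusters that contain no seed point are labelled None.
    """
    assignments = kmeans(points, k)
    votes = {c: Counter() for c in range(k)}
    for index, label in seed_labels.items():
        votes[assignments[index]][label] += 1
    cluster_label = {
        c: (counter.most_common(1)[0][0] if counter else None)
        for c, counter in votes.items()
    }
    return [cluster_label[a] for a in assignments]


# Hypothetical features per tweet: (number of URLs, number of hashtags).
tweets = [(0, 1), (1, 0), (0, 0), (1, 1), (4, 6), (5, 7), (3, 6), (5, 5)]
seeds = {0: "non-spam", 5: "spam"}  # two hand-labelled examples
labels = cluster_then_label(tweets, seeds)
print(labels)
```

In practice the features, the clustering algorithm, and the rule for naming clusters would all need to follow the paper itself; the sketch only shows how a small amount of labelled data can be spread over clusters of unlabelled tweets.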