Detection of content-based cybercrime in Roman Kashmiri using ensemble learning

dc.contributor.author: Farooq, Umar
dc.contributor.author: Singh, Parvinder
dc.contributor.author: Khurana, Surinder Singh
dc.contributor.author: Kumar, Munish
dc.date.accessioned: 2024-01-21T10:48:42Z
dc.date.accessioned: 2024-08-14T05:05:36Z
dc.date.available: 2024-01-21T10:48:42Z
dc.date.available: 2024-08-14T05:05:36Z
dc.date.issued: 2023-09-25T00:00:00
dc.description.abstract: Kashmiri (Koshur), the official language of Kashmir, is spoken by more than 7 million people, yet content-based cybercrime detection for it remains unexplored in theoretical and experimental research. Furthermore, the absence of programming libraries for sentiment analysis and of a benchmark corpus has impeded advancements in this field. Challenges persist in working with the diverse scripts of Kashmiri, including Perso-Arabic, Sharada, Devanagari, and Roman. Detecting cybercrime in this language is difficult because of its complex morphology, lack of resources, scarcity of annotated datasets, and varied linguistic characteristics, and these obstacles must be overcome to develop effective detection systems. This paper attempts to detect content-based cybercrime in Roman Kashmiri script, which the Kashmiri community uses extensively on online platforms such as social media, chat rooms, and email. A well-balanced and meaningful dataset, the first of its kind in this context, was compiled, incorporating positive and negative comments, and three strategies were employed for analysis. The findings reveal that the Tf-Idf Vectorizer outperforms the other tokenization methods (Count Vectorizer and Tf-Idf Transformer), that bi-gram notation performs better than uni-gram and tri-gram notations, and that XGBM is the most effective classifier in terms of the evaluation metrics. Leveraging these strategies, Python applications were developed for text classification, successfully distinguishing cyberbullying (unsafe) from non-cyberbullying (safe) instances, with XGBM exhibiting exceptional accuracy using the Tf-Idf Vectorizer with bi-gram, a Bag of Words, and lexical features. This pioneering research underscores the urgent need for advances in content-based cybercrime detection in the Kashmiri language, paving the way for effective detection systems that address language-specific challenges and promote a safer online environment for the Kashmiri community. It also opens new avenues for detecting and preventing cybercrime in Kashmiri and potentially in other languages lacking robust cybercrime detection methodologies. © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature. [en_US]
dc.identifier.doi: 10.1007/s11042-023-16678-y
dc.identifier.issn: 13807501
dc.identifier.uri: https://kr.cup.edu.in/handle/32116/3928
dc.identifier.url: https://link.springer.com/10.1007/s11042-023-16678-y
dc.language.iso: en_US [en_US]
dc.publisher: Springer [en_US]
dc.subject: Cyberbullying [en_US]
dc.subject: Ensemble learning [en_US]
dc.subject: Kashmiri language [en_US]
dc.subject: Koshur [en_US]
dc.subject: Lexical features [en_US]
dc.subject: LGBM [en_US]
dc.subject: n-gram [en_US]
dc.subject: XGBM [en_US]
dc.title: Detection of content-based cybercrime in Roman Kashmiri using ensemble learning [en_US]
dc.title.journal: Multimedia Tools and Applications [en_US]
dc.type: Article [en_US]
dc.type.accesstype: Closed Access [en_US]
