An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Article Type:
Research/Original Article (holds a valid ranking)
Abstract:

Due to the exponential growth of electronic texts, organizing and managing them requires tools that deliver the information users are searching for in the shortest possible time. Classification methods have therefore become very important in recent years. In natural language processing, and especially in text processing, automatic text classification is one of the most basic tasks, and it is also one of the most important parts of data mining and machine learning. Classification can be regarded as the most important supervised technique: it partitions the input space into k groups based on similarity and difference, such that targets in the same group are similar and targets in different groups are different. Text classification systems have been widely used in many fields, such as spam filtering, news classification, web page detection, bioinformatics, machine translation, automatic response systems, and the automatic organization of documents.
The key to an efficient text classification method is the extraction and selection of the key features of texts. It has been shown that only 33% of the words and features of a text are useful for extracting information; most words in a text serve to convey its purpose and are sometimes repeated. Feature selection is known to be a good solution to the high dimensionality of the feature space. An excessive number of features not only increases computation time but also degrades classification accuracy. In general, the purpose of extracting and selecting features is to reduce data volume, training time, and computational time, and to increase the speed of the text classification method. Feature extraction refers to generating a small set of new features by combining or transforming the original ones, whereas feature selection reduces the dimension of the space by selecting the most prominent features.
In this paper, a solution for improving the support vector machine (SVM) algorithm using the imperialist competitive algorithm (ICA) is provided. In the proposed method, ICA is used for selecting features and SVM is used for classifying the texts. At the feature extraction stage, weighting schemes such as NORMTF, LOGTF, ITF, SPARCK, and TF assign a weight to each extracted word in order to determine its role as a keyword of the text. The weight of each word indicates the extent of its effect on the main topic of the text compared to the other words in the same text. The proposed method uses the TF weighting scheme. In this scheme, the features are a function of the distribution of different features in each document. Moreover, at this stage, low-frequency features, that is, words that occur fewer than two times in the text, are pruned. Pruning basically filters low-frequency features in a text [18]. To reduce the number of feature dimensions and decrease computational complexity, the proposed method employs ICA. The main goal of employing ICA is to minimize the loss of information in the texts while maximizing the reduction of the feature dimensions. Because ICA is used for selecting the features, a mapping must be created between the parameters of ICA and the proposed method.
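The weighting and pruning step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that TF means the raw count of a word within one document, and the function name `tf_features` and parameter `min_count` are hypothetical.

```python
from collections import Counter


def tf_features(tokens, min_count=2):
    """Return raw term-frequency weights for one document.

    Assumption: TF is the plain occurrence count of a word in the document.
    Words seen fewer than `min_count` times are pruned.
    """
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n >= min_count}


# Toy document: only "oil" and "prices" occur at least twice.
doc = "oil prices rise as oil exports fall and prices stay high".split()
weights = tf_features(doc)
print(weights)
```

With `min_count=2`, every word that appears only once is dropped, which mirrors the pruning rule stated in the abstract.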
Accordingly, when ICA is used for selecting the key features, the search space consists of the feature dimensions, and among all the extracted features, , , or of all the features are attributed to each of the countries. Since the mapping is carried out randomly, a country may contain repeated features. Next, following the general procedure of ICA, the more powerful countries are considered imperialists, while the other countries are considered colonies. Once the countries are identified, the optimization process can begin. Each country is defined as an array with different values for the variables, as in Equations 2 and 3.
(2) Country = [ , , …, , ]
(3) Cost = f (Country)
The variables attributed to each country can be structural features, lexical features, semantic features, the weight of each word, and so on. Accordingly, the power of each country for identifying the class of each text increases or decreases based on its variables. One of the most important phases of ICA is the colonial competition phase. In this phase, all the imperialists try to increase the number of colonies they own. Each of the more powerful empires tries to seize the colonies of the weakest empires to increase its own power. In the proposed method, colonies with the highest number of classification errors and the highest number of features are considered the weakest empires. Based on trial and error, and considering the target function in the proposed method, the number of key features relevant to the main topic of the texts is set to of the total extracted features, and using only of the key features of each text along with a classifier such as , support vector machine (SVM), nearest neighbors, and so on, the class of that text can be determined.
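The mapping between countries and feature subsets can be illustrated with a short sketch. Everything specific here is an assumption made for illustration: the cost function (a weighted sum of classification error and the share of features kept), the weight `alpha`, the toy error function, and all function names. The abstract only states that countries are arrays of randomly assigned features and that stronger countries become imperialists.

```python
import random


def make_country(n_features, n_selected, rng):
    """A country is an array of feature indices drawn at random.

    Sampling with replacement means a feature may repeat within a country.
    """
    return [rng.randrange(n_features) for _ in range(n_selected)]


def cost(country, error_rate, n_features, alpha=0.9):
    """Hypothetical cost: classification error plus a penalty for the
    share of distinct features kept. Lower cost means a stronger country."""
    kept = len(set(country)) / n_features
    return alpha * error_rate(country) + (1 - alpha) * kept


def split_empires(countries, costs, n_imperialists):
    """The lowest-cost countries become imperialists; the rest are colonies."""
    order = sorted(range(len(countries)), key=lambda i: costs[i])
    imperialists = [countries[i] for i in order[:n_imperialists]]
    colonies = [countries[i] for i in order[n_imperialists:]]
    return imperialists, colonies


def toy_error(country):
    """Stand-in for a classifier's error rate: features 0-4 are informative."""
    return 1.0 - len(set(country) & set(range(5))) / 5


rng = random.Random(0)
countries = [make_country(20, 6, rng) for _ in range(8)]
costs = [cost(c, toy_error, 20) for c in countries]
imperialists, colonies = split_empires(countries, costs, 2)
```

In a full ICA run, `toy_error` would be replaced by the error of a classifier trained on the selected features, and the assimilation and colonial competition phases would then move colonies between empires.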
Since the classification of texts is a nonlinear problem, the problem must first be mapped into a linear one before the texts can be classified. In this paper, the RBF kernel function along with is used for mapping the problem. The hybrid algorithm is implemented on the Reuters21578, WebKB, and Cade 12 data sets to evaluate the accuracy of the proposed method. The simulation results indicate that the proposed hybrid algorithm outperforms the basic support vector machine in terms of precision, recall, and F-measure.
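As a rough illustration of the kernel and the evaluation criteria named above, the sketch below computes the standard RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), and the precision, recall, and F-measure for a binary labeling. The value of `gamma`, the function names, and the toy labels are assumptions; the abstract does not report the kernel parameters.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F-measure for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart, which is what lets an SVM separate classes that are not linearly separable in the original feature space.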

Language:
Persian
Published:
Signal and Data Processing, Volume:17 Issue: 1, 2020
Pages:
117 to 130
https://www.magiran.com/p2143577  