A New Algorithm for Clustering Web Pages Based on Links and Content
Author(s):
Abstract:
Among the vast number of web pages, users face two issues in accessing the resources they want: speed and accuracy. Both are important factors in user satisfaction with web services, and both call for an information retrieval tool that returns suitable responses. Developing an efficient search engine can therefore help attract customers and increase their satisfaction.
However, Web search engines often face a crucial problem: for vague queries, their results include highly diverse pages. This diversity makes it harder for search engines to choose the most relevant pages, and the results may be unsatisfactory from the user's perspective. In such a situation, discovering natural groupings of pages and finding their representatives helps the engine cover all admissible meanings of the user's query. Clustering is a well-known approach to this reduction, i.e., finding a few representatives among highly diverse Web pages.
In this paper, we focus on a pioneering algorithm and aim to improve both the quality of its responses and its execution speed. To do so, we propose generating the initial clusters with the well-known K-means algorithm, which could provide a suitable starting point. We also reformulate a time-consuming formula of the main algorithm by taking advantage of the properties of the link network. Furthermore, to increase the quality of the clustering, we formulate a set of significant variables that the main algorithm treats as constants. Experimental results on ground-truth datasets indicate that our algorithm performs about 30% better than the main algorithm in terms of both clustering quality and execution speed.
Moreover, as a case study, we run our algorithm on a dataset of Persian blogs, which we built by collecting the links and texts of a number of blogs. Running our algorithm on this dataset yields very good results in terms of the extracted clusters.
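The abstract states only that K-means supplies the initial clusters; it does not say how pages are represented or how K-means itself is seeded. The sketch below is therefore an illustration under stated assumptions, not the authors' implementation: each page is represented by its term counts concatenated with a 0/1 row of outgoing links, centroids are seeded with a farthest-point rule, and the names `page_features`, `link_weight`, and the toy data are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def page_features(term_counts, out_links, n_pages, link_weight=1.0):
    """Assumed representation: content term counts followed by a link row."""
    link_row = [link_weight if j in out_links else 0.0 for j in range(n_pages)]
    return [float(t) for t in term_counts] + link_row


def farthest_point_init(points, k):
    """Assumed seeding: start at the first point, then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points,
                  key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, iterations=20):
    """Plain K-means (Lloyd's iterations); returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data (hypothetical): six pages, two terms, two tightly linked groups.
term_counts = [[3, 0], [2, 0], [3, 1], [0, 3], [0, 2], [1, 3]]
out_links = [{1, 2}, {0, 2}, {0, 1}, {4, 5}, {3, 5}, {3, 4}]

points = [page_features(t, links, len(term_counts))
          for t, links in zip(term_counts, out_links)]
initial_labels = kmeans(points, k=2)
print(initial_labels)  # [0, 0, 0, 1, 1, 1]
```

The resulting labels would serve only as the starting partition that a link-and-content clustering algorithm then refines. How much weight to give links versus content (here the hypothetical `link_weight`) is a design choice the abstract does not specify.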
Keywords:
Language:
Persian
Published:
Industrial Engineering & Management Sharif, Volume:33 Issue: 1, 2017
Pages:
21 to 28
magiran.com/p1753181