Magiran | Publications by author: Seyed Mostafa Fakhrahmad

ارتقاء و اصلاح فرایندهای رایج در بازشناسی نوری حروف متون فارسی با بکارگیری ویژگی های خط فارسی و الگوریتم انتقال فضا

آرش زارعیان، طیبه موسوی میانگاه*، بلقیس روشن، سید مصطفی فخر احمد

مجله جستار های زبانی، سال چهاردهم شماره 2 (پیاپی 74، خرداد و تیر 1402)، صص 363 -400

از آنجا که فن آوری بازشناسی نوری حروف اصالتا بر پایه ویژگی های خطی لاتین بنا شده است، تقریبا کلیه الگوریتم ها و مراحل مورد استفاده در نظام های رایج بازشناسی حروف فارسی نیز بر اساس همان ساختار و ویژگی های خطوط لاتین گسترش یافته اند. بکارگیری ابزار و ویژگی های خطوط لاتین در طراحی نظام های فارسی محور، نه تنها در نهایت به انجام بازشناسی صحیح حروف فارسی منجر نگردیده است، بلکه باعث سردرگمی همزمان نرم افزار و کاربر فارسی زبان نیز شده است. از اینرو، در اینجا، پس از مقدمه ای کوتاه پیرامون اهمیت خط و زبان در حوزه فن آوری اطلاعات به سیر تحول خط فارسی در دوره های مختلف و شرح ویژگی های این خط و تفاوت های آن با خطوط دیگر پرداخته شده است و عناصر شکلی این خط، با توجه به کاربرد و اهمیت آنها در تعامل کاربر با نرم افزارهای بازشناسی نوری متون فارسی، طیقه بندی گردیده است. در این بخش، با توصیف و تحلیل مراحل بازشناسی حروف بر اساس ویژگی های خط فارسی و شرح تفاوتهای آن با گونه های لاتین محور موجود، چهره ای متفاوت از دستگاه خط فارسی به هنگام کار با رایانه ها و به ویژه در سیستم های بازشناسی نوری حروف عرضه می شود بطوری که مخاطب عملا قابلیت و ظرفیت های دستگاه خط فارسی در هماوردی با دستگاه ساده خط لاتین را مشاهده خواهد نمود. با اتکا به همین ویژگی ها، در جهت ارتقاء و اصلاح الگوریتم های رایج در بازشناسی نوری حروف فارسی، تسهیل بکارگیری الگوها، و تعدیل حجم پایگاه داده ها، از فرایند انتقال هندسی فضای دو بعدی به تک بعدی نیز بهره جسته ایم.

کلید واژگان: بازشناسی نوری حروف, ا.سی.آر, الگوریتم انتقال فضا نظام, نگارشی زبان فارسی, ویژگی های خطی فارسی

Correction and Improvement of the Common Processes in Optical Character Recognition (OCR) of Persian Texts: Using the Features of the Persian Script and a Dimension Transference Algorithm

Arash Zareian, Tayebeh Mosavi Miangah*, Belghis Rovshan, Seyyed Mostafa Fakhr Ahmad

Language Related Research, Volume:14 Issue: 2, 2023, PP 363 -400

Since optical character recognition technology was originally built on the Latin script, almost all the algorithms and processes used in Persian OCR systems are constructed upon the structure and scriptological features of the Latin alphabet. Using the tools and features of the Latin script to design Persian-based OCR systems, however, has not only failed to produce accurate recognition of Persian characters but has also confused both Persian-speaking users and the systems themselves. Through a step-by-step discussion and analysis of the processes involved in optical character recognition, based on the scriptological features of the Persian script, this paper pinpoints the deficiencies of current Latin-based OCR systems and presents a different view of the Persian writing system as used in computer software, especially OCR systems, so that the reader can see the potential and capabilities of this complex script in contrast to the simpler Latin writing system. Finally, in order to upgrade and improve the current algorithms employed in Persian OCR systems, a geometrical process of transferring two-dimensional specifications into one-dimensional ones is utilized. The proposed algorithm, which is based on the scriptological features of the Persian script, simultaneously makes patterns easier to manipulate, reduces the size of the database, and speeds up data processing.

Keywords: Optical character recognition, OCR, Computational linguistics, Scribal features, Persian writing system

تاثیر کمبود و پراکندگی داده بر اثربخشی نتایج سامانه ژورنال یاب رایسست: مطالعه موردی حوزه فنی و مهندسی

نرجس ورع، مهدیه میرزابیگی*، هاجر ستوده، سید مصطفی فخراحمد، نیلوفر مظفری

پژوهشنامه پردازش و مدیریت اطلاعات، سال سی و هفتم شماره 4 (پیاپی 110، تابستان 1401)، صص 1293 -1317

عوامل متعددی از مجموعه عناصر تشکیل دهنده سامانه های پیشنهاددهنده در تولید و ارایه پیشنهاد دخیل هستند. مطالعه حاضر، با هدف شناخت تاثیر دو چالش کمبود و پراکندگی داده بر اثربخشی نتایج پیشنهادی سامانه ژورنال یاب رایسست انجام شده است. بدین منظور بیش از 15000 مقاله از نشریه های فنی و مهندسی در بازه زمانی 1392 تا 1396 از وب سایت نشریه ها گرداوری شد. در مرحله بعد عناصر متنی این مقاله ها شامل عنوان، چکیده و واژه های کلیدی استخراج، نرمال ‏سازی و پردازش شد و پایگاه داده پیکره پژوهش ایجاد گردید. بر اساس تعداد مقاله های گردآوری شده، با استفاده از فرمول کوکران تعداد 400 مقاله پایه که پیش از این در نشریه های مرتبط با موضوع منتشر شده بودند، به روش تصادفی- تناسبی، انتخاب شد. عنوان و چکیده این مقاله ها، به منظور دریافت نشریه های پیشنهادی سامانه، جهت چاپ مقاله در دو مرحله پیش و پس از بهبود دو چالش کمبود و پراکندگی داده به عنوان پرسمان وارد سامانه شد. سپس نتایج پیشنهادی در هر مرحله در قالب فایل اکسل ذخیره گردید. در نهایت میزان اثربخشی نتایج سامانه در هر مرحله، به روش اعتبارسنجی یک طرفه و بر اساس معیار دقت در k تعیین شد. فراوانی نسبی رده ها نشان داد در وضعیت موجود، نشریه هدف تنها در 26 درصد از پرسمان ها در 3 رتبه نخست پیشنهاد شده است. در راستای بهبود چالش کمبود داده با غنی سازی، نرمال سازی و پردازش داده ها اثربخشی نتایج در 3 رتبه نخست به میزان 15 درصد افزایش یافت. اما همچنان در بیش از 30 درصد پرسمان ها، نشریه هدف در رتبه 10 و بالاتر پیشنهاد شده بود. بنابراین در مرحله بعد به منظور بهبود چالش پراکندگی، دسته بندی موضوعی داده ها انجام و افزایش 30 درصدی اثربخشی نتایج نسبت به مرحله پیشین در 3 رتبه نخست حاصل گردید. بر این اساس یکی از عواملی که منجر به کاهش اثربخشی نتایج پیشنهادی سامانه ژورنال یاب رایسست می گردد، کمبود و پراکندگی داده ها است؛ که با غنی سازی پایگاه داده، بهبود فرآیند پردازش و دسته بندی موضوعی داده ها می توان به میزان قابل توجهی با این دو چالش مقابله و اثربخشی نتایج پیشنهادی سامانه را بهبود بخشید.

کلید واژگان: اثربخشی, سامانه پیشنهاد دهنده نشریه, کمبود داده, پراکندگی داده, سامانه ژورنال یاب رایسست

The Impact of Data Lack and Data Sparsity on the Effectiveness of the Results of the RICeST Journal Finder Results: A Case Study in the Field of Engineering

Narjes Vara, Mahdieh Mirzabeigi*, Hajar Sotudeh, Seyed Mostafa Fakhrahmad, Niloofar Mozafari

Journal of Information Processing and Management, Volume:37 Issue: 4, 2022, PP 1293 -1317

Several factors among the components of recommender systems are involved in producing and presenting recommendations. The aim of this study was to investigate the effect of two challenges, lack of data and data sparsity, on the effectiveness of the results suggested by the RICeST Journal Finder. The corpus includes more than 15,000 articles from technical and engineering journals published between 2013 and 2017, collected from the journals' websites. The textual elements of these articles (title, abstract, and keywords) were extracted, normalized, and processed, and a research corpus database was created. Based on the number of collected articles, Cochran's formula was used to select 400 base articles, previously published in journals related to the topic, by proportional random sampling. The titles and abstracts of these articles were entered into the system as queries in order to obtain its suggested journals for publication, in two stages: before and after mitigating the lack and sparsity of data. The suggested results at each stage were saved in Excel files. Finally, the effectiveness of the system's results at each stage was determined by leave-one-out cross-validation, using precision at k. The relative frequency of the ranks showed that, in the existing state of the system, the target journal was suggested within the first 3 ranks for only 26% of the queries. After the data were enriched, normalized, and processed to mitigate the lack of data, the effectiveness of the results in the first 3 ranks increased by 15%, although in more than 30% of the queries the target journal was still suggested at rank 10 or beyond. Then, after the data were categorized thematically to mitigate sparsity, a 30% increase over the previous stage in the effectiveness of the results in the first 3 ranks was achieved.
The results of this study showed that enriching the database, improving the processing, and thematically classifying the data in the RICeST Journal Finder can mitigate the lack and sparsity of data and increase the effectiveness of the system's suggested results.
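
The rank-based figures in the abstract above (the share of queries whose target journal appears within the first 3 ranks) can be illustrated with a minimal Python sketch. It assumes each query has exactly one target journal, in which case the measure reduces to a hit rate at k. The function name, journal labels, and toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def hit_rate_at_k(ranked_lists: Sequence[Sequence[str]],
                  targets: Sequence[str],
                  k: int) -> float:
    """Fraction of queries whose target journal is among the first k suggestions."""
    if len(ranked_lists) != len(targets):
        raise ValueError("each query needs exactly one target journal")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Hypothetical toy data: four queries, each with a ranked list of suggested journals.
suggestions = [
    ["J1", "J2", "J3", "J4"],
    ["J5", "J1", "J2", "J3"],
    ["J2", "J3", "J4", "J1"],
    ["J4", "J5", "J2", "J6"],
]
# The journal in which each query article was actually published (assumed labels).
targets = ["J2", "J3", "J1", "J9"]

print(hit_rate_at_k(suggestions, targets, k=3))  # 0.25
```

Comparing this value before and after an improvement step, on the same set of queries, gives the kind of stage-to-stage comparison the abstract reports.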

Keywords: Efficiency, Journal Finder, lack of Data, Data Sparsity, RICeST Journal Finder

مقایسه دیدگاه پژوهشگران حوزه فنی- مهندسی و علوم انسانی در ارتباط با اهمیت معیارهای ارسال مقاله به نشریه و میزان ربط موضوعی نتایج پیشنهادی سامانه ژورنال یاب رایسست

نرجس ورع، مهدیه میرزابیگی*، هاجر ستوده، سید مصطفی فخراحمد

نشریه علوم و فنون مدیریت اطلاعات، سال هشتم شماره 2 (پیاپی 27، تابستان 1401)، صص 53 -72

هدف

این مطالعه با هدف مقایسه دیدگاه پژوهشگران حوزه فنی- مهندسی و علوم انسانی در ارتباط با اهمیت معیارهای ارسال مقاله به نشریه و میزان ربط موضوعی نتایج پیشنهادی سامانه ژورنال یاب رایسست انجام شده است.

روش پژوهش

پژوهش حاضر به لحاظ هدف کاربردی و روش گردآوری داده ها پیمایشی است. گام اول مطالعه، مبتنی بر پرسشنامه محقق ساخته و دیدگاه پژوهشگران و گام دوم بر اساس سیاهه وارسی مشتمل بر عناصر متنی مقالات و نظر متخصصان موضوعی/ داوران انجام شده است.

نتایج

یافته ها نشان داد معیارهای بررسی کارشناسانه/ داوری، ربط موضوعی مقاله با دامنه موضوعی نشریه و داشتن ضریب تاثیر، از دیدگاه پژوهشگران در هر دو حوزه مورد بررسی، دارای بیشترین میزان اهمیت و قدمت نشریه در رتبه آخر قرار داشت. همچنین سنجش میزان ربط موضوعی نتایج پیشنهادی سامانه، بر اساس نظر متخصصان نشان داد در بیش از 85 درصد پرس-وجوها، نشریه پیشنهادی برای مقاله مورد نظر کاملا مرتبط است و تفاوت معنی دار آماری بین نظر متخصصان/داوران در این دو حوزه وجود ندارد.

نتیجه گیری

با توجه به امکان پالایش نتایج پیشنهادی بر اساس معیارهای دارای اولویت، بنظر میرسد استفاده از این سامانه می-تواند به عنوان ابزار کمکی برای پژوهشگران مفید واقع شود.

کلید واژگان: انتشار مقاله, پژوهشگران, رایسست, ژورنال یاب, معیارهای انتخاب نشریه, نشریه مرتبط

Comparison of the views of researchers in the field of engineering and humanities about the importance of criteria for submitting an article to the journal and the degree of thematic relevance of the proposed results of the RICeST Journal Finder

Narjes Vara, Mahdieh Mirzabeigi *, Hajar Sotudeh, Seyed Mostafa Fakhrahmad

Journal of Sciences and Techniques of Information Management, Volume:8 Issue: 2, 2022, PP 53 -72

Aim

This study aims to compare the researchers' views in two fields about the importance of criteria for submitting an article to the journal and the degree of thematic relevance of the proposed results of the RICeST Journal Finder.

Methodology

The research is applied in purpose and uses a survey method for data collection. It was carried out in two steps. First, because no standard questionnaire was available, the criteria that are important and common to researchers in the two fields were extracted from the literature, and a researcher-made questionnaire consisting of 13 criteria was prepared. The face and content validity of the questionnaire was assessed by 10 experts in information science and epistemology. Then, in order to determine the importance of the criteria as well as the thematic relevance of the system's suggested results, subject matter experts (reviewers) in the two fields from all over Iran were consulted.

Findings

From the perspective of researchers in both groups, the criteria of peer review, thematic relevance of the article to the scope of the journal, and having an impact factor were the most important, and the age of the journal was the least important. Measuring the thematic relevance of the results suggested by the system, using the opinions of experts, showed that in more than 85% of the queries the suggested journal was completely relevant to the article. There was no statistically significant difference between the opinions of experts in the two fields. Note that the thematic relevance of the results was evaluated after the existing challenges in the RICeST Journal Finder had been mitigated.

Conclusion

There are many journals in the various scientific fields, so authors face the challenge of finding the most appropriate journal in which to publish their research findings. The results showed that the importance of the journal selection criteria, from the viewpoint of national researchers, is consistent with the findings of international studies. Given researchers' subjective preferences under different conditions, no single set of factors can be prescribed for choosing a journal; still, the ability to refine the results according to researchers' priority criteria makes journal finder systems a useful auxiliary tool for reaching a list of related journals more quickly and easily. Another factor that is important and can be examined objectively, regardless of the author's priorities and limitations, is the thematic relevance of the manuscript to the journal to which it is to be submitted, which is the basis on which journal finder systems operate. In general, according to the results obtained, authors can use the RICeST Journal Finder at the national level to find relevant journals in which to publish their manuscripts.

Keywords: Related Journal, Article Publishing, Journal Selection Criteria, Researchers, RICeST, Journal Finder

ارائه روشی نوین برای استخراج خودکار چهریزه ها در جستجوهای چهریزه ای (مورد مطالعه: حوزه زنان و زایمان)

عبدالحسین فرج پهلو، فریده عصاره، سید مصطفی فخراحمد، لیلا دهقانی*

پژوهشنامه پردازش و مدیریت اطلاعات، سال سی و هفتم شماره 3 (پیاپی 109، بهار 1401)، صص 807 -837

هدف این پژوهش ابداع و معرفی الگوریتمی نوین برای استخراج چهریزه ها ست که امکان تجربی شناسایی چهریزه ها با کمک پشتوانه انتشاراتی را فراهم می کند. الگوریتم پیشنهادی بر مبنای دو ایده شکل گرفته است: ایده اول این است که چهریزه در بافت بروز پیدا می کند. بنابراین برای تشخیص چهریزه در یک بدنه متنی بایستی بافت یا بستر آن مورد بررسی قرار گیرد و ایده دوم این است که چهریزه نقطه تمرکز در یک درخت واژگانی است که نه بسیار عام و نه بسیار خاص است. در حوزه پزشکی، دامنه زنان و زایمان به عنوان بستر آزمون انتخاب گردید. سه پیکره ی متنی از درون پشتوانه انتشاراتی انتخاب شد. پیکره ی بستر، از چکیده و عنوان مجموعه مقالات موجود در 20 مجله برتر حوزه انتخاب شد که در برگیرنده 167071 سند بود. پیکره دوم، پیکره منشاء بود که 2000 مقاله به صورت تصادفی از پیکره بستر، انتخاب شد. پیکره سوم، پیکره واژگانی است که با استفاده از یک سرویس تحت وب و معیار رتبه بندی واژگان LIDF-value استخراج گردید. خروجی حاصل، در برگیرنده 514 واژه بود. واژگان تکراری حذف شدند و در نهایت 480 واژه مهم شناسایی شد. سپس، واژگان در پیکره بستر با کمک مجموعه راهنما یعنی Mesh ، بسط داده شد و پس از آن بر اساس دو شرط انتقال مبتنی بر تکرار یعنی بیشتر بودن اسناد مرتبط با واژه در بستر نسبت به منشاء و انتقال مبتنی بر رتبه یعنی رشد رتبه موجود واژه در پیکره بستر نسبت به منشاء که نشان دهنده عام شدن واژه است، چهریزه های کاندید استخراج شدند. در نهایت با استفاده از سه قاعده ی اخص بودن، جایگزنی و اعم بودن، چهریزه های شناسایی شده اصلاح و نام گذاری شدند. در نهایت 26 چهریزه به عنوان چهریزه های حوزه زنان و زایمان شناسایی شدند. با مقایسه الگوریتم پیشنهادی با دیگر الگوریتم ها مشخص شد که ایجاد سه افراز (افراز منشاء و بدنه متنی و افراز برای شناسایی واژگان مهم) و مقایسه رفتار واژه در آنها و سپس ایجاد درخت بر اساس چهریزه های کاندید یعنی ترکیب رویکرد آماری و هرس درخت می تواند نتایج مناسب تری نسبت به رویکرد صرفا آماری یا هرس درخت داشته است. همچنین، مقایسه چهریزه های خروجی از الگوریتم و چهریزه های سنتی در این زمینه نشان داد که چهریزه های خروجی الگوریتم، خرد تر و برای مرور در ابزارهای بازیابی اطلاعات مفید تر هستند. 
همچنین،در این پژوهش مشخص شد که چهریزه های دامنه تخصصی از چهریزه های عمومی در حوزه پزشکی متفاوت است و مستقل از آنها قابل شناسایی و تعریف است اما نمی توان، نتایج را به تمامی دامنه های پزشکی تعمیم داد و نیاز است پژوهش های دیگری در دیگر حوزه ها صورت گیرد.

کلید واژگان: بازیابی اطلاعات, چهریزه, جستجوی چهریزه ای, استخراج خودکار چهریزه

Introducing a novel method for Automatic facet extraction in the faceted search (Case Study: gynecology and obstetrics domain)

Abdolhossein Farajpahlou, Farideh Osareh, Seyed Mostafa Fakhrahmad, Leila Dehghani*

Journal of Information Processing and Management, Volume:37 Issue: 3, 2022, PP 807 -837

In this research, a new algorithm for facet extraction has been developed and introduced, which makes it possible to identify facets empirically on the basis of literary warrant. A review of the research on automatic facet extraction pointed to two main ideas. The first idea is that a facet appears in context. Therefore, to identify a facet in a corpus, its context must be examined. The second idea is that a facet is the focal point in a lexical tree that is neither very general nor very specific. Based on these two ideas, the obstetrics and gynaecology domain within medicine was chosen as the test bed. The research team selected three corpora from the literary warrant. The contextual corpus was built from the abstracts and titles of the articles in the top 20 journals of the field and contained 167071 documents. For the origin corpus, 2000 articles were randomly selected from the contextual corpus. The third corpus is the lexical corpus, whose terms were extracted using a web-based service and the LIDF-value term-ranking measure. The output contained 514 words. Duplicate words were removed and, finally, 480 important words were identified. Then, the words were expanded in the contextual corpus with the help of a guide set, MeSH, and candidate facets were extracted based on two conditions: frequency-based shifting and rank-based shifting. Next, using the three rules of specificity, substitution, and generality, the identified facets were modified and named. In the end, 26 facets were identified in the domain of gynaecology and obstetrics. Comparing the proposed algorithm with other algorithms showed that combining the statistical approach with tree pruning can give better results than a purely statistical approach or tree pruning alone.
Also, comparing the facets output by the algorithm with the traditional facets in the obstetrics and gynaecology domain showed that the algorithm's facets are finer-grained and more useful for browsing in information retrieval tools. This study also found that the facets of a specialized domain differ from the general facets of medicine and can be identified and defined independently of them; however, the results cannot be generalized to all medical domains, and further research is needed in other fields.

Keywords: information retrieval, facet, faceted search, automatic facet extraction

تحلیل سنجه های استنادمحور برای تعیین میزان ربط مقاله ها

مرضیه گل تاجی، جواد عباس پور*، عبدالرسول جوکار، سید مصطفی فخراحمد، علیرضا نیک سرشت

نشریه مطالعات کتابداری و سازماندهی اطلاعات، سال سی و دوم شماره 3 (پیاپی 127، پاییز 1400)، صص 56 -76

هدف

شناخت توانایی سنجه های استنادمحور (هم استنادی، زوج کتاب شناختی، امسلر، پیج رنک و هیتس(اعتبار و کانون)) برای تعیین میزان ربط مقاله ها با یکدیگر.

روش

پژوهش حاضر از نظر هدف، کاربردی و از لحاظ شیوه گردآوری داده ها، پژوهشی توصیفی از نوع همبستگی است. جامعه آماری، مجموعه مقالات موجود در زیرمجموعه دسترسی آزاد پاب مد سنترال مجموعه آزمون سایترک بود که بر اساس سه سنجه هم استنادی، زوج کتاب شناختی و امسلر با سایر مقالات رابطه استنادی داشتند. از میان 26262 مقاله، 30 مقاله به عنوان مقالات پایه انتخاب شد و مقالات مرتبط با هر یک از آن ها بر اساس سنجه ربط مش بازیابی گردید؛ هر یک از سنجه های استنادمحور متغیر مستقل و سنجه ربط مش متغیر وابسته بود. با استفاده از نرم افزار شبیه ساز ومپ سرور و پی.اچ.پی.مای ادمین یک پایگاه مای. اس. کیو.ال ایجاد شد؛ سپس، با مطالعه کلیه کدهای مورد نیاز از بسته کد منبع سایترک، کدهای لازم با اعمال تغییرات ضروری، اجرا و نتایج حاصل در پایگاه مای. اس. کیو.ال وارد شد. با نوشتن پرس وجو به زبان اس. کیو.ال، شبکه استنادی مجموعه به صورت کامل استخراج شد سپس با کدنویسی به زبان پایتون اعداد مربوط به پیج رنک و هیتس (اعتبار و کانون) به صورت جداگانه محاسبه گردید.

یافته ها

نتایج نشان داد تمامی شش سنجه در سطح یک صدم همبستگی معنادار و مثبت با میزان ربط مقاله ها داشت؛ به عبارت دیگر، با افزایش مقادیر هریک از سنجه ها، درجه ربط مقاله ها نیز افزایش یافت. بیشترین میزان همبستگی مربوط به سنجه امسلر و پس از آن، زوج کتاب شناختی بود. پس از سنجه های امسلر و زوج کتاب شناختی، بیشترین همبستگی میان متغیر هیتس(اعتبار) با ربط مقاله ها بود. متغیر پیج رنک در مرتبه چهارم قرار داشت؛ در نهایت، کم ترین میزان همبستگی با ربط مقاله ها، مربوط به سنجه های هم استنادی و هیتس(کانون) بود؛ بنابراین، از میان سنجه های استنادی بررسی شده در این پژوهش، سنجه های امسلر، زوج کتاب شناختی، هیتس(اعتبار) و پیج رنک بیش از سایر سنجه ها از پتانسیل لازم برای تعیین میزان ربط مقاله ها برخوردار بودند.

نتیجه گیری

بر اساس یافته های پژوهش می توان گفت سنجه های استنادمحور مطالعه شده قادرند درجه ربط مقاله ها را برآورد کنند و در بافتارهای مختلف بازیابی اطلاعات شامل موتورهای جست وجو، پایگاه های اطلاعاتی و استنادی، سامانه های پیشنهاددهنده و حتی کتابخانه های دیجیتالی برای دسترسی به مقالات مرتبط، پیشنهاد مقالات مشابه و رتبه بندی نتایج بازیابی کاربرد داشته باشند؛ همچنین، لازم است به سنجه امسلر که نسبت به دو سنجه سنتی هم استنادی و زوج کتاب شناختی، در سامانه های اطلاعاتی کمتر استفاده شده است، بیش از پیش توجه شود؛ از طرفی، علیرغم اینکه سنجه هم استنادی در برخی از پایگاه ها و سامانه های بازیابی اطلاعات بین المللی(مانند ساینس دایرکت و سایت سیر) برای بازیابی مدارک مرتبط و پیشنهاد مدارک مشابه استفاده می شود در مقایسه با سایر سنجه ها از کارایی کمتری برخودار است.

کلید واژگان: ربط مقاله ها, هم استنادی, زوج کتاب شناختی, امسلر, پیج رنک, هیتس, سنجه های استنادمحور

Analysis of Citation-based Indicators to Determine the Relevance of Articles

M. Goltaji, J. Abbaspour *, A. Jowkar, S.M. Fakhrahmad, A. Nikseresht

Librarianship and Information Organization Studies, Volume:32 Issue: 3, 2021, PP 56 -76

Purpose

The present study aimed to investigate the potential of citation-based indicators (Co-Citation, Bibliographic Coupling, Amsler, PageRank, HITS) to determine the relevance of articles.

Method

This is applied research with a correlational approach. The population consisted of 26,262 articles in the PubMed Central open access subset of the CITREC test collection, which had citation relationships with other articles based on all three traditional citation-based indicators (Co-Citation, Bibliographic Coupling, Amsler). From among the articles in the research population, 30 were selected as base articles, and the articles related to each of them were retrieved based on MeSH similarity. Then the similarities among the retrieved documents were extracted based on the citation-based indicators. Each of the citation-based metrics was treated as an independent variable and MeSH similarity as the dependent variable. A MySQL database was created using the WampServer software and phpMyAdmin. Then, using the online demo of the CITREC test collection, an output was prepared. By entering this output into the MySQL database containing the research data set, the main structure of its tables was created. Next, the required code from the CITREC source code package was studied and run with the necessary changes, and the results were entered into the MySQL database. By writing a query in SQL, the citation network of the collection was extracted in full and stored in a comma-separated values (CSV) file. Finally, a program was written in Python to open and process this large file and calculate the PageRank and HITS (authority and hub) scores.
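
The three traditional citation-based indicators named above can be sketched on a toy citation graph. This is a minimal illustration under stated assumptions, not the CITREC implementation: the function names and the edge list are hypothetical, raw counts are used without normalization, and the Amsler measure is taken in one common formulation, the number of documents linked (as citer or reference) to both articles.

```python
from collections import defaultdict

# Hypothetical toy citation network: (citing, cited) pairs.
citations = [
    ("C", "A"), ("C", "B"),   # C cites both A and B
    ("D", "A"), ("D", "B"),   # D cites both A and B
    ("A", "X"), ("B", "X"),   # A and B both cite X
    ("A", "Y"), ("Y", "B"),   # A cites Y, and Y cites B
]

refs = defaultdict(set)    # article -> articles it cites
citers = defaultdict(set)  # article -> articles that cite it
for citing, cited in citations:
    refs[citing].add(cited)
    citers[cited].add(citing)


def co_citation(a: str, b: str) -> int:
    """Number of documents that cite both a and b."""
    return len(citers.get(a, set()) & citers.get(b, set()))


def bibliographic_coupling(a: str, b: str) -> int:
    """Number of documents cited by both a and b."""
    return len(refs.get(a, set()) & refs.get(b, set()))


def amsler(a: str, b: str) -> int:
    """Number of documents linked to both a and b, as citer or as reference."""
    linked_a = refs.get(a, set()) | citers.get(a, set())
    linked_b = refs.get(b, set()) | citers.get(b, set())
    return len(linked_a & linked_b)


print(co_citation("A", "B"), bibliographic_coupling("A", "B"), amsler("A", "B"))
# 2 1 4
```

In this toy graph the Amsler count also includes Y, which A cites and which in turn cites B, a link that neither co-citation nor bibliographic coupling captures on its own.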

Findings

The results showed that all six measures studied had a significant and positive correlation with the relevance of articles. In other words, as the value of each measure increased, the degree of relevance of the articles also increased. The highest correlation with the relevance of the articles belonged to the Amsler measure, followed by Bibliographic Coupling. After Amsler and Bibliographic Coupling, the highest correlation was observed for the HITS (Authority) variable, and the PageRank variable was in fourth place. Finally, the lowest correlations with the relevance of the articles were those of Co-Citation and HITS (Hub). Therefore, among the citation-based measures studied here, the Amsler, Bibliographic Coupling, HITS (Authority), and PageRank metrics, in that order, had more potential to determine the relevance of articles than the others.

Conclusion

Based on the findings, it can be concluded that the citation-based metrics studied are able to estimate the degree of relevance of articles. Therefore, they can be used in various information retrieval platforms, including search engines, citation-based databases, recommender systems, and even digital libraries, to access related articles, suggest similar articles, and rank retrieved results. Also, the Amsler measure, which is used less in information retrieval systems than the two traditional measures (Co-Citation and Bibliographic Coupling), deserves more attention. On the other hand, although the Co-Citation measure is used in some international information retrieval databases (such as ScienceDirect and CiteSeer) to retrieve relevant documents and suggest similar documents, it is less effective than the other metrics.

Keywords: Citation-based metrics, Relevance of Articles, Co-citation, Bibliographic Coupling, Amsler, PageRank, HITS

استخراج کلمات و عبارات کلیدی از متون فارسی(مروری بر پژوهش های صورت گرفته)

عاطفه کلانتری*، عبدالرسول جوکار، سید مصطفی فخراحمد، جواد عباس پور، هاجر ستوده، مسعود مرتضوی نصرآباد، امیر جوادی، زهرا پوربهمن

پژوهشنامه پردازش و مدیریت اطلاعات، سال سی و ششم شماره 2 (پیاپی 104، زمستان 1399)، صص 563 -592

استخراج کلمات/ عبارات کلیدی متن، پیش‏‏نیاز بسیاری دیگر از وظایف حوزه پردازش زبان طبیعی است. اما بررسی متون فارسی و انگلیسی این حوزه نشان ‏می ‏دهد، تلاش‏های انگشت‏شماری برای استخراج کلمات/ عبارات کلیدی از متون فارسی صورت گرفته است. لذا، این مقاله، ‏با هدف تعیین موقعیت کنونی پردازش زبان طبیعی فارسی و ‏به‏طور خاص استخراج کلمات/ عبارات کلیدی از متون فارسی، ‏به‏ مرور خلاصه‏‏‏‏ای ‏از مقالات فارسی و انگلیسی منتشر‏شده در این حوزه که از متون فارسی برای آزمودن ایده‏هایشان استفاده کرده‏‏‏اند‏، ‏می‏پردازد؛ سپس هر مقاله را از نظر روش‏‏شناسی، نحوه اجرا و ‏پیاده‏سا‏‏زی، روش ارزیابی و معیارهای آن مورد تعمق قرار داده و به چالش ‏می‏کشد.در مجموع 14 مقاله فارسی و 6 مقاله انگلیسی به استخراج کلمات و عبارات کلیدی از متون فارسی پرداخته‏ اند‏. روش بیشتر این مقالات، استفاده از اطلاعات آماری و ‏زبان‏‏‏شناختی بوده ‏است. اکثر این مقالات یا در روش‏شناسی انتخاب‏ شده ایراد دارند و یا نویسندگان نتوانسته‏ اند‏ ایده پیشنهادی‏شان را ‏به ‏وضوح برای خواننده تبیین نمایند. ‏ در بسیاری از مقالات، از مجموعه داده استانداردی برای ارزیابی سیستم استفاده نشده و نحوه محاسبه معیارهای ارزیابی مبهم یا دارای اشکال است.در مجموع، ‏به ‏جز 3 مقاله که روش اجرا‏شده را ‏به ‏نحو نسبتا قابل‏قبولی گزارش کرده‏اند‏، سایر مقالات قابلیت تکرار‏پذیری و تعمیم ندارند. لذا نمی‏توان از آن‏ها ‏به‏ عنوان معیار پایه‏‏ای ‏برای ارزیابی سیستم‏های آینده استفاده کرد یا از ایده مطرح‏ شده در آن‏ها با اطمینان در ساخت و توسعه نرم‏افزارهای کاربردی و عملی در حوزه استخراج کلمات کلیدی استفاده نمود.

کلید واژگان: استخراج کلمات کلیدی, استخراج عبارات کلیدی, پردازش زبان طبیعی, زبان فارسی, بررسی مروری

Keyword and phrase Extraction from Persian texts: a review of the literature

Atefeh Kalantari*, Abdolrasool Jowkar, Seyed Mostafa Fakhrahmad, Javad Abbaspour, Hajar Sotudeh, Massoud Mortazavi, Amir Javadi, Zahra Pourbahman

Journal of Information Processing and Management, Volume:36 Issue: 2, 2021, PP 563 -592

Keyword and phrase extraction is a prerequisite of many natural language processing tasks. However, a review of the related Persian and English literature showed that only a few studies have been done on how to extract keywords and phrases from Persian texts. Thus, aiming to shed light on the research status of keyword and phrase extraction from Persian texts, the present study reviews the Persian and English publications that have tested their research ideas on Persian texts. We also examine each of the studies to challenge its methodology, implementation, and evaluation methods and measures. To our knowledge, a total of 14 Persian and 6 English papers have worked on the extraction of Persian keywords and phrases. An examination of the papers revealed that they were mostly based on statistical and linguistic information. A majority of the papers suffered from either inappropriate methodologies or an unclear explanation of their research ideas. Many of them used non-standard datasets and vague or problematic metrics to evaluate the experimental systems. Overall, except for 3 papers that reported their proposed methods reasonably well, the papers lacked reproducibility and generalizability. Hence, their results cannot be confidently used as a benchmark in evaluating future work, and their proposed ideas cannot be confidently employed in developing applications for the extraction of keywords and phrases from Persian texts.

Keywords: extraction, key words, key phrases, natural language processing, Persian language, review

سنجش شباهت نظرات داوری آزاد و محتوای مقالات علمی به روش پردازش زبان طبیعی

کیانوش رشیدی شریف آباد، هاجر ستوده*، مهدیه میرزابیگی، سید مصطفی فخراحمد

نشریه مطالعات کتابداری و سازماندهی اطلاعات، سال سی و یکم شماره 2 (پیاپی 122، تابستان 1399)، صص 86 -103

هدف

شناسایی قابلیت داوری های آزاد در بازشناخت مقالات پزشکی براساس شباهت آنها به مقالات مربوط.

روش شناسی

آزمونی متشکل از 2212 مقاله اف هزار ریسرچ و نظرات داوری آنها ساخته شد. 100 مقاله به عنوان مدرک پایه به صورت تصادفی انتخاب شد. شباهت نظرات داوری و محتواهای مدارک براساس سنجه شباهت کسینوسی مقادیر TF-IDF در سطح تک واژه ها و دوواژه ها محاسبه شد. شباهت محتوا و نظرات با تحلیل همبستگی اسپیرمن تحلیل شد. صحت پیش بینی شباهت محتوای مقالات براساس شباهت نظرات دریافت شده به کمک منحنی مشخصه عملکرد سامانه آزمون شد.

یافته ها

توان نظرات داوران در بازشناخت مقالات مشابه تایید شد. میان محتوا و نظرات، همبستگی معنادار وجود دارد. منحنی های تحلیل عملکرد سامانه نیز نشان داد شباهت نظرات داوری، خواه در سطح تک واژه ها و خواه دوواژه ای ها توانایی شناسایی مقالات با محتوای مشابه را دارد.

نتیجه گیری

اعتبار نظرات داوران ریشه در توان تخصصی و شناختی آنان دارد. بنابراین، نظرات می توانند در شبکه مدارک، در زمره منابع مرتبط اثربخش در بازشناخت مدارک به شمار آیند. این یافته راه را برای پژوهش در کاربرد نظرات کاربران در حوزه های بازیابی، ارزیابی، یا طبقه بندی متون هموار می کند که شباهت محتوایی در آنها اهمیت دارد.

کلید واژگان: نظرات کاربران, داوری آزاد, پردازش زبان طبیعی, شباهت, منحنی تحلیل عملکرد سامانه

Measuring the Similarity of Open Peer Review Comments to the Content of Scientific Articles Using Natural Language Processing

K. Rashidi Sharifabad, H. Sotodeh *, M. Mirzabeigi, S.M. Fakhrahmad

Librarianship and Information Organization Studies, Volume:31 Issue: 2, 2020, PP 86 -103

Purpose

The social web provides a platform for publicizing open peer review reports. In this sphere, journal readers, authors, editors, and reviewers can engage in multilateral discussions on the reviewed papers and share their comments and viewpoints on the merits and probable pitfalls of papers. Open peer review comments may, hence, reflect the features of the articles they refer to. To identify this potential, the present study investigates to what extent similar comments accurately predict similar papers.

Methodology

Applying natural language processing techniques, the study analyzes the contents of a sample of papers in medicine and life sciences and the comments received by them. To do so, a test collection is built from the papers openly published on F1000Research, an open access publishing platform that adheres to an open peer reviewing process by transparently providing the public with peer review reports, authors’ responses, and users’ comments. The test collection consists of 2212 papers and their comments. 100 papers are randomly selected as seed documents that serve as queries. The similarities between the comments and between the contents of the papers are calculated using the cosine similarity of TF-IDF values. The TF-IDF values are calculated for both unigrams and bigrams extracted from the contents and comments. The correlation between the content and comment similarities is analyzed using Spearman correlation, given the non-normality of the data distributions. The accuracy with which the similarity of comments predicts the similarity of the papers’ contents is tested using Receiver Operating Characteristic (ROC) curves.
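
The similarity computation described above, cosine similarity of TF-IDF values over unigrams and bigrams, can be sketched in plain Python. This is a minimal illustration, not the study's pipeline: the tokenizer, the smoothed IDF formula, the function names, and the three toy comments are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def ngrams(text, n_values=(1, 2)):
    """Lowercase word n-grams (unigrams and bigrams by default)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in n_values:
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


def tfidf_vectors(docs, n_values=(1, 2)):
    """One sparse TF-IDF vector (dict) per document."""
    counts = [Counter(ngrams(doc, n_values)) for doc in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n_docs = len(docs)
    # Smoothed IDF; this particular formula is an assumption of the sketch.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical reviewer comments (not from the F1000Research collection).
comments = [
    "the sample size is too small for the statistical analysis",
    "statistical analysis needs a larger sample size",
    "figure captions are missing units",
]
vecs = tfidf_vectors(comments)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

The same functions could be applied to the paper contents, after which the two sets of pairwise similarities can be compared, for example with a rank correlation.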

Findings

The Spearman correlation revealed a significant correlation between the content and comment similarities. This signifies that similar papers are more likely to receive similar comments and vice versa. The ROC curves show that similar comments can significantly identify similar papers at both the unigram and bigram levels, and the prediction is highly accurate.
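The two statistics reported here can be sketched as follows. This is an illustrative standard-library implementation, not the authors' analysis: the similarity values are made-up toy numbers, and the 0.5 cut-off used to label a pair of papers as "similar" is an assumption.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def roc_auc(scores, labels):
    """Area under the ROC curve: the probability that a random positive
    outscores a random negative, with ties counted as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy similarity values for six paper pairs (hypothetical).
comment_sim = [0.82, 0.10, 0.55, 0.30, 0.71, 0.05]
content_sim = [0.90, 0.20, 0.60, 0.15, 0.65, 0.10]
# Assumed cut-off: a pair counts as "similar" when content similarity >= 0.5.
similar_label = [1 if c >= 0.5 else 0 for c in content_sim]

rho = spearman(comment_sim, content_sim)
auc = roc_auc(comment_sim, similar_label)
```

An AUC near 1 means comment similarity ranks similar pairs above dissimilar ones almost perfectly, while 0.5 corresponds to chance.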

Conclusion

Similar comments are effective in representing similar papers; that is, papers that receive similar comments can be expected to be similar themselves. This finding has implications for interactive information retrieval systems, where users are interested in reading experts’ comments on a given paper before viewing or downloading the paper itself. The findings may also pave the way for new studies on the application of comments in areas such as information retrieval, evaluation, or classification, where content similarity is important.

Keywords: comments, peer reviewing, Natural Language Processing, similarity, ROC curve analysis

Extracting Facets for the Gynecology and Obstetrics Domain Based on a User-Oriented Approach

Abdolhossein Farajpahlou, Farideh Osareh, Seyed Mostafa Fakhrahmad, Leila Dehghani*

Health Information Management, Year 16, Issue 6 (Serial 70, Bahman and Esfand 1398), pp. 285-293

Introduction

Although the concept of facet analysis has a long history in classification and information retrieval systems, applying it in today's retrieval systems faces problems, one of which is insufficient attention to the user as the system's main stakeholder. The aim of this study was to present a method for extracting appropriate facets in modern information retrieval systems using a user-oriented approach.

Methods

To understand users' needs and arrive at the facets of the specialized domain of gynecology and obstetrics, conventional content analysis with a qualitative approach was used. First, 14 specialists in midwifery and in gynecology and obstetrics were interviewed, and the information needs of the user group were identified. The information needs were then classified with the help of subject specialists, and a facet was assigned to each category. To evaluate the usefulness of the extracted facets, an expert panel of 8 subject specialists and 8 specialists in medical library and information science was used, and agreement was assessed with the total agreement formula.

Results

From the codes extracted from the interviews on the information needs of stakeholders in gynecology and obstetrics, 23 facets were obtained. Of these, 9 facets (age group, organ, therapeutic methods, diagnosis, disease, signs and symptoms, risk factor, complication, and prognosis) received an agreement coefficient above 80 percent and were identified by the experts as appropriate.

Conclusion

Extracting the facets of information retrieval systems with a user-oriented approach turns the facets from general into specialized ones. The facets in the user interface then differ for the users of each specialized domain, and specialized user interfaces take shape.

Keywords: information-seeking behavior, information storage and retrieval, classification

The User-oriented Approach for Facet Extraction in Gynecology and Obstetrics Domain

Abdolhossein Farajpahlou, Farideh Osareh, Seyed Mostafa Fakhrahmad, Leila Dehghani*

Health Information Management, Volume:16 Issue: 6, 2020, PP 285 -293

Introduction

Although the concept of facet analysis has a long history in classification and information retrieval systems, its use in current information retrieval systems has been associated with drawbacks. One of these is the lack of proper attention to the user as the main stakeholder of the system. This study presents a method for extracting appropriate facets in modern information retrieval systems using a user-oriented approach.

Methods

To understand the needs of users and arrive at the facets of gynecology and obstetrics, conventional content analysis with a qualitative approach was employed. First, the information needs of the user group were identified through interviews with 14 specialists in the fields of gynecology and obstetrics. Then, the information needs were classified with the help of specialists in the subject area, and a facet was attributed to each category. An expert group consisting of eight subject-area specialists and eight specialists in medical library and information science evaluated the usefulness of the extracted facets, and agreement was assessed with the total agreement formula.

Results

Based on the codes extracted from the interviews on the information needs of stakeholders in the domain of gynecology and obstetrics, 23 facets were identified. Nine of these (age group, organ, therapeutics, diagnosis, disease, signs and symptoms, risk factor, complication, and prognosis) received a coefficient of agreement above 80% and were identified by the experts as appropriate facets.
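The abstract does not spell out the "total agreement formula". As a rough illustration only, the sketch below assumes it is simple percent agreement, that is, the share of panel members who rate a facet as appropriate, and then applies the 80% threshold mentioned above. The vote counts and the "Cost" facet are hypothetical.

```python
def percent_agreement(votes):
    """Share of experts (0-100) who rated the facet as appropriate."""
    return 100.0 * sum(votes) / len(votes)

# Hypothetical votes from a 16-member panel (1 = appropriate, 0 = not).
panel_votes = {
    "Age group": [1] * 15 + [0] * 1,    # 15 of 16 experts agree
    "Risk factor": [1] * 14 + [0] * 2,  # 14 of 16 experts agree
    "Cost": [1] * 9 + [0] * 7,          # hypothetical facet, 9 of 16 agree
}

THRESHOLD = 80.0  # facets must score above 80% to be kept

accepted = [facet for facet, votes in panel_votes.items()
            if percent_agreement(votes) > THRESHOLD]
```

With these made-up votes, "Age group" and "Risk factor" pass the threshold and "Cost" does not.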

Conclusion

Extracting the facets of information retrieval systems with a user-oriented approach converts the facets from general to specialized ones. The facets shown in the user interface then differ for the users of each specialized domain, and specialized user interfaces take shape.

Keywords: Information Seeking Behavior, Information Storage and Retrieval, Classification

Design and Implementation of a Persian Spelling Error Detection and Correction System Based on Word Meaning

Mohammad Bagher Dastgheib*, Sara Koleini, Seyed Mostafa Fakhrahmad

Signal and Data Processing (quarterly), Year 16, Issue 3 (Serial 41, Autumn 1398), pp. 117-128

Designing and implementing Persian natural language processing tools always faces challenges that arise from the particular features of the language. Because automatic spelling correction systems are used in many areas, such as query correction, spell checking on the internet, and text editing programs, suitable software needs to be built for Persian as well. This paper first introduces the types of spelling errors and the approaches to detecting and correcting them, and then presents the Perspell system, which detects and corrects errors based on the meaning of Persian words. An evaluation of Perspell against other common, similar software showed that it is an effective tool for detecting non-word and real-word errors and suggesting correct words for them. The F-measure improved significantly in both the detection and suggestion stages. The evaluation also showed that Perspell detects more real-word errors, can suggest correct replacements for incorrect words, and achieves significantly higher recall in real-word error detection than competing software.

Keywords: Persian spell checker system, word error correction, word error detection, natural language processing, Persian language model

Design and implementation of Persian spelling detection and correction system based on Semantic

M.B. Dastgheib*, Sara Koleini, S.M. Fakhrahmad

Signal and Data Processing, Volume:16 Issue: 3, 2019, PP 117 -128

Persian script has special features in electronic text, such as its graphemes, homophones, and connected characters that take multiple shapes. As a result, designing and implementing NLP tools for Persian is more challenging than for languages such as English or German. Spelling tools are widely used for editing user texts such as emails and documents in text editors, and developing such tools for Persian helps Persian software check spelling and reduce errors in electronic texts. In this work, we review spelling error detection and correction methods, especially for the Persian language. The proposed algorithm consists of two steps. The first step detects and corrects non-word errors using an intelligent scoring algorithm. The second step detects and corrects real-word errors. We propose a spelling system, "Perspell", for Persian non-word and real-word errors that uses a hybrid scoring system and a language model optimized with a lexicon. The scoring system combines lexical and semantic features, and the weights of these features are optimized in a learning phase on a training dataset. Perspell is compared with well-known Persian spell checkers and outperforms them in the precision of detection and correction. The proposed system can also detect and correct real-word errors, an open and challenging category of spelling errors that is complicated and time-consuming to handle in Persian. In the evaluation of the proposed method, the F-measure improved significantly (by about 10%) for detecting and correcting Persian words. We used a Persian language model with bootstrapping and smoothing to overcome data sparseness and lack of data. The bootstrapping is built on a Persian dictionary, and word sense disambiguation is then used to select the correct replacement word.
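The first step, non-word error correction with a scoring function, can be sketched roughly as below. This is not Perspell's implementation: the toy English lexicon and bigram counts, the equal weights, and the use of a bigram context score in place of the paper's semantic features are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical lexicon (word -> frequency) and bigram counts.
LEXICON = {"book": 50, "boot": 10, "read": 40, "the": 200, "wear": 15}
BIGRAMS = {("read", "the"): 30, ("the", "book"): 25, ("the", "boot"): 2}

def score(candidate, typo, previous, w_lex=0.5, w_ctx=0.5):
    """Weighted mix of a lexical feature and a context feature.
    The weights are placeholders; the paper learns them from data."""
    lexical = 1.0 / (1 + edit_distance(typo, candidate))
    if previous in LEXICON:
        context = BIGRAMS.get((previous, candidate), 0) / LEXICON[previous]
    else:
        context = 0.0
    return w_lex * lexical + w_ctx * context

def correct(tokens, max_dist=2):
    """Replace each out-of-lexicon token with its best-scoring candidate."""
    fixed = []
    for tok in tokens:
        if tok in LEXICON:          # in the lexicon: not a non-word error
            fixed.append(tok)
            continue
        prev = fixed[-1] if fixed else None
        cands = [w for w in LEXICON if edit_distance(tok, w) <= max_dist]
        if cands:
            fixed.append(max(cands, key=lambda w: score(w, tok, prev)))
        else:
            fixed.append(tok)       # no candidate close enough: keep as-is
    return fixed

result = correct(["read", "the", "bok"])
```

This sketch only handles tokens missing from the lexicon. Real-word errors, the second step in the abstract, would need every token to be checked against its context, which is where the language model and word sense disambiguation come in.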

Keywords: Spell Error Detection, Spell Error Correction, Persian spell Checker, NLP, Persian Language Model

Analyzing the Application of Hyland's Metadiscourse Model in Citation-Based Automatic Summarization: A Proposed Annotation Scheme for Citation Contexts

Pegah Tajer*, Abdorasoul Jowkar, Seyed Mostafa Fakhrahmad, Hajar Sotoudeh, Alireza Khormaee

Library and Information Science (quarterly), Year 22, Issue 3 (Serial 87, Autumn 1398), pp. 91-111

Objective

This paper aims to analyze the application of Hyland's metadiscourse model in citation-based automatic summarization of scientific texts and to propose a metadiscourse-based annotation scheme for citation contexts for use in citation-based summarization.

Methodology

This is library research. The research questions were answered by studying and analyzing sources on Hyland's metadiscourse model, automatic summarization of scientific texts, citation context analysis, and the classification of citation functions.

Findings

Hyland's interactional metadiscourse is used to show the author's perspective on propositional information and on the reader, draws on linguistic devices suited to the critique genre, and is suitable for analyzing citation contexts. A metadiscourse-based annotation scheme for citation contexts was therefore proposed, built on hedges, boosters, attitude markers, self-mentions, and engagement markers, which are the main components of Hyland's interactional metadiscourse. The scheme comprises 70 classes.

Conclusion

Hyland's interactional metadiscourse can be used to build a suitable corpus for citation-based automatic summarization, and it can serve as the basis for building the classifiers the summarization process needs, refining citation contexts, and selecting sentences for the final summary. Corpora are generally annotated according to an annotation scheme, so the proposed scheme can be useful. Because the proposed scheme rests on existing theories, annotators should be asked, while labeling, to note with reasons any label that comes to mind beyond those in the scheme, so that it can be added to the scheme once satisfactory agreement is reached.

Keywords: Hyland metadiscourse, citation contexts, citation-based summarization, annotation scheme, scientific texts

Analyzing the Application of Hyland Metadiscourse Model for Citation-based Automatic Text Summarization: A proposed Annotation Scheme for Citation Contexts

Pegah Tajer*, Abdorasoul Jowkar, Seyed Mostafa Fakhrahmad, Hajar Sotoudeh, Alireza Khormaee

Library and Information Science, Volume:22 Issue: 3, 2019, PP 91 -111

Objective

An author's abstract contains the contributions that the author considers important, but these may matter less to the scientific community. Supplementary information on what the community values can be obtained by analyzing citing articles. The citation contexts that refer to a cited article are, in effect, summaries of that article produced by the scientific community. This type of summary, called a citation summary, can provide deeper insight into the impact of the article on the scientific community. Selecting useful citation sentences to insert in a system summary is one of the major challenges of citation-based automatic text summarization. Semantic analysis of citation contexts reveals citation functions, so it can be used to refine citation contexts and to insert important content in the final summary. Approaches such as metadiscourse analysis, which provide more information, should therefore produce more useful summaries. Accordingly, this paper analyzes the application of the Hyland metadiscourse model to citation-based automatic summarization of scientific texts. Moreover, based on the Hyland metadiscourse model, an annotation scheme is proposed for citation contexts that could be used in corpus-based citation summarization systems.

Methodology

This is library research that answers the research questions by studying and analyzing resources on the Hyland metadiscourse model, scientific text summarization, citation context analysis, and citation function classification. The scheme evolved over two stages of analysis. First, an initial scheme was created by studying existing schemes. Then, its metadiscourse version was derived by analyzing the Hyland metadiscourse model. Expert evaluation was performed to validate the proposed annotation scheme. Three experts in Information Science and two in Linguistics confirmed the scheme.

Findings

Hyland interactional metadiscourse is suitable for analyzing citation contexts because it is used to represent the author's perspective on propositional information and also the reader. Moreover, interactional metadiscourse analysis applies appropriate language tools for the critique genre. Therefore, a scheme was proposed based on boosters, attitude markers, hedges, engagement markers and self-mentions which are the main components of Hyland interactional metadiscourse. The proposed scheme includes 70 classes.

Conclusion

Hyland interactional metadiscourse can be used to construct proper corpora for automatic citation-based text summarization. Other phases of automatic summarization, such as classifier development, citation context refinement, and sentence selection, could also be based on this type of metadiscourse. Corpora are usually annotated using an annotation scheme, so the proposed scheme would be beneficial. However, it is a conceptual scheme based on existing theories. Annotators should therefore be asked to write down any new labels that occur to them while annotating, together with their reasons for creating them. In the next stage, if the desired level of agreement is reached, those labels could be added to the scheme.

Keywords: Annotation Scheme, Citation-based Summarization, Citation Contexts, Hyland Metadiscourse Model, Scientific Texts

Citation Contexts of Information Science Articles

Pegah Tajer, Seyed Mostafa Fakhrahmad, Abdorasoul Jowkar*, Alireza Khormaee, Hajar Sotoudeh

Librarianship and Information Organization Studies, Year 30, Issue 3 (Serial 119, Autumn 1398), pp. 24-44

Purpose

To identify, classify, and analyze the citation contexts of information science articles and the types of citations using Hyland's metadiscourse approach.

Methodology

The study was carried out in two phases: "citation class identification" (Jurgens et al., 2016) and "metadiscourse-based analysis of the identified function" (Hyland, 2005). It examined 164 citation contexts from articles citing 10 English-language articles (656 explicit and implicit citation sentences in total).

Findings

In terms of metadiscourse, citations were grouped into 2 main classes, "interactive" and "interactional" citations, with 4 subclasses at the second level, 14 at the third level, and 23 at the fourth level. The identified citations were mostly interactive rather than interactional. The perceived classes were also more descriptive than analytical or critical.

Conclusion

The citation classification in this study resembles existing schemes down to the third level, and most of the overlap lies in the interactive citation classes. The interactional citation types identified here appear able to help refine citation contexts in information retrieval systems for scientific texts and in the qualitative evaluation of research impact.

Keywords: citation classification, metadiscourse analysis, citation context analysis, information science, Hyland model

Citation Contexts of Information Science Articles

P. Tajer, S. M. Fakhrahmad, A. Jowkar *, A. Khormaee, H. Sotudeh

Librarianship and Information Organization Studies, Volume:30 Issue: 3, 2019, PP 24 -44

Purpose

Identifying, classifying, and analyzing citation contexts of information science articles based on Hyland's meta-discourse approach.

Methodology

This research was carried out in two phases: "citation class identification" (Jurgens et al., 2016) and "metadiscourse-based analysis of identified function" (Hyland, 2005). It analyzed 164 citation contexts from articles citing 10 English-language articles (656 explicit and implicit citation sentences in total).

Findings

Based on metadiscourse, citation functions were categorized into 2 main classes, "Interactive citations" and "Interactional citations", comprising 4 sub-classes at the second level, 14 at the third level, and 23 at the fourth. The identified citations were more often interactive than interactional. Moreover, the perceived classes were more descriptive than analytical.

Conclusions

The taxonomy perceived in this study resembles existing citation classification schemes in the literature down to the third level. In addition, most similarities are in the area of interactive functions. It seems that interactional citations identified in this study could be used to refine citation contexts in scientific information retrieval systems as well as in the process of qualitative evaluation of the impact of research.

Keywords: Citation classification, Meta-discourse Analysis, Citation Context Analysis, Information Science, Hyland Model

The Growth of the Facet Analysis Approach in Knowledge Organization: A Hundred-Year Review

Abdolhossein Farajpahlou*, Farideh Osareh, Seyed Mostafa Fakhrahmad, Leila Dehghani

Journal of Information Processing and Management, Year 34, Issue 3 (Serial 97, Spring 1398), pp. 1235-1264

The facet analysis approach has grown steadily since the early twentieth century. This paper systematically reviews studies and documentation of faceted organization schemes and groups these studies by subject and period. The review identified how applications of the approach have grown and developed in information organization and retrieval tools, and it offered suggestions for future researchers. To this end, the first step was a comprehensive search of sources and an initial examination of the documents; the second step was classifying and filtering the documents; and the third step was classifying the documents by period and subject, analyzing the texts, identifying existing gaps, and finally making suggestions for covering those gaps. Earlier efforts produced faceted classifications, faceted thesauri and subject headings, and faceted information retrieval systems, work that continued extensively until the 1990s. After that, with the development of computer systems and the web, facets took on a different role in retrieving information from databases. In this period a set of models, faceted metadata, faceted user interfaces, and faceted ontologies took shape, and numerous software tools were developed. From about the early twentieth century until 1990, the facet analysis approach advanced on the basis of a logical (a priori) system of classifying the sciences. From then on, because of expanding computing capabilities and growing user needs, the logical view gave way to a computational and user-oriented (a posteriori) view. Building facet structures for the semantic web and creating new standards, using more effective methods of understanding user behavior, and attending to the historical development and evolution of science are gaps that still need further study. Covering these gaps promises a lasting impact for facet analysis in the future.

Keywords: knowledge organization, information retrieval, facet, facet analysis, systematic review

The development of facet analysis approach in knowledge organization: a 100-Year Review

Abdolhossein Farajpahlou*, Farideh Osareh, Seyed Mostafa Fakhrahmad, Leila Dehghani

Journal of Information Processing and Management, Volume:34 Issue: 3, 2019, PP 1235 -1264

The facet analysis approach (FAA) has grown continuously since the early 20th century. The present paper systematically reviews the studies and documents on faceted organization schemes and classifies these works by subject and period. The review identified the growth and development trends in applying this approach to information organization and retrieval tools and provided suggestions for future research. The work proceeded in the following steps. The first step was a comprehensive search of relevant sources and a preliminary review of the documents, followed by classification and refinement of the documents in the second step. The third step addressed temporal and thematic classification of the documents, analysis of the literature, and identification of the existing gaps. In the final step, some suggestions were provided for covering these gaps. Earlier work produced faceted classifications, faceted thesauri and subject headings, and faceted information retrieval systems, which remained in extensive use until the 1990s. Subsequently, with the development of computer systems and the web, facets took on another role in retrieving information from databases. During this period, a set of models, faceted metadata, faceted user interfaces, and faceted ontologies was created, followed by the development of several software tools in this field. From the early 20th century to 1990, the facet analysis approach developed on the basis of a logical (a priori) system of science classification. Since then, owing to growing computing capabilities and user needs, the logical perspective has been replaced by a computational and user-oriented (a posteriori) perspective. Creating facet structures in semantic web environments and formulating new standards, using more effective methods to understand user behavior, and taking the historical development and change of the sciences into account are gaps that still require further study. Covering these gaps promises a lasting impact for the facet analysis process in the future.

Keywords: knowledge organization, information retrieval, facet, Facet analysis, Systematic review
