Density-Based K-Nearest Neighbor Active Learning for Improving Farsi-English Statistical Machine Translation System

Author(s):

Somayeh Bakhshaei, Reza Safabakhsh, Shahram Khadivi

Abstract:

Labeled data are a useful resource for many applications in fields such as image processing and natural language processing, but producing them is costly. One efficient way to reduce annotation cost is to manage the sampling process: it is better to query for a few essential samples than for a large group of unnecessary ones. Active learning (AL) attempts to overcome the labeling bottleneck by querying an annotator to label selected unlabeled instances. This technique has been applied in Natural Language Processing (NLP), especially in Statistical Machine Translation (SMT), which is the focus of this work. In SMT, parallel corpora are scarce resources, and AL is one way of addressing this problem: it requests translations only for the most informative sentences, those essential for improving the system. The contribution of our work is a new AL approach that selects sentences through a soft decision-making process. In addition to scoring sentences according to their informativeness, the algorithm also considers the distribution of the unlabeled data space. Each sentence, labeled or unlabeled, is converted to a vector of feature scores. Each newly arriving sentence is then placed in the feature space and assigned two probabilities: how probable it is to be labeled and how probable it is to be unlabeled. These probabilities are calculated from the position of the new instance relative to its labeled and unlabeled neighbors. We applied the proposed model to improve the training corpus of an SMT system, using the Farsi-English language pair as the baseline SMT system. We sampled the sentences most likely to improve the quality of our SMT system and sent queries for their translations, thereby reducing the cost of building a parallel corpus.
Finally, our experiments show significant improvements for sentence sampling by soft decision making over the random sentence selection strategy.
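
The soft decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the kernel, the value of k, the features, or the final selection rule, so the Gaussian kernel, the `bandwidth` and `k` parameters, the function names, and the rule "query the candidates most likely to be unlabeled" are all assumptions made for illustration.

```python
import math

def gaussian_kernel(u, v, bandwidth=1.0):
    """Similarity in [0, 1]; identical vectors score 1. (Assumed kernel.)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def soft_membership(x, labeled, unlabeled, k=3, bandwidth=1.0):
    """Return (p_labeled, p_unlabeled) for the feature vector x.

    The k most similar neighbours are taken from the union of both pools,
    and each neighbour votes for its own pool with its kernel weight.
    """
    pool = [(gaussian_kernel(x, v, bandwidth), "L") for v in labeled]
    pool += [(gaussian_kernel(x, v, bandwidth), "U") for v in unlabeled]
    nearest = sorted(pool, key=lambda t: t[0], reverse=True)[:k]
    total = sum(w for w, _ in nearest)
    if total == 0.0:  # all neighbours too far away to carry any weight
        return 0.5, 0.5
    p_labeled = sum(w for w, tag in nearest if tag == "L") / total
    return p_labeled, 1.0 - p_labeled

def select_queries(candidates, labeled, unlabeled, budget, k=3, bandwidth=1.0):
    """Indices of the `budget` candidates with the highest p_unlabeled.

    Hypothetical selection rule: prefer sentences lying in regions that
    the labeled data do not yet cover.
    """
    scored = []
    for i, x in enumerate(candidates):
        _, p_unlabeled = soft_membership(x, labeled, unlabeled, k, bandwidth)
        scored.append((p_unlabeled, i))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:budget]]

# Toy 2-D feature vectors standing in for sentence feature scores.
labeled = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
unlabeled = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
candidates = [(0.05, 0.05), (1.05, 1.05)]
print(select_queries(candidates, labeled, unlabeled, budget=1))  # prints [1]
```

In this toy data the second candidate sits among unlabeled points only, so it receives the higher unlabeled probability and is the one sent for translation. In a real setting the vectors would hold per-sentence feature scores, and the probabilities would presumably be combined with an informativeness score, which the sketch leaves out.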

Keywords:

Active Learning, Statistical Machine Translation, Farsi-English Language Pair, Soft Decision Making, Kernel-Based Distance, Density-Based KNN

Language:

English

Published:

International Journal Information and Communication Technology Research, Volume 7, Issue 3, Summer 2015

Pages:

63 to 72

https://www.magiran.com/p1507534


Approved scientific journal

International Journal Information and Communication Technology Research

English-language technical and engineering quarterly

ISSN: 2251-6107

License holder:

Information and Communication Technology Research Institute

Managing director:

Dr. Seyed Mohammad Razavizadeh

Editor-in-chief:

Dr. Ahmad Khademzadeh

Journal telephone: 021-88630059

