Articles related to the keyword "speech recognition" in journals of the "Electrical Engineering" group


Spoken Term Detection for Persian News of the Islamic Republic of Iran Broadcasting

Hadi Veisi*, Sayed Akbar Ghoreishi, Azam Bastanfard

Signal and Data Processing (quarterly), Year 17, No. 4 (Serial No. 46, Winter 1399), pp. 67-88

The goal of spoken term detection, or keyword search, is to detect and locate a set of keywords in a collection of spoken documents (such as lectures and meetings). In this research, a Persian spoken term detection system based on speech recognition is designed and implemented for information retrieval in the audio and video archives of the Islamic Republic of Iran Broadcasting. To this end, spoken documents are first recognized into text, and the search is then run over that text. The Large FarsDat dataset is used to train the Persian speech recognition system. Using a subspace Gaussian mixture model (SGMM), the system reached a word error rate of 2.71 percent on this dataset and 28.23 percent on a Persian news dataset. For spoken term detection, the proxy words method is used as the baseline, and a method based on a long short-term memory network with connectionist temporal classification (LSTM-CTC) is proposed to improve the detection of out-of-vocabulary (OOV) words. With the proxy words method on the Large FarsDat dataset, the term detection system reached an actual term weighted value (ATWV) of 0.9206 for in-vocabulary keywords and 0.2 for OOV keywords; the LSTM-CTC method improved the OOV figure by about fifty percent, to 0.3058. For spoken term detection on the Persian news dataset, an ATWV of 0.8008 was obtained.

Keywords: Persian spoken term detection, keyword search, speech recognition, IRIB, Kaldi

Research/Original Article. Language: Persian

Spoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting

Hadi Veisi*, Sayed Akbar Ghoreishi, Azam Bastanfard

Signal and Data Processing, Volume 17, Issue 4, 2021, pp. 67-88

The Islamic Republic of Iran Broadcasting (IRIB), one of the largest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIB's archive is one of the richest in Iran and contains a huge amount of multimedia data. Monitoring this volume of data, and browsing and retrieving from the archive, is a key issue for the broadcaster. The aim of this research is to design a content retrieval engine for the IRIB's media and productions using spoken term detection (STD), or keyword spotting. The goal of an STD system is to search for a set of keywords in a set of speech documents. One method for STD uses a speech recognition system: speech is recognized and converted into text, and the text is then searched for the keywords. The variety of speech documents and the limited vocabulary of the recognizer are two challenges of this approach. Large vocabulary continuous speech recognition (LVCSR) systems have a large but limited vocabulary and cannot recognize out-of-vocabulary (OOV) words. LVCSR-based STD systems therefore suffer from the OOV problem and cannot spot OOV keywords. Methods such as sub-word units (e.g., phonemes or syllables) and proxy words have been introduced to overcome the vocabulary limitation and to deal with OOV keywords. This paper proposes a Persian (Farsi) STD system based on speech recognition and uses the proxy words method to deal with OOV keywords. To improve the performance of this method, we use a Long Short-Term Memory-Connectionist Temporal Classification (LSTM-CTC) network. In our experiments, we designed and implemented an LVCSR system for the Farsi language. The Large FarsDat dataset, which contains 80 hours of speech from 100 speakers, is used to train the speech recognition system, and the Kaldi toolkit is used to implement it.
Because the dataset is limited, Subspace Gaussian Mixture Models (SGMM) are used to train the acoustic model. The acoustic model is trained on context-dependent triphones, and the language model is a word trigram model. The word error rate (WER) of the speech recognition system is 2.71% on the FarsDat test set and 28.23% on Persian news collected from IRIB data. Term detection is based on weighted finite-state transducers (WFST). In this method, the speech recognizer first converts a speech document to a lattice, which retains the recognizer's alternative hypotheses and their probabilities rather than only the most probable one, and the lattice is then converted to a WFST. This WFST retains the word probabilities computed by the recognizer. Text retrieval is then used to index and search over the WFST output. The proxy words method is used to deal with OOV keywords: OOV words are represented by in-vocabulary words with similar pronunciations. To improve the performance of the proxy words method, an LSTM-CTC network is proposed. The LSTM-CTC is trained on the characters of isolated words (not continuous sentences). It recomputes the probabilities and re-verifies the proxy outputs, which improves the proxy words method because that method suffers from false alarms. Since the LSTM-CTC is an end-to-end network trained on characters, it does not need a phonetic lexicon and can support OOV words. Because it is trained on isolated words, it reduces the weight of the language model and emphasizes the acoustic model. The proposed STD system achieves an Actual Term Weighted Value (ATWV) of 0.9206 for in-vocabulary keywords and 0.2 for OOV keywords using the proxy words method. Applying the proposed LSTM-CTC improves the OOV ATWV to 0.3058. On the Persian news dataset, the proposed method achieves an ATWV of 0.8008.
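For readers unfamiliar with the ATWV figures quoted in this abstract, the following is a minimal sketch of how the metric is commonly computed. It assumes the definition used in the NIST 2006 STD evaluation (a false-alarm weight of 999.9 and one non-target trial per second of speech); the abstract does not state which settings the authors used, and the function name and input layout are illustrative only.

```python
def atwv(term_stats, speech_seconds, beta=999.9):
    """Actual Term-Weighted Value over a set of keywords.

    term_stats maps each keyword to a tuple
    (n_true, n_correct, n_false_alarm) counted at the system's
    actual decision threshold. beta=999.9 is the NIST STD 2006
    setting and is an assumption here, not a value from the abstract.
    """
    losses = []
    for n_true, n_correct, n_false_alarm in term_stats.values():
        if n_true == 0:
            # P_miss is undefined for keywords that never occur; skip them.
            continue
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials: one per second of speech, minus true occurrences.
        p_fa = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + beta * p_fa)
    if not losses:
        raise ValueError("no keyword has a true occurrence")
    return 1.0 - sum(losses) / len(losses)
```

A perfect system scores 1.0, a system that outputs nothing scores 0.0, and false alarms can push the value below zero, which is why an OOV score of 0.2 to 0.3 still reflects useful detections.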

Keywords: Persian Spoken Term Detection, IRIB, Persian News, Keyword Spotting, Speech Recognition, Kaldi

Research/Original Article. Original language: Persian
Robust Speech Recognition Using Long Short-Term Memory Neural Networks and Bottleneck Features

Amin Moaven Joula, Ahmad Akbari*, Babak Naser Sharif

Journal of Electrical Engineering, Year 49, No. 3 (Serial No. 89, Autumn 1398), pp. 1333-1343

Deep neural networks have been used widely in speech recognition systems in recent years. Nevertheless, making these models robust to environmental noise has received less attention. This paper examines two approaches for making long short-term memory (LSTM) network models robust to additive environmental noise. The first approach increases the robustness of LSTM models to noise, building on the ability of these networks to learn the long-term behavior of noise. To that end, it is proposed that noisy speech be used to train the models so that they are trained in a noise-aware manner. Results on a noise-corrupted version of the TIMIT dataset show that training the models on noisy rather than clean speech improves recognition accuracy by up to 18 percent. The second approach reduces the effect of noise on the extracted features by using a denoising autoencoder network, and uses bottleneck features to compress the feature vector and obtain a higher-level representation of the input features. This approach makes the features more robust to noise and, as a result, raises the accuracy of the recognition system proposed in the first approach by 4 percent in the presence of noise.

Keywords: speech recognition, noise robustness, multi-condition data, autoencoder network, long short-term memory network

Research/Original Article. Language: Persian

Robust Speech Recognition using Long Short Term Memory Networks and Bottleneck Features

Amin Moaven Joula, Ahmad Akbari*, Babak Naser Sharif

Journal of Electrical Engineering, Volume 49, Issue 3, 2020, pp. 1333-1343

Deep neural networks have been widely used in speech recognition systems in recent years. However, the robustness of these models in the presence of environmental noise has been less discussed. In this paper, we propose two approaches for making deep neural network models robust to additive environmental noise. In the first approach, we increase the robustness of long short-term memory (LSTM) networks in the presence of noise, based on their ability to learn long-term noise behavior. For this purpose, we propose using noisy speech to train the models, so that the LSTMs are trained in a noise-aware manner. Results on the noisy TIMIT dataset show that training the models with noisy rather than clean speech improves recognition accuracy by up to 18%. In the second approach, we reduce noise effects on the extracted features using a denoising autoencoder network and use bottleneck features to compress the feature vector and represent the input features at a higher level. This method increases the accuracy of the recognition system proposed in the first approach by 4% in the presence of noise.
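Noise-aware (multi-condition) training of the kind described here needs clean utterances corrupted with additive noise at chosen signal-to-noise ratios. The sketch below shows one common way to do that mixing; it is an illustration only, the abstract does not say how the noisy TIMIT data was generated, and the function name `mix_at_snr` is hypothetical.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean samples so the mixture has the requested SNR in dB.

    clean and noise are sequences of float samples; noise must be
    non-silent and at least as long as clean.
    """
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    # Gain that scales the noise power to p_clean / 10^(snr_db / 10).
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(clean, noise)]
```

Applying this to each training utterance with several noise types and SNR levels yields a multi-condition training set.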

Keywords: Speech recognition, Noise robustness, Multicondition data, Autoencoder network, Long short term memory network

Research/Original Article. Original language: Persian
Convolutional Neural Network with Adaptive Windows for Speech Recognition

Toktam Zoughi, Mohammad Mehdi Homayounpour*

Signal and Data Processing (quarterly), Year 15, No. 3 (Serial No. 37, Autumn 1397), pp. 13-30

Although speech recognition systems are improving continuously and are in wide use, their accuracy still falls far short of human recognition ability, and the gap widens under mismatched conditions. One of the main causes is the high variability of the speech signal. In recent years, deep neural networks combined with hidden Markov models have achieved notable success in speech processing. This paper seeks to model speech better by changing the structure of the deep convolutional neural network so that it conforms more closely to speakers' variations in the speech signal. To this end, existing models and inference on them are improved and extended. By presenting a deep convolutional network with adaptive windows, the paper makes the speech recognition system robust to pronunciation differences between speakers and to differences among the utterances of a single speaker. Analyses and experimental results on the FarsDat and TIMIT speech datasets showed that the proposed method reduces the absolute phone recognition error relative to the deep convolutional network by 1.2 and 1.1 percent, respectively, a notable amount for the speech recognition problem.

Keywords: speech recognition, deep neural network, convolutional neural network, adaptive windows

Research/Original Article. Language: Persian

Adaptive Windows Convolutional Neural Network for Speech Recognition

Toktam Zoughi, Mohammad Mehdi Homayounpour *

Signal and Data Processing, Volume 15, Issue 3, 2018, pp. 13-30

Although speech recognition systems are widely used and their accuracy is continuously improving, a considerable gap remains between their accuracy and human recognition ability. This is partly due to high speaker variation in the speech signal. Deep neural networks are among the best tools for acoustic modeling. Recently, hybrid deep neural network and hidden Markov model (HMM) systems have brought considerable performance gains in speech recognition, because deep networks model complex correlations between features. The main aim of this paper is to achieve better acoustic modeling by changing the structure of the deep convolutional neural network (CNN) so that it adapts to speaking variations. In this way, existing models and the corresponding inference task are improved and extended.
Here, we propose the adaptive windows convolutional neural network (AWCNN) to analyze variations in joint temporal-spectral features. AWCNN changes the structure of the CNN and estimates the probabilities of HMM states. It is intended to make the model more robust to speech signal variations, both within a single speaker and across speakers, and so to model speech signals better. The AWCNN method is applied to the speech spectrogram and models time-frequency variations.
This network handles speaker feature variations, speech signal variations, and variations in phone duration. Results and analysis on the FARSDAT and TIMIT datasets show that, for the phone recognition task, the proposed structure achieves absolute error reductions of 1.2% and 1.1%, respectively, relative to CNN models, which is a considerable improvement for this problem. Based on the results of the conducted experiments, we conclude that the use of speaker information is very beneficial for recognition accuracy.

Keywords: Speech recognition, deep neural network, Convolutional neural network, Adaptive windows convolutional neural network

Research/Original Article. Original language: Persian
A Robust Classifier for Speech Recognition Based on the Synergy of Clustering and Observation Frequency

Mohammad Mosleh, Mohammad Kheyrandish, Mahdi Mosleh, Najmeh Hosseinpour

Electronics Industries, Year 8, No. 3 (Serial No. 30, Autumn 1396), p. 111

Speech recognition, one of the most important branches of speech processing, has long attracted researchers. Speech recognition is a technology that can determine the spoken word or words represented by an acoustic signal. The complexity of speech recognition systems depends on the extracted features, their dimensionality, and the classifier used. This paper proposes a new classifier that, in the knowledge extraction phase, computes a suitable model for each reference word in the form of two matrices, "winner" and "minimum distance", through the synergy of clustering and observation frequency. In the recognition phase, the proposed method uses a penalty-reward mechanism to determine the similarity between the unknown input speech and the reference word models. The FarsDat database is used to evaluate the proposed method. Results of numerous experiments on clean and noisy signals show that the proposed method offers better robustness to noise, higher accuracy, and lower time complexity than speech recognition systems based on the hidden Markov model.

Keywords: speech recognition, classification, hidden Markov models, clustering, feature extraction, robustness

Research/Original Article. Language: Persian

Proposing a Robust Classifier for Speech Recognition Based on Synergy Clustering and Observations Frequency

Mohammad Mosleh, Mohammad Kheyrandish, Najmeh Hosseinpour, Mahdi Mosleh

Electronics Industries, Volume 8, Issue 3, 2017, p. 111

Speech recognition, one of the important branches of speech processing, has long attracted researchers and scientists. Speech recognition is a technology that determines the pronounced word or words represented by an acoustic signal. The complexity of speech recognition systems depends on the extracted features, their dimensions, and the applied classifier. In this paper, we propose a new classifier which, in a knowledge extraction phase, computes two matrices, “winner” and “minimum distance”, as a suitable model for each reference word using the synergy of clustering and frequency of observations. In the recognition phase, the proposed method determines the similarity between the unknown input speech and the reference word models using a penalty-reward mechanism. The FARSDAT dataset is used to evaluate the proposed method. The results of several experiments on clean and noisy signals show that the proposed method is more robust to noise, more accurate, and has lower time complexity than the HMM-based speech recognition system.

Keywords: Classification, Feature Extraction, Hidden Markov Model (HMM), robustness, Clustering, Speech recognition

Research/Original Article. Original language: Persian
A New Information Retrieval Method Suited to Texts Produced by Speech Recognition

Rouhollah Dianat, Morteza Ali Ahmadi*, Yahya Akhlaghi, Bagher Babaali

Signal and Data Processing (quarterly), Year 13, No. 4 (Serial No. 30, Winter 1395), pp. 93-108

This paper presents a preprocessing step for information retrieval methods that is suited to retrieving information from speech-recognized texts. The preprocessing combines query correction and query expansion. The inputs to the problem are text documents obtained from speech recognition and a query, and the goal is to find the documents relevant to the query word. The difficulty is that text produced by speech recognition always contains some percentage of recognition errors, so words that are in fact relevant but have been altered by a recognition error may not be detected as relevant. The idea of the proposed method is to detect recognition errors in words and to consider similar words for those words detected as errors. To detect an erroneous word, a parameter called the word error probability is defined; a large value indicates a higher chance that the word contains an error. To find similar words, initial similar words are first found using the Levenshtein distance. The probability of converting these similar words into the original query word is then computed. Semantically similar words are selected, based on a threshold level, from among the words with the higher conversion probability. In the retrieval algorithm, the similar words are then treated as relevant in the search in addition to the original word. Implementation results show that the proposed algorithm improves the F-measure by up to 30%.

Keywords: information retrieval, speech recognition, document, query, Levenshtein distance

Language: Persian

Introducing a new information retrieval method applicable for speech recognized texts

Rouhollah Dianat, Morteza Ali Ahmadi *, Yahya Akhlaghi, Bagher Babaali

Signal and Data Processing, Volume 13, Issue 4, 2017, pp. 93-108

This article introduces a preprocessing method for retrieving speech-recognized texts. The inputs are a text corpus generated by a speech recognition system and a query; the goal is to search the documents for the query and find the relevant ones. The main problem is that speech-recognized texts typically contain some percentage of recognition errors, which causes terms to be erroneously assigned to irrelevant documents.
The idea of the proposed method is to detect error-prone terms and to find similar words for each such term. A parameter is defined that gives the probability of an error occurring in an error-prone word. To find similar words for a specific term, candidates are chosen as the initial similar-word set based on a criterion called the average detection rate (ADR) and on the Levenshtein distance. A conversion probability is then defined based on the conversion rate (CR) and the noisy channel model (NCM), and the words whose probability exceeds a threshold are selected as the final similar words. In the retrieval process, these words are included in the search in addition to the base word. Implementation results show an improvement of up to 30% in F-measure when this preprocessing is used.
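As an illustration of the initial candidate step only, the sketch below expands a query with vocabulary words that lie within a small Levenshtein distance of it. It deliberately omits the ADR, CR, and noisy-channel scoring that the abstract mentions, and the function names and the fixed distance threshold are assumptions rather than the authors' implementation.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def expand_query(query, vocabulary, max_distance=1):
    """Return the query followed by vocabulary words within max_distance edits."""
    similar = [w for w in vocabulary
               if w != query and levenshtein(query, w) <= max_distance]
    return [query] + sorted(similar)
```

A retrieval engine would then treat a document as a match if it contains any word in the expanded list, which is how misrecognized variants of the query can still be found.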

Keywords: Information retrieval, Speech recognition, Document, Query, Levenshtein Distance

Original language: Persian
Fast Estimation of Warping Factors in Vocal Tract Length Normalization Using Scores Obtained from Gender Detection Modeling

Yasser Shekofteh, Hasan Gholipor, M. Mohsen Goodarzi, Jahanshah Kabudian, Farshad Almasganj, Shaghayegh Reza, Iman Sarraf Rezaei

Signal and Data Processing (quarterly), Year 13, No. 1 (Serial No. 27, Spring 1395), pp. 57-70

One of the major problems of automatic speech recognition (ASR) systems is the variability among speakers, transmission channels, and environments, which causes the performance of these systems to change sharply across application conditions. Making recognition systems robust to these variations is a current issue in speech recognition. One factor that degrades system performance is that the same phones produced by different speakers have different acoustic characteristics. A main cause of this is the difference in vocal tract length (VTL) between speakers. Vocal tract length normalization (VTLN) is a common method for addressing this problem, in which a frequency warping factor is determined for each speaker. This paper presents the conventional search-based approach to determining the warping factor in an HMM-based Persian continuous speech recognition system and describes its computational problems. Finally, a method based on linear regression over the score computed by gender detection modeling is proposed for estimating warping factors, which considerably reduces the computational cost of the search-based method. In addition, experimental results on conversational telephone speech test data show a 0.54 percent improvement in word recognition accuracy for the proposed method over the conventional search-based method.

Keywords: speech recognition, vocal tract length normalization, gender detection, linear regression, frequency warping factor

Language: Persian

Fast estimation of warping factor in the vocal tract length normalization using obtained scores of gender detection modeling

Yasser Shekofteh, Hasan Gholipor, M.Mohsen Goodarzi, Dr. Jahanshah Kabudian, Dr. Farshad Almasganj, Shaghayegh Reza, Iman Sarraf

Signal and Data Processing, Volume 13, Issue 1, 2016, pp. 57-70

The performance of automatic speech recognition (ASR) systems is adversely affected by variations in speakers, audio channels, and environmental conditions. Making these systems robust to such variations is still a big challenge. One of the main sources of speaker variation is the difference in Vocal Tract Length (VTL). Vocal Tract Length Normalization (VTLN) is an effective method for coping with this variation: the speech spectrum of each speaker is frequency-warped according to a warping factor specific to that speaker. In this paper, we first implement the common search-based method for obtaining the appropriate warping factor in an HMM-based Persian continuous speech recognition system. Then, pointing out the computational cost of the search-based method, we propose a linear regression process that estimates the warping factor from the scores generated by our gender detection system. Experimental results on a Persian conversational speech database show an improvement of about 0.54 percent in word recognition accuracy, as well as a significant reduction in the computational cost of estimating the warping factor, compared to the search-based approach.
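The regression idea can be pictured with a small sketch: fit a line from gender-detection scores to warping factors found by the search-based method on training speakers, then predict a factor for a new speaker from the score alone. Everything here is an assumption made for illustration, including the single scalar score, the function names, and the clipping range of 0.88 to 1.12, which is a commonly used VTLN search range and not a value taken from the abstract.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_warp(score, slope, intercept, lo=0.88, hi=1.12):
    """Map a gender-detection score to a warping factor, clipped to [lo, hi].

    lo and hi are assumed bounds for illustration only.
    """
    return min(hi, max(lo, slope * score + intercept))
```

The saving comes at test time: one score and one multiply-add replace a decoding pass per candidate warping factor.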

Keywords: speech recognition, Vocal Tract Length Normalization, gender detection, linear regression, warping factor

Original language: Persian
Feature Mapping Using a Deep Belief Network for Robust Speech Recognition

Mojtaba Gholamipour*, Babak Nasersharif

The Modares Journal of Electrical Engineering, Year 14, No. 3 (Autumn 1393), pp. 24-30

The performance of automatic speech recognition systems degrades sharply in noisy conditions because of the mismatch between training and test conditions. Numerous methods have been proposed to remove this mismatch. In recent years, deep neural networks have been used widely in speech recognition systems, in making them robust, and in extracting robust speech features. This paper proposes using a deep belief network as a post-processing method to compensate for the effect of noise on Mel cepstrum features. In addition, a deep belief network is used to extract tandem features (posterior probabilities of phone occurrence) from the denoised Mel cepstrum coefficients in order to obtain more robust and more discriminative features. The final robust feature vector consists of the denoised Mel cepstrum features and the tandem features mentioned above. Evaluation results on the Aurora 2 speech dataset indicate that the proposed feature vector performs better than conventional and similar features, increasing recognition accuracy by 28% relative to Mel cepstrum features.

Keywords: Mel cepstrum, tandem feature, deep belief network, robustness, speech recognition

Language: English

Feature mapping using deep belief networks for robust speech recognition

Mojtaba Gholamipour *, Babak Nasersharif

The Modares Journal of Electrical Engineering, Volume 14, Issue 3, 2014, pp. 24-30

Performance of automatic speech recognition (ASR) systems degrades in noisy conditions due to the mismatch between training and test environments. Many methods have been proposed for reducing this mismatch. In recent years, deep neural networks (DNNs) have been widely used in ASR systems, as well as in robust speech recognition and feature extraction. In this paper, we propose using a deep belief network (DBN) as a post-processing method for de-noising Mel frequency cepstral coefficients (MFCCs). In addition, we use a DBN to extract tandem features (posterior probabilities of phone occurrence) from the de-noised MFCCs obtained in the previous stage, to obtain more robust and discriminative features. The final robust feature vector consists of the de-noised MFCCs concatenated with these tandem features. Evaluation results on the Aurora2 database show that the proposed feature vector performs better than similar and conventional techniques, increasing recognition accuracy by 28% on average in comparison to MFCCs.

Keywords: MFCC, Tandem feature, DBN, Robustness, Speech recognition

Original language: English
Robust Speech Recognition Using Temporal Pattern Features Obtained from the Optimized MTMLP Neural Network Structure

Yasser Shekofteh*, Farshad Almasganj

Computational Intelligence in Electrical Engineering, Year 5, No. 3 (Autumn 1393), pp. 23-36

Temporal pattern features of an audio signal can be extracted either from the time domain or from representation vectors. These features carry long-term information about the continuous changes of speech units. In this paper, temporal pattern features are extracted, using the phone posterior probability output of the optimized MTMLP neural network structure, from sets of spectrum-based representation vectors (such as the LFBE speech feature) and cepstrum-based ones (such as the MFCC speech feature). It is shown that combining the temporal pattern information (long-term dynamics) obtained from the log-spectral and cepstral domains with the baseline recognition feature vector, which comprises the conventional MFCC features and their first and second time derivatives (short-term dynamics), improves phone recognition accuracy on clean test data by about 1 percent over the best baseline recognition system. The features obtained by the proposed method also give more robust recognition under various noise conditions (up to about 13 percent), which indicates the noise robustness of the proposed method.

Keywords: speech recognition, feature extraction, temporal patterns, posterior probability, neural network, hidden Markov model

Language: Persian

Robust Speech Recognition Using Temporal Pattern Feature Extracted From MTMLP Structure

Yasser Shekofteh *, Farshad Almasganj

Intelligent Systems in Electrical Engineering, Volume 5, Issue 3, 2014, pp. 23-36

The temporal pattern feature of a speech signal can be extracted either from the time domain or from its front-end vectors. This feature carries long-term information about variations in connected speech units. In this paper, the second approach is followed, i.e., temporal patterns are computed from spectral-based (LFBE) and cepstrum-based (MFCC) feature vectors. To extract these features, we use the posterior probability-based output of the proposed MTMLP neural networks. The combination of the temporal patterns, which represent the long-term dynamics of the speech signal, with traditional features composed of the MFCCs and their first and second derivatives is evaluated in an ASR task. It is shown that this combined feature vector increases phoneme recognition accuracy by more than 1 percent over the baseline system, which does not use long-term temporal patterns. In addition, the features extracted by the proposed method give more robust recognition under different noise conditions (by 13 percent); the proposed method is therefore a robust feature extraction method.

Keywords: Speech Recognition, Feature Extraction, Temporal Pattern, Posterior Probability, Neural Network, Hidden Markov Model

Original language: Persian

Note

Results are sorted by publication date.
The keyword was searched only in the keywords field of the articles. To exclude unrelated results, the search was limited to journals on the same subject as the source journal.