Current technological advances have made us dependent on machines for carrying out daily activities. However, breaches of highly sensitive data have raised awareness that information accessible online is not secure. Some of these breaches occur through human interaction, which has led the scientific community to search for alternative methods of minimizing the risk of unauthorized access to personal data. One field that has gained considerable attention is biometrics. Biometrics uses physical characteristics of the human body that are unique to a person, such as fingerprints, iris, and voice, to verify that person's identity.

Problem Statement

For the listener, the speech signal carries many layers of information. Although speech conveys a message using words, it also contains information about the speaker's gender, emotions, language, and overall identity. This section discusses the motivation behind the use of speaker recognition and the general categories and tasks associated with it, followed by a discussion of the purpose of this work and the proposed solution.

Motivation

With the growing number of services accessed via phone, web, or mobile apps, maintaining and remembering the multiple passwords, PINs, and authentication details required to access accounts remotely has become more challenging, especially since security experts encourage the use of different credentials for different accounts. Meanwhile, with the existing infrastructure, a speaker's voice is a biometric that can easily be tested in remote-access applications. This makes speaker recognition valuable for many real-world applications.

Definition

Speaker recognition involves identifying a speaker from his or her speech and can be divided into two categories, text-dependent and text-independent.
The text-dependent mode requires the speaker to say the same words that were used for feature extraction, while the text-independent mode can identify the speaker regardless of the words spoken. Text-dependent speaker recognition has prior knowledge of the text to be spoken. Text-independent speaker recognition is based on the speaker's physiological characteristics and makes no assumptions about the speech content. Speaker recognition can also be divided into two general tasks, namely speaker identification and speaker verification. Speaker identification involves determining who is speaking from a group of known voices or speakers. Speaker verification involves determining whether a person is who they claim to be.

Aim of the thesis

The aim of this work is to present an experimental evaluation of feature extraction techniques that could be used for text-independent speaker verification. Feature extraction is the process of extracting speaker-specific properties from the raw signal and storing them in a feature vector. The speech signal has many characteristics, not all of which are important for speaker verification. A good feature should discriminate between speakers while having little within-speaker variability; be robust to noise; occur frequently and naturally in speech; be easy to extract from the voice signal; not be susceptible to mimicry; and be stable over time and unaffected by the speaker's health. The number of features should also be considered, since the number of training samples required for reliable density estimation grows exponentially with the number of features. Furthermore, lower-dimensional features bring computational savings.

Proposed Solution

One of the components of any speaker verification system is front-end processing.
Front-end processing generally consists of some form of voice activity detection (VAD) to remove non-speech sections of the signal, followed by the extraction of features that carry the speaker's identity from the speech signal. The extracted feature vectors are then used either to build a model of the speaker or to test against an existing model and decide whether the person is who they claim to be. Before proceeding to front-end processing, however, a voice signal is required. A dataset of 44 words is recorded from 12 male and 12 female volunteers raised in the province of Manitoba. These 44 words were chosen to provide enough speech data to build a model while keeping the recording short, making the approach practical for a real-world application. Furthermore, drawing the volunteers from a specific geographical location may limit the variety of accents and forms of speech, so the analysis will be based on the physiological factors of the speaker.

The traditional approach to the speaker recognition problem involved the use of linear methods. However, the speech production process is not linear. Speech has nonlinear characteristics, and its multifractal nature has been demonstrated. A VAD based on the fractal dimension (FD) is used to separate out the non-speech segments of the signal. FD is chosen because it is estimated from the complexity of the signal and not from its amplitude. Fusion, the combination of information from multiple sources [KiLi10], is used to combine the nonlinear methods with traditional methods and form the feature vectors. The features used to form the feature vector are linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), Higuchi fractal dimension (HFD), variance fractal dimension (VFD), zero-crossing rate (ZCR), and turns count (TC).
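To make three of the simpler features concrete, the following is a minimal pure-Python sketch of ZCR, TC, and HFD computed on a single frame and concatenated into one vector. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the default `k_max=8`, and the absence of an amplitude threshold in the turns count are all hypothetical choices, and LPCC, MFCC, and VFD are omitted for brevity.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)


def turns_count(frame):
    """Number of slope sign changes (local extrema).

    Assumption: no amplitude threshold is applied to small turns.
    """
    turns = 0
    for prev, cur, nxt in zip(frame, frame[1:], frame[2:]):
        if (cur - prev) * (nxt - cur) < 0:
            turns += 1
    return turns


def higuchi_fd(frame, k_max=8):
    """Higuchi fractal dimension: slope of log L(k) against log(1/k).

    Assumes a non-constant frame; k_max=8 is an illustrative default.
    """
    n = len(frame)
    if k_max < 2 or n <= k_max:
        raise ValueError("need k_max >= 2 and len(frame) > k_max")
    xs, ys = [], []
    for k in range(1, k_max + 1):
        lengths = []
        for m in range(k):  # starting offset of the sub-sampled curve
            steps = (n - 1 - m) // k
            if steps < 1:
                continue
            dist = sum(
                abs(frame[m + i * k] - frame[m + (i - 1) * k])
                for i in range(1, steps + 1)
            )
            # Normalise for the number of steps, then divide by k.
            lengths.append(dist * (n - 1) / (steps * k) / k)
        xs.append(math.log(1.0 / k))
        ys.append(math.log(sum(lengths) / len(lengths)))
    # Ordinary least-squares slope of ys against xs.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def frame_features(frame):
    """Fuse the per-frame features by simple concatenation."""
    return [zero_crossing_rate(frame), turns_count(frame), higuchi_fd(frame)]
```

On a straight ramp the HFD estimate is close to 1, while on white noise it approaches 2, which is why a complexity-based measure of this kind can separate structured speech from noise-like segments independently of amplitude.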
The theory and programming of these algorithms are discussed in depth in Chapter 3, and the motivation behind their use is discussed in Section 5.4. After the feature vectors are extracted, a Support Vector Machine (SVM) is used to build a model of the speaker and to test it against unseen data. The SVM is chosen because of the availability of different kernel functions suitable for different types of features and the availability of highly optimized libraries.

Thesis Formulation

This thesis comprises three parts: recording a dataset, front-end processing, and classification. The next section discusses the thesis statement, followed by the thesis objective and research questions.

Thesis Statement

The core of this thesis is to evaluate the suitability of incorporating fractal methods, given the nature of speech, into the front-end process of a speaker verification system and to investigate the effectiveness of these methods. Before proceeding to the front-end processes of any speaker verification system, however, a voice signal is needed. Therefore, volunteers are recorded and the acquired signals are stored in an archive.
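The train-then-verify flow of the SVM step described above can be sketched as follows. This is a simplified stand-in, not the method used in this work: instead of an optimized library with kernel functions, it trains a small linear SVM by full-batch sub-gradient descent on the regularized hinge loss. The function names, the hyperparameters (`lam`, `lr`, `epochs`), the zero decision threshold, and the toy 2-D "feature vectors" are all hypothetical.

```python
def decision_score(w, b, x):
    """Signed distance-like score; positive means 'target speaker'."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(features, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * ||w||^2 + mean hinge loss by sub-gradient descent.

    labels must be +1 (target speaker) or -1 (impostor).
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for x, y in zip(features, labels):
            if y * decision_score(w, b, x) < 1.0:  # margin violated
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def verify(w, b, x, threshold=0.0):
    """Accept the identity claim if the score exceeds the threshold."""
    return decision_score(w, b, x) > threshold


# Toy, linearly separable data standing in for real feature vectors.
target_frames = [[2.0, 2.1], [1.8, 2.4], [2.5, 1.9]]
impostor_frames = [[-2.0, -1.7], [-1.5, -2.2], [-2.3, -2.0]]
train_x = target_frames + impostor_frames
train_y = [1] * len(target_frames) + [-1] * len(impostor_frames)

w, b = train_linear_svm(train_x, train_y)
accepted = verify(w, b, [2.2, 2.0])     # unseen frame near the target
rejected = verify(w, b, [-1.9, -2.1])   # unseen frame near the impostors
```

In a practical system the threshold would be tuned on held-out data to trade off false acceptances against false rejections, and a kernel SVM from an optimized library would replace this linear sketch.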