
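Alibaba Cloud Document Intelligence is a multimodal document recognition and understanding engine built on years of technical accumulation. It provides text extraction and document processing for all kinds of documents, and supports diverse document-processing needs in general, industry-specific, and custom scenarios.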
To jointly optimize the waveform-domain adversarial loss, we employ a multi-period discriminator (MPD) [20-21] and a multi-scale discriminator (MSD) [20-21] to examine the speech signal from two different perspectives. The MSD is derived from the MelGAN vocoder: average pooling successively halves the length of the speech sequence, convolution is applied to the speech signal at each scale, and the result is flattened and output. The MPD folds the single-channel audio sequence into two-channel audio using different fixed lengths, called periods, and then applies 2-D convolution to the folded data. The disadvantage of this approach is that the folded data in each channel is mixed with artifacts of different frequencies. To make up for this defect, we propose a multi-length discriminator (MLD) that improves the ability to discriminate synthetic from real audio as much as possible. First, the single-channel audio is folded into multi-channel audio by a wavelet transform [20]. Then 1-D dilated convolution is applied as in [22]. In this way, each channel of the folded data contains few or no artifacts of other frequencies, which keeps the discrimination stable and accurate.
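The two input transformations described above (successive halving for the MSD, folding by period for the MPD) can be illustrated with a minimal pure-Python sketch. The function names, the pooling kernel of 2 with stride 2, and the zero padding before folding are assumptions made for illustration; the text does not specify them, and this is not the authors' implementation.

```python
def average_pool_by_two(signal):
    """Halve the sequence length by averaging adjacent pairs.

    Assumption: kernel 2, stride 2; a trailing odd sample is dropped.
    """
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]


def multi_scale_views(signal, n_scales=3):
    """Return the signal at successively halved lengths (MSD-style inputs).

    Assumption: n_scales is a hypothetical parameter, not given in the text.
    """
    views = [list(signal)]
    for _ in range(n_scales - 1):
        views.append(average_pool_by_two(views[-1]))
    return views


def fold_by_period(signal, period):
    """Fold a 1-D sequence into rows of fixed length `period` (MPD-style input).

    Assumption: the tail is zero-padded so the length divides evenly.
    """
    padded = list(signal)
    remainder = len(padded) % period
    if remainder:
        padded.extend([0.0] * (period - remainder))
    return [padded[i:i + period] for i in range(0, len(padded), period)]


if __name__ == "__main__":
    samples = [float(n) for n in range(8)]
    for view in multi_scale_views(samples):
        print(len(view), view)
    print(fold_by_period(samples[:7], period=3))
```

In the folded layout, each column holds samples spaced one period apart, which is what lets a 2-D convolution look at periodic structure in the waveform.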
The generator of PLCNet is a symmetric encoder-decoder structure with skip connections and residual units. The encoder and decoder each have 4 sub-modules. Each sub-module of the encoder consists of 3 residual units followed by a down-sampling module, and each sub-module of the decoder consists of an up-sampling module followed by 3 residual units. The residual unit alternates between a 1-D dilated convolution with a kernel size of 7 and a 1-D convolution with a kernel size of 1. The dilation rate increases gradually, following (1, 3, 9). The input is first transformed by a 1-D convolution with a kernel size of 7. The encoder then maps the 16 kHz waveform to a 50 Hz representation through down-sampling blocks with strides (2, 4, 5, 8), implemented as strided convolutions. The decoder uses transposed convolutions to up-sample in reverse order and restores the features to the same dimension as the speech. The number of channels is doubled at each down-sampling step and halved at each up-sampling step. The middle bottleneck layer acts as a bridge between the encoder and decoder and consists of three 1-D convolutions with a kernel size of 7. A skip connection links the corresponding layers of the encoder and decoder so that information such as phase or alignment can pass through. We use the ELU activation function [19] and weight normalization in the generator to guarantee the stability of adversarial training. Finally, the output of the decoder is a mono signal, with tanh limiting the output range to [-1, 1]. To be able to process real-time audio streams on low-power mobile devices, all our convolutions are causal.
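The arithmetic behind this description can be checked with a short sketch: the strides (2, 4, 5, 8) multiply to 320, so 16 kHz input yields 50 frames per second. The helper names and the base channel count of 32 are hypothetical, chosen only for illustration, and the receptive-field figure assumes one dilation rate per residual unit, which the text suggests but does not state outright.

```python
from math import prod

SAMPLE_RATE_HZ = 16_000   # from the text
STRIDES = (2, 4, 5, 8)    # from the text
DILATIONS = (1, 3, 9)     # from the text
KERNEL_SIZE = 7           # from the text
BASE_CHANNELS = 32        # assumption: not given in the text


def frame_rate(sample_rate_hz, strides):
    """Frame rate after all strided down-sampling convolutions."""
    return sample_rate_hz / prod(strides)


def channel_plan(base_channels, n_stages):
    """Channel count before the encoder and after each stage (doubling)."""
    return [base_channels * 2 ** i for i in range(n_stages + 1)]


def causal_left_padding(kernel_size, dilation):
    """Left padding that makes a dilated convolution depend only on the past."""
    return (kernel_size - 1) * dilation


def residual_stack_receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated convolutions.

    Kernel-size-1 convolutions add nothing, so only the dilated ones count.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    print(frame_rate(SAMPLE_RATE_HZ, STRIDES))
    print(channel_plan(BASE_CHANNELS, len(STRIDES)))
    print([causal_left_padding(KERNEL_SIZE, d) for d in DILATIONS])
    print(residual_stack_receptive_field(KERNEL_SIZE, DILATIONS))
```

Padding only on the left is one common way to make a convolution causal, since each output sample then sees no future input; the source does not say which padding scheme PLCNet uses.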

Related navigation

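Alibaba Cloud Real-Time Speech Recognition converts audio streams of unlimited duration into text in real time. It uses an industry-leading end-to-end recognition model with general character accuracy above 90%, and is used for live-stream subtitles, real-time meetings, court hearing records, and similar scenarios.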

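Machine Translation
Alibaba Cloud Machine Translation builds on advanced NLP technology to provide multimodal translation services covering 214 languages, with support for text, document, image, speech, and video translation. It is applied in cross-border e-commerce, education, healthcare, simultaneous interpretation for meetings, and other scenarios, ensuring 99.9% service reliability and efficient human-machine collaboration solutions.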

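Text Recognition
Alibaba Cloud OCR is a data intelligence product that recognizes text in images. It supports printed materials, cards and certificates, receipts, images, documents, and other file types, and offers full-stack, all-scenario text recognition capabilities.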

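Intelligent Data Labeling PAI-iTAG
Intelligent Data Labeling PAI-iTAG is a professional intelligent data labeling platform. It supports multimodal data such as images, text, audio, and video, and includes built-in intelligent pre-labeling tools and fine-grained task management to help you obtain high-quality training data efficiently and accelerate the deployment of AI applications.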

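Natural Language Processing NLP
Natural Language Processing is a core tool for text analysis and mining, offered to enterprises and developers of all kinds. It is already widely used in the business of customers in e-commerce, culture and entertainment, finance, logistics, and other industries. The Natural Language Processing API helps users build intelligent products such as content search, content recommendation, public-opinion identification and analysis, text structuring, and conversational robots, and customized solutions are also available through cooperation.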

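Vector Retrieval Service DashVector
Alibaba Cloud Vector Retrieval Service provides fully managed, cloud-native, efficient vector retrieval. It is designed for scenarios such as large-model knowledge bases and multimodal AI search, supports high-performance retrieval over hundreds of millions of vectors, and helps accelerate AI application development and business innovation.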

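Wuying Agent Development Kit AgentBay
Wuying AgentBay is an all-scenario AI Agent execution platform launched by Alibaba Cloud. It supports four environments (browser, cloud computer, code space, and cloud phone), offers second-level elastic scaling and operations at thousand-level concurrency, and integrates an enterprise-grade secure container solution to help agents run efficiently in scenarios such as deep research and financial analysis.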

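Intelligent Conversation Robot
Intelligent Conversation Robot is a 24/7 intelligent customer service solution based on the Tongyi Qianwen large model. It quickly learns knowledge from documents and web pages, provides a fluent, human-like conversation experience, and improves enterprise service efficiency and customer satisfaction across the board.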
