Unisound Launches U2-ASR 2.5: The First Large Model for Semantic Transcription of Chinese Dialects

Unisound AI Technology Co., Ltd. launched U2-ASR 2.5, enabling accurate Chinese dialect recognition and semantic understanding.

New York, NY, 05/13/2026 / SubmitMyPR /

In January this year, Unisound released UniGPT-Audio 2.0, its flagship speech large model built for the real-world audio environment. With three core capabilities—full-scenario ASR, highly human-like TTS, and full-duplex millisecond-level response—it redefined the performance benchmark for human-machine interaction.

Today, after multiple rounds of algorithm iteration and targeted training on large-scale regional speech corpora, UniGPT-Audio 2.0 has completed a new round of capability upgrades and officially introduced U2-ASR 2.5, the first large model for semantic transcription of Chinese dialects. The model fully covers seven major dialect systems, supports recognition and transcription of more than 100 dialects and regional accents, and reaches over 90% of the dialect-speaking population. On this foundation, it further connects the full chain of “dialect recognition - semantic restoration - Mandarin expression,” enabling obscure, colloquial, and region-specific dialect expressions to be converted into standardized, accurate, and understandable Mandarin text. In other words, AI can not only hear Chinese dialects clearly, but truly understand voices from across the country.

In the latest round of evaluation, U2-ASR 2.5 delivered a robust and compelling set of dialect recognition results. On Unisound’s proprietary industrial-grade dialect test set, UniGPT-Audio outperformed mainstream ASR models overall. From Northern dialect varieties to Southwestern Mandarin, and from Cantonese to Central China accents, multiple dialect recognition accuracy scores exceeded 90%: Jinan dialect reached 96.2%, Sichuan dialect 94.7%, Cantonese 93.0%, and Wuhan dialect 92.1%. These results fully validate UniGPT-Audio’s industry-leading foundation in dialect ASR, especially in challenging scenarios involving significant accent differences, complex regional expressions, and frequent mixing of dialects and Mandarin.


At the same time, U2-ASR 2.5 also demonstrated strong performance on general Chinese and English recognition tasks. Across public test sets including AISHELL, FLEURS, LibriSpeech, WenetSpeech Meeting, and KeSpeech, the model continued to achieve excellent results: AISHELL-1 reached 99.2%, LibriSpeech test-clean reached 98.4%, and AISHELL-3 reached 98.4%. This means the model does not simply add dialect recognition on top of general ASR. Instead, it extends from a solid Chinese-English speech recognition foundation into the more challenging domain of dialect recognition.

The key breakthrough of this upgrade lies in the fact that, after completing dialect speech transcription, the model further introduces dialect lexical meaning mapping, contextual intent recognition, and Mandarin semantic restoration. It can transform obscure, colloquial, and region-specific dialect expressions into more standardized, accurate, and easily understandable Mandarin text.

01
Technology Breakdown: How Does It Become “Dialect-Fluent”?

Dialect recognition is difficult because it does not deal with a single standardized language system, but with extremely complex real-world voice samples and expression patterns.

Across different regions, age groups, and contexts, the same dialect may vary significantly. The same word may also have different pronunciations, written forms, and meanings in different places. When factors such as recording device differences, environmental noise, speech-rate variation, and dialect-Mandarin code-mixing are added, dialect ASR is not a simple speech-to-text task; it is a systematic speech-understanding effort.

To address this engineering challenge, U2-ASR 2.5 has been systematically optimized across three critical links: data, decoding, and semantic understanding.

Data: First Teach the Model the Sounds of the Real World

The difficulty of dialect recognition often lies not in the model itself, but in the data.

Compared with Mandarin corpora, dialect data naturally faces problems such as scattered samples, inconsistent recording conditions, non-unified transcription standards, and more frequent homophones, variant written forms, and mixed loanwords. To address these challenges, we built a closed-loop data governance system combining real-world data collection, public corpus supplementation, semi-supervised expansion, and manual calibration. Through multiple processing steps such as VAD, denoising, deduplication, utterance segmentation, and confidence filtering, we improved the purity and consistency of trainable data, while using speech synthesis and data augmentation to expand the sample scale.
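The cleaning steps above (VAD, deduplication, confidence filtering) can be illustrated with a toy pipeline. The helper names, thresholds, and sample data below are illustrative assumptions, not Unisound's actual implementation, which would use trained VAD models and far richer filtering.

```python
import hashlib

def energy_vad(samples, frame=160, threshold=0.01):
    """Toy voice-activity detection: keep frames whose mean absolute
    amplitude exceeds a threshold (real systems use trained VAD models)."""
    kept = []
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        if sum(abs(x) for x in chunk) / frame > threshold:
            kept.extend(chunk)
    return kept

def dedupe(utterances):
    """Drop exact duplicate transcripts via content hashing."""
    seen, out = set(), []
    for text, conf in utterances:
        key = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append((text, conf))
    return out

def confidence_filter(utterances, min_conf=0.8):
    """Keep only utterances whose ASR confidence clears a floor."""
    return [(t, c) for t, c in utterances if c >= min_conf]

# Hypothetical (transcript, confidence) pairs from a labeling pass
corpus = [("nong hao", 0.95), ("nong hao", 0.95), ("mmm", 0.30)]
clean = confidence_filter(dedupe(corpus))
```

Each stage here is independent and composable, which mirrors the closed-loop idea in the text: data can be re-run through the chain as new collection rounds and calibration passes arrive.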

In response to the reality that even within the same dialect “pronunciation can differ every ten miles,” model training no longer relies on coarse-grained classification by dialect name. Instead, on top of a unified speech foundation, the model learns transferable pronunciation patterns through cross-regional sampling and pronunciation variant modeling, rather than relying on accent templates from a small number of samples. This enables stable recognition across broader dialect areas.

Decoding: Maintaining Continuity and Stability in Mixed-Language Contexts

In real conversations, dialects, Mandarin, and English often do not appear in separate segments. Instead, they alternate and interweave at the word or phrase level. For this reason, we introduced finer-grained language boundary detection, built on three technical innovations.

First, a language boundary prediction module is introduced at the model input layer to predict in real time when language switching may occur. Second, a dynamic language attention mechanism is designed to automatically adjust the weighting of dialect, Mandarin, and English language models during decoding based on current speech features. Third, a language-switching corpus at the scale of tens of thousands of hours is constructed to cover common dialect-Mandarin mixed-expression patterns.
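A minimal sketch of the dynamic-weighting idea in the second innovation: per-frame language scores are softmax-normalized into attention weights that mix language-specific model probabilities during decoding. All names and numbers below are illustrative assumptions, not the released model's internals.

```python
import math

def softmax(scores):
    """Normalize raw per-language scores into attention weights."""
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def mixed_lm_score(token, lang_scores, lm_probs):
    """Weight each language model's probability for `token` by the
    attention weight inferred from the current speech features."""
    weights = softmax(lang_scores)
    return sum(weights[lang] * lm_probs[lang].get(token, 1e-9)
               for lang in weights)

# Hypothetical frame where the acoustics favor Mandarin
lang_scores = {"mandarin": 2.0, "cantonese": 0.5, "english": -1.0}
lm_probs = {
    "mandarin": {"ni3": 0.4},
    "cantonese": {"nei5": 0.5},
    "english": {"you": 0.6},
}
score = mixed_lm_score("ni3", lang_scores, lm_probs)
```

Because the weights are recomputed from the current speech features, the mixture can shift smoothly as a speaker code-switches mid-sentence, rather than hard-committing to one language per utterance.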

From Hearing Clearly to Understanding Truly: A Semantic-Layer Upgrade

This upgrade does not stop at “hearing what was said.” It moves further toward “understanding what was meant.”

After dialect speech transcription is completed, we use dialect lexical meaning mapping, contextual intent recognition, and multi-source knowledge fusion to restore the semantics of the original expression and output Mandarin text that is easier to understand.

This means our large model does more than record dialect content word by word. While preserving the original expression, it can also normalize and interpret it, providing clearer and more usable input for subsequent capabilities such as intent understanding and task execution.

From this perspective, U2-ASR 2.5 is not only an upgrade in ASR recognition capability, but also a leap forward in speech understanding.

02
From “Can Recognize” to “Can Recognize Reliably”: The Engineering Challenge of Dialect Speech

In real business scenarios, a model must not only recognize accurately, but also remain stable under complex conditions such as noise, device differences, multi-speaker concurrency, and long-duration operation. What Unisound cares about more is whether speech capability can move from laboratory tests to industrial-grade deployment.

With this goal in mind, U2-ASR 2.5 builds an end-to-end engineering system that spans front-end signal processing, model adaptation, hotword enhancement, inference optimization, and back-end error correction. This enables dialect recognition not only to “score high,” but also to “work stably.”

High Recognition Accuracy: Win First on Accuracy, Then Win in Complex Scenarios

In dialect speech recognition, accuracy depends not only on whether the model can “hear” the dialect, but also on whether it can stably understand user intent amid complex inputs such as accent differences, dialect-Mandarin code-mixing, and colloquial expressions.

From Mandarin and Jin to Wu and Xiang, and from Gan and Min to Hakka and Yue/Cantonese, U2-ASR 2.5 continues to expand its capability boundaries across multiple major Chinese dialect systems. It covers real expression scenarios across northern and southern regions, multiple language families, and diverse accents. In representative system samples, it demonstrates more stable and accurate dialect recognition capability. On Unisound’s proprietary industrial-grade dialect test set, its overall recognition performance leads mainstream ASR models.

At the same time, U2-ASR 2.5 maintains excellent performance on public Chinese-English test sets such as AISHELL, LibriSpeech, and FLEURS, further validating its solid general ASR foundation.

This means U2-ASR 2.5 is not merely “scoring high” on a single dialect. It continues to lead across broader, more complex, and more real-world speech scenarios. It can cover richer regional expressions and adapt to more complex accent differences, bringing dialect speech recognition from “usable” to “easy to use.”

High-Noise Recognition: Understanding Both Night Markets and Hospital Waiting Areas

The real world is never a recording studio. At breakfast stalls, night markets, government service halls, hospital waiting areas, and customer service centers, background noise is complex, speakers may be at varying distances, and multiple voices may overlap. In such conditions, traditional ASR models are prone to missed words, misrecognitions, and broken semantics.

Before speech enters the model, U2-ASR 2.5 uses multi-channel denoising, adaptive echo cancellation, and non-stationary noise optimization to preprocess complex acoustic interference, suppressing noise while preserving valid speech information as much as possible. Combined with robustness modeling and endpoint detection optimization, the model can capture valid speech more accurately and reduce the impact of device differences and environmental noise. Even in high-noise and high-interference real-world scenarios, it can maintain strong recognition stability.

Domain Enhancement: Understanding Both Dialects and Business Context

In scenarios such as healthcare, government services, and customer service, user expressions often include not only dialects, but also a large number of professional terms, business terms, and proper nouns.

Unisound supports dynamic hotword injection and domain lexicon adaptation. For professional scenarios such as healthcare, government services, and customer service, it can enhance recognition for high-frequency terms, proper nouns, and business keywords, reducing the probability of misrecognition and making dialect recognition results closer to the intended business semantics.
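A common way to realize this kind of hotword enhancement is n-best rescoring with a per-hit bonus. The sketch below uses token-level matching and made-up scores to show the idea; it is not Unisound's production mechanism.

```python
def rescore_with_hotwords(hypotheses, hotwords, boost=1.0):
    """Add a bonus to each hypothesis score per domain hotword it
    contains, then return the top-scoring transcript."""
    rescored = []
    for text, score in hypotheses:
        tokens = set(text.split())
        hits = sum(1 for w in hotwords if w in tokens)
        rescored.append((text, score + boost * hits))
    return max(rescored, key=lambda pair: pair[1])[0]

# Hypothetical n-best list from a medical dictation scenario
hyps = [("patient reports hypertension", 4.0),
        ("patient reports high tension", 4.3)]
best = rescore_with_hotwords(hyps, {"hypertension"})
```

Because the hotword list is injected at decoding time rather than baked into training, the same base model can be retargeted to healthcare, government, or customer-service vocabularies without retraining.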

This is also an important capability that differentiates U2-ASR 2.5 from ordinary ASR models: it understands not only language, but also scenarios.

Low-Latency Response: Stronger Recognition, Faster Response

Through model quantization, operator fusion, streaming decoding, and server-side concurrency scheduling optimization, U2-ASR 2.5 compresses the inference pipeline and reduces the computational overhead introduced by complex dialect recognition. At the same time, through re-scoring and error-correction mechanisms, it verifies and corrects fine-grained issues such as pronoun confusion, misrecognition of modal particles, and colloquial expressions, making output results not only faster, but also more stable and usable.
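The error-correction pass mentioned above can be illustrated with a simple substitution table for known confusion pairs; production systems would use learned rescoring models rather than a fixed dictionary, and the pairs below are invented for illustration.

```python
def correct_transcript(text, corrections):
    """Toy post-decoding correction pass: rewrite known misrecognitions
    (e.g. confused modal particles) using a correction table."""
    for wrong, right in corrections.items():
        text = text.replace(wrong, right)
    return text

# Hypothetical confusion pairs for a dialect-to-Mandarin pass
table = {"he la": "he le", "zan men": "za men"}
fixed = correct_transcript("ni chi fan he la ma", table)
```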

03
Application Scenarios: Bringing Technology Back to Human Warmth

In China, dialects remain the most natural and familiar way of communication for many people. Especially in government services, healthcare, customer service, and elderly-friendly services, differences in language habits may still affect the efficiency of information delivery and the service experience.

In the era of large models, speech interaction should not only adapt to standardized expression. It should also better understand the natural expressions of real people.

Smart Government Services

At grassroots government service windows and convenient service terminals, people are often more accustomed to expressing their needs in dialect. U2-ASR 2.5 can help systems understand dialect expressions more accurately and convert them into standardized, processable Mandarin text, reducing the cost of repeated communication and allowing public services to reach users in different regions more naturally.

Smart Healthcare

In hospital guidance, consultation records, follow-up communication, and other scenarios, patients’ accents, expression habits, and professional terms are intertwined, which can easily affect the efficiency of information recording and understanding. Through anti-noise optimization and medical hotword enhancement, U2-ASR 2.5 can help systems more stably recognize patients’ chief complaints and key information, reducing communication costs caused by accent differences.

Smart Finance and Insurance

In banking, insurance, claims settlement, and other scenarios, user expressions often include dialect accents, colloquial descriptions, financial and insurance terms, and complex business information. If key information is not accurately recognized, subsequent verification, review, and service efficiency may be affected. By combining dialect recognition, domain hotword enhancement, and semantic understanding, U2-ASR 2.5 can more stably identify key information such as claims, disease names, coverage scope, and expense details, while converting colloquial and dialectal expressions into standardized, processable Mandarin text. This enhances the accuracy, traceability, and service credibility of claims document organization and risk review.

Smart Customer Service

In regions where dialects are frequently used, users are not always willing or able to switch to standard Mandarin. For hotline customer service, intelligent outbound calling, intelligent agent assistance, and related scenarios, U2-ASR 2.5 supports more natural dialect expression recognition, helping customer service systems understand user needs faster, reduce repeated confirmations, and improve service efficiency and interaction experience.

Culture, Tourism and Content Creation

In cultural and tourism promotion, documentary production, local cultural documentation, and related scenarios, large volumes of authentic and vivid dialect materials are often difficult to organize and distribute efficiently. U2-ASR 2.5 can convert dialect speech into text content that is easier to understand, edit, and search, providing new technical support for local cultural communication, intangible cultural heritage documentation, and content production.

Every dialect is a complete system of meaning, carrying local life experience and cultural memory. Understanding dialects is not merely recognizing a piece of audio; it is accurately capturing user intent amid complex accents, mixed expressions, and real contexts. The launch of U2-ASR 2.5 marks Unisound’s exploration from “hearing clearly” to “understanding truly.”

Looking ahead, Unisound will continue to expand its dialect speech capabilities to cover richer regional expressions, more complex real-world scenarios, and more diverse user needs, enabling AI to truly understand everyone’s natural expression.

Currently, the UniGPT-Audio model family, including U2-ASR, U2-TTS, and U2-TTS-Clone, has been fully launched on Unisound’s Token Hub large model service platform. Standard APIs are now available, supporting one-click integration, on-demand invocation, token-based billing, and flexible, controllable usage.
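One-click integration over a standard API typically reduces to building a small JSON request. The field names and payload shape below are hypothetical, since this release does not publish the Token Hub API schema; consult the platform documentation at maas.unisound.com for the real contract.

```python
import json

def build_asr_request(audio_url, model="U2-ASR", dialect_hint=None):
    """Build a JSON request body for a speech-recognition call.
    Field names here are illustrative, not the documented schema."""
    payload = {"model": model, "audio_url": audio_url}
    if dialect_hint:
        payload["dialect_hint"] = dialect_hint
    return json.dumps(payload, ensure_ascii=False)

req = build_asr_request("https://example.com/sample.wav",
                        dialect_hint="sichuan")
```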

Experience now:

U2-ASR: https://maas.unisound.com/models/asr

U2-TTS: https://maas.unisound.com/models/tts

U2-TTS-Clone: https://maas.unisound.com/models/clone

Company Name: Unisound AI Technology Co., Ltd.

Contact Person: Zhou Ziding

Email: zhouziding@unisound.com

Country: China

Website: https://www.unisound.com/

Disclaimer: This press release may contain forward-looking statements. Forward-looking statements describe future expectations, plans, results, or strategies (including product offerings, regulatory plans and business plans) and may change without notice. You are cautioned that such statements are subject to a multitude of risks and uncertainties that could cause future circumstances, events, or results to differ materially from those projected in the forward-looking statements, including the risks that actual results may differ materially from those projected in the forward-looking statements.

Original source of the story >> Unisound Launches U2-ASR 2.5: The First Large Model for Semantic Transcription of Chinese Dialects