216 Commits

Author SHA1 Message Date
azalea 63de471300 [F] Use mp3 instead of ogg 2024-07-14 06:19:04 +08:00
azalea 7885045775 [O] Too long, don't read 2024-07-13 20:43:01 +08:00
azalea 8fa361c83b [F] Fix silent \n 2024-07-13 08:49:18 +08:00
azalea 58e2cf78e6 [O] Allow CORS 2024-07-13 04:23:20 +08:00
azalea 61f86358ac [+] x-www-form-data compatibility 2024-07-13 03:42:44 +08:00
azalea 7c84219238 [-] Remove .idea 2024-07-13 02:52:32 +08:00
azalea a3e0bc1a82 [+] Inference API 2024-07-13 02:51:16 +08:00
azalea d9345d73fa [-] Remove unused models_infer 2024-07-13 02:40:00 +08:00
Songting 96de086509 revert to whisper large v2 2023-11-16 21:08:30 +08:00
Songting bcce0787b3 Merge pull request #527 from DogeLord081/main
Update short_audio_transcribe.py to fix Whisper no short audios found issue
2023-11-14 17:14:46 +08:00
Songting aa3b066668 Adaptation to whisper large-v3 update 2023-11-14 17:13:54 +08:00
Danu Kim 1f102f3c09 Update short_audio_transcribe.py 2023-11-13 16:56:55 -05:00
Songting 3a19b67247 Merge pull request #506 from ALECX123/main
fix issue #156 UnicodeEncodeError
2023-10-28 14:36:10 +08:00
Alexc123 1c99b4120b fix issue #156 UnicodeEncodeError 2023-10-27 12:24:07 +08:00
Songting 739105573d Update requirements.txt 2023-10-21 23:08:58 +08:00
Songting 8b5acbc877 changed pyopenjtalk to pyopenjtalk-prebuilt 2023-09-16 17:37:25 +08:00
Songting e7dd856db8 Merge pull request #424 from ufownl/main
Fix the video2audio script
2023-08-30 20:48:39 +08:00
RangerUFO e3a941bad2 Fix the video2audio script
Prevent stripping of the "4" character at the end of the random number.
2023-08-30 20:40:03 +08:00
Songting 7946460e82 Merge pull request #395 from eltociear/main-1
Fix typo in modules.py
2023-08-14 00:24:34 +08:00
Ikko Eltociear Ashimine 4ab49d946a Fix typo in modules.py
Dialted -> Dilated
2023-08-14 01:14:29 +09:00
Songting d2b7db3687 Merge pull request #359 from ak8893893/main
Update LOCAL.md
2023-07-23 22:11:24 +08:00
Songting bd90394d1d Update README_ZH.md 2023-07-23 13:53:44 +08:00
Songting 3cc4ecedd5 Update README.md 2023-07-23 13:53:24 +08:00
Songting e84523e572 Update requirements.txt 2023-07-23 13:03:17 +08:00
AK becf5145e5 Merge branch 'Plachtaa:main' into main 2023-07-23 12:51:54 +08:00
AK e54309d74d Update LOCAL.md
correct spelling error
2023-07-23 12:51:27 +08:00
Songting 8a9535c0c7 Update README_ZH.md 2023-07-22 22:52:56 +08:00
Songting 2262cb8212 Update README.md 2023-07-22 22:52:14 +08:00
Songting 496dd8486a Update README_ZH.md 2023-07-22 22:46:48 +08:00
Songting e9441dbc7f Update README.md 2023-07-22 22:22:44 +08:00
Songting e01c21e65f Update README_ZH.md 2023-07-22 22:22:18 +08:00
Songting 758761ff99 Update README.md 2023-07-22 22:20:03 +08:00
Songting cad389b2ee Merge pull request #357 from ak8893893/main
Update preprocess_v2.py
2023-07-22 16:11:20 +08:00
AK 46d68bcf8e Update preprocess_v2.py
add comment
2023-07-22 01:30:32 +08:00
AK 2006097768 Update preprocess_v2.py
fix the error message of RecursionError: maximum recursion depth exceeded while calling a Python object
2023-07-22 01:16:35 +08:00
Songting 963089cc90 Update README_ZH.md 2023-07-18 14:46:37 +08:00
Songting 2e116a44ab Update README.md 2023-07-18 14:46:02 +08:00
Plachta a5a0fed4e1 Checkpoints will be saved to google drive during training 2023-07-13 20:33:47 +08:00
Songting cec3206028 Merge pull request #328 from ak8893893/main
Update LOCAL.md
2023-07-11 20:20:05 +08:00
AK ccaa1db0e3 Update LOCAL.md
modify the command from "python3.8" -> "python"
2023-07-11 18:23:38 +08:00
AK 0cdb8554b5 Update LOCAL.md
Add remove "separated" folder's training data
2023-07-11 18:15:10 +08:00
AK 1ba65b1b55 Update LOCAL.md
Add Windows version delete training data command
2023-07-11 18:02:37 +08:00
AK f2f877762a Update LOCAL.md 2023-07-11 17:45:27 +08:00
Songting b888c11b33 Merge pull request #325 from ak8893893/patch-1
Update DATA_EN.MD
2023-07-10 23:10:25 +08:00
AK ac21cd274e Update DATA_EN.MD 2023-07-10 22:15:22 +08:00
Songting 86d23945b6 Update LOCAL.md 2023-07-02 10:35:58 +08:00
Songting 460222c845 Merge pull request #310 from cloudxinn/patch-1
Update LOCAL.md
2023-07-02 10:35:17 +08:00
cloudxinn 7abb84e8b6 Update LOCAL.md
correct the command of step 2
2023-06-30 18:31:53 +08:00
Songting 3938f0581b Update LOCAL.md 2023-06-21 15:02:51 +08:00
Songting 3ba2c740f9 Update LOCAL.md 2023-06-21 02:04:08 +08:00
Songting 7c9bda5ec9 Update LOCAL.md 2023-06-21 02:03:43 +08:00
Songting 5c8d5aa943 Added ffmpeg dependency 2023-06-20 17:02:37 +08:00
Songting aa6feb1178 Added ffmpeg dependency 2023-06-20 17:01:32 +08:00
Songting b31e0a414c Added ffmpeg dependency 2023-06-20 17:00:09 +08:00
Songting 2ca4200641 Added author name of the Chinese model 2023-06-20 16:55:16 +08:00
Songting 18a3d273e8 Added author name of the Chinese model 2023-06-20 16:53:58 +08:00
Songting a24999f41b Merge pull request #281 from Artrajz/main
Modified the continue-training logic and added an option to train from scratch
2023-06-14 17:23:09 +08:00
Plachta 809232f8d8 Merge remote-tracking branch 'origin/main' 2023-06-14 15:36:59 +08:00
Plachta 9cd6f36696 Updated README_ZH.md 2023-06-14 15:36:48 +08:00
Emberstar a9b43a8afc update: make saving old models optional 2023-06-14 15:33:00 +08:00
Emberstar 6d65db1f76 fix: boolean arguments passed as False were treated as strings and became True 2023-06-14 12:23:40 +08:00
Emberstar 576424fe58 Changed how the latest model is loaded, changed the global_step calculation, added the preserved parameter, added the train_with_pretrained_model parameter 2023-06-14 10:18:19 +08:00
Songting e97b185188 Update LOCAL.md 2023-06-13 02:52:15 +08:00
Songting 76c9cb239d Update LOCAL.md 2023-06-13 02:49:02 +08:00
Plachta cb1f29d1ed Added capability of continue training from previous checkpoints 2023-06-12 19:25:30 +08:00
Plachta 9bbc9e9246 Added capability of continue training from previous checkpoints 2023-06-12 19:20:32 +08:00
Plachta 3b8d7b5ef4 Added capability of continue training from previous checkpoints 2023-06-12 19:11:52 +08:00
Plachta 631f97eff7 Added capability of continue training from previous checkpoints 2023-06-12 19:02:32 +08:00
Plachta 291d8ddf5e Added capability of continue training from previous checkpoints 2023-06-12 18:42:05 +08:00
Plachta 1d7e8fc637 Added guidance for training on local machine 2023-06-12 18:26:59 +08:00
Plachta 9398660323 Added guidance for training on local machine 2023-06-12 18:20:28 +08:00
Songting 889433e6f4 Merge pull request #244 from alaister123/main
Fix Error: short_audio_transcribe.py ValueError with whisper
2023-05-24 10:17:49 +08:00
GSSun 6959ddcfc2 Update short_audio_transcribe.py 2023-05-24 04:00:49 +08:00
Plachta 34c5d91a85 added beam search for short audio annotations 2023-05-23 21:30:58 +08:00
Songting bf560f2a68 Merge pull request #228 from FrankZxShen/patch-1
Fix error 'integer division or modulo by zero' in data_utils.py
2023-05-12 11:21:37 +08:00
FrankZxShen f1a6b82feb Fix error 'integer division or modulo by zero' in data_utils.py
An optional solution for #27 #50 #222 #227
2023-05-12 10:55:10 +08:00
Songting 1b2c9b9631 Merge pull request #223 from bhj8/main
Increase whisper progress
2023-05-10 17:36:24 +08:00
鲍洪江 83a6042731 Increase whisper progress 2023-05-08 16:11:31 +08:00
Plachta 3ae932ddaf updated VC_inference.py 2023-04-28 13:34:13 +08:00
Plachta f44da7617f rearranged repo 2023-04-28 12:49:15 +08:00
Songting 9459e87253 Update requirements.txt 2023-04-28 12:18:29 +08:00
Songting 7a0c67c8c3 Update requirements.txt 2023-04-28 12:05:58 +08:00
Songting 6b808d4a31 Update requirements.txt 2023-04-28 12:03:50 +08:00
Songting 6444839ec0 Update requirements.txt 2023-04-28 11:27:00 +08:00
Songting 8a4fdd263a Update requirements.txt 2023-04-28 10:39:41 +08:00
Songting 0fe10b449e Update preprocess_v2.py 2023-04-25 14:19:19 +08:00
Plachta f8b398f587 rearranged repo 2023-04-22 09:51:41 +08:00
Plachta 3d7e4220d4 rearranged repo 2023-04-21 21:39:41 +08:00
Plachta 2612e5dbcc rearranged repo 2023-04-21 21:19:26 +08:00
Plachta eb7eb8a022 rearranged repo 2023-04-21 21:17:45 +08:00
Plachta 05dbf649a1 Merge remote-tracking branch 'origin/main' 2023-04-21 20:43:12 +08:00
Plachta e33f8919d0 added new base model (pure Chinese) 2023-04-21 20:43:02 +08:00
Songting 8e1893daf7 Update cleaners.py 2023-04-20 15:19:52 +08:00
Songting b3e7ad0e50 Update download_video.py 2023-04-19 17:24:39 +08:00
Songting b35f4bc727 Merge pull request #187 from huangynn/patch-1
Update download_video.py
2023-04-18 15:40:25 +08:00
EvelynH 2f9f7c4b31 Update download_video.py
solve SSL: CERTIFICATE_VERIFY_FAILED problem
2023-04-17 16:27:57 +08:00
Songting 8160bd71d0 Merge pull request #180 from himekifee/main
Discord link
2023-04-15 18:43:12 +08:00
Grider 2c1276b8d9 Discord link 2023-04-15 09:22:27 +01:00
Songting 1f19649e92 Update long_audio_transcribe.py 2023-04-12 14:02:40 +08:00
Songting 0f1fc8cb99 Update requirements.txt 2023-04-11 15:12:51 +08:00
Songting 4f316f2f64 Update mel_processing.py 2023-04-04 18:47:20 +08:00
Songting c90fb9f63c Merge pull request #148 from RealityError/main
Generate VITS speech via the command line
2023-04-04 12:55:39 +08:00
Songting 02e3fd0a09 Update requirements.txt 2023-04-04 12:46:50 +08:00
Songting 41173bfec7 Adaptation to torch 2.0 2023-04-04 12:32:10 +08:00
realityerror c4ab2501e4 Add command-line generation
Generate speech files via the command line
2023-04-04 11:59:01 +08:00
Plachtaa ff4078c098 Update requirements.txt 2023-03-21 21:34:19 +08:00
Plachtaa 7b4273f514 Update DATA.MD 2023-03-14 15:17:44 +08:00
Plachtaa cbadc7c0db Update README_ZH.md 2023-03-14 10:42:14 +08:00
Plachtaa 56972cc455 Update README.md 2023-03-14 10:41:02 +08:00
Plachta 94a36f612f Merge remote-tracking branch 'origin/main' 2023-03-09 13:22:40 +08:00
Plachta 0630c258ea changed num_workers 2023-03-09 13:22:27 +08:00
Plachtaa d79aadb786 Added additional instructions for MoeGoe usage 2023-03-09 12:49:20 +08:00
Plachtaa aa01f7b73e Merge pull request #76 from lrioxh/main
Add language tags to Chinese/English/Japanese/Korean text
2023-03-07 13:37:39 +08:00
Plachta eff14d53e5 fix inference UI 2023-03-07 10:36:02 +08:00
Plachta d06efc5fe7 fix inference UI 2023-03-07 10:23:41 +08:00
lrioxh e7f0574fc8 Add tags via re 2023-03-07 09:58:05 +08:00
lrioxh 5aea258e26 Add tags via re 2023-03-07 09:44:43 +08:00
lrioxh ea491a457f Add tags via re 2023-03-07 09:39:56 +08:00
Plachta 5093ba0b9a supported auxiliary data for CJ model 2023-03-06 21:00:40 +08:00
Plachta bbe2638855 Merge remote-tracking branch 'origin/main' 2023-03-06 16:21:20 +08:00
Plachta 8a04e3e824 bug fix 2023-03-06 16:21:10 +08:00
Plachtaa fbe46caa3d Add files via upload 2023-03-05 16:33:29 +08:00
Plachtaa 1fa2fa8642 Update video2audio.py 2023-03-05 16:25:52 +08:00
Plachtaa 1698f0dd8a Update download_video.py 2023-03-05 16:25:38 +08:00
Plachtaa 11b2ea0f88 Merge pull request #60 from eve2ptp/main
multithreading event
2023-03-05 16:25:14 +08:00
eve2ptp 4a17d6580c Delete denoise_audio.py 2023-03-05 16:18:39 +08:00
Evilmass 4c45e6a74c multithreading event 2023-03-05 14:55:32 +08:00
Plachta de2a885ab2 will print transcribed text from long audios 2023-03-04 19:02:57 +08:00
Plachta f45e33c1ad Merge remote-tracking branch 'origin/main' 2023-03-04 14:09:44 +08:00
Plachta 15c4db56ba will print transcribed text from long audios 2023-03-04 14:09:30 +08:00
Plachtaa da3670a2d8 Merge pull request #49 from BushyToaster88/fix-torch-stft-error-on-gpus-sm-53
fix-torch-stft-error-on-gpus-sm-53
2023-03-03 14:10:23 +08:00
Exulan b92361b99f fix-torch-stft-error-on-gpus-sm-53
This pull request addresses an issue that arises when executing the finetune_speaker_v2.py script on GPUs with compute capability less than SM_53. The error occurs at line 104 of mel_processing.py, where the torch.stft() function is called with a half data type. To fix this, I updated the data type to float.
2023-03-03 16:17:28 +11:00
Plachtaa 8137fb9e06 Merge pull request #44 from justinjohn0306/main
Use gloo backend on Windows for Pytorch
2023-03-02 16:50:40 +08:00
Justin John 04b5e8e68c Merge branch 'Plachtaa:main' into main 2023-03-02 13:51:48 +05:30
Plachta 6addfeb97f fixed single speaker error 2023-03-02 14:42:25 +08:00
Plachta 13350512c7 updated error messages 2023-03-02 13:36:47 +08:00
Plachta 610f4a8b02 updated error messages 2023-03-02 13:34:32 +08:00
Plachta 03c9cd3ccb updated pure chinese option 2023-03-01 17:50:11 +08:00
Plachta 9f8982181d updated pure chinese option 2023-03-01 17:11:58 +08:00
Plachta 89eb6f5a83 Merge remote-tracking branch 'origin/main' 2023-03-01 16:59:07 +08:00
Plachta 952ce4b3a3 updated chinese cleaners 2023-03-01 16:58:57 +08:00
Plachtaa 86a60660bb Update DATA_EN.MD 2023-03-01 10:18:43 +08:00
Plachta e90e1b86f1 updated pipeline 2023-02-27 22:05:28 +08:00
Justin John 3482b9ce55 downgrade librosa 2023-02-27 11:30:34 +05:30
Justin John 730365fcde Use gloo backend on Windows for Pytorch 2023-02-27 10:36:35 +05:30
Plachta bf7042e454 updated pipeline 2023-02-27 00:39:32 +08:00
Plachta 03e3894fe4 updated pipeline 2023-02-27 00:22:00 +08:00
Plachta 4b2ea54c96 updated pipeline 2023-02-26 23:42:15 +08:00
Plachta 9c471d14ac updated pipeline 2023-02-26 22:15:01 +08:00
Plachta d212e70381 updated pipeline 2023-02-26 21:14:42 +08:00
Plachta e15159ba55 updated pipeline 2023-02-26 21:13:55 +08:00
Plachta ce418b1888 updated pipeline 2023-02-26 20:31:49 +08:00
Plachta 7d5b2f8547 updated pipeline 2023-02-26 20:31:00 +08:00
Plachta 7a1a8216d0 updated pipeline 2023-02-26 20:30:31 +08:00
Plachta 1a873a6f83 updated pipeline 2023-02-26 20:23:53 +08:00
Plachta 46cc84aec2 updated pipeline 2023-02-26 20:22:50 +08:00
Plachta 626ace0132 updated pipeline 2023-02-26 20:09:59 +08:00
Plachta 7d5c196cc7 updated pipeline 2023-02-26 20:08:37 +08:00
Plachta 6230f6e5b3 Merge remote-tracking branch 'origin/main'
# Conflicts:
#	README_ZH.md
2023-02-26 20:06:49 +08:00
Plachta 5b228b39be updated pipeline 2023-02-26 20:06:31 +08:00
Plachtaa b88092dd3a Update README_ZH.md 2023-02-25 18:28:49 +08:00
Plachtaa 8370ea7f35 Update README_ZH.md 2023-02-25 10:13:21 +08:00
Plachtaa b18bc53b12 Update README.md 2023-02-25 10:13:01 +08:00
Plachta cfa4cc9878 upload files 2023-02-23 17:15:14 +08:00
Plachta 4f09046710 upload files 2023-02-23 17:11:09 +08:00
Plachta 60bd93a26f upload files 2023-02-23 15:02:32 +08:00
Plachta 1bae830002 upload files 2023-02-23 14:57:46 +08:00
Plachta 6ed2e432fb upload files 2023-02-23 14:53:04 +08:00
Plachta 649919cca3 Merge remote-tracking branch 'origin/main' 2023-02-23 14:32:06 +08:00
Plachta 66d4920e69 upload files 2023-02-23 14:31:34 +08:00
Plachtaa a89610a6e4 Update preprocess.py 2023-02-21 17:21:57 +08:00
Plachta 98f6e845b6 upload files 2023-02-20 10:27:46 +08:00
Plachta 883eb06769 upload files 2023-02-20 00:54:20 +08:00
Plachta 20d91c6249 Merge remote-tracking branch 'origin/main' 2023-02-19 22:26:50 +08:00
Plachta 674ddf346a upload files 2023-02-19 22:26:21 +08:00
Plachta fc838f7e5f upload files 2023-02-19 22:25:37 +08:00
Plachtaa 95623e77a3 Update README.md 2023-02-19 18:17:39 +08:00
Plachtaa fce6ee58a5 Update README_EN.md 2023-02-19 18:17:24 +08:00
Plachtaa f5b2bcbaa4 Update README_ZH.md 2023-02-19 18:17:07 +08:00
Plachtaa 78fb90d9fc Update README_EN.md 2023-02-19 00:50:06 +08:00
Plachtaa 3ebc1befe5 Update README.md 2023-02-19 00:49:46 +08:00
Plachta 9cb391af38 upload files 2023-02-19 00:48:13 +08:00
Plachta 030bde4914 upload files 2023-02-19 00:47:10 +08:00
Plachta 93d5e26434 upload files 2023-02-19 00:46:39 +08:00
Plachta 22bc07925c upload files 2023-02-18 18:31:52 +08:00
Plachta e8fa96308c upload files 2023-02-18 07:13:43 +08:00
Plachta 3f40898dd1 upload files 2023-02-17 10:21:34 +08:00
Plachta 56f0dadc47 upload files 2023-02-16 20:35:35 +08:00
Plachta da6736fbf7 upload files 2023-02-16 20:20:06 +08:00
Plachta cbd32eb8b5 upload files 2023-02-16 20:19:45 +08:00
Plachta cc342ddc71 upload files 2023-02-16 20:17:21 +08:00
Plachta 9c3087ccba upload files 2023-02-16 20:15:00 +08:00
Plachta 2358177995 upload files 2023-02-16 20:14:27 +08:00
Plachta 2fdb60911c upload files 2023-02-16 20:00:53 +08:00
Plachta 044d386682 upload files 2023-02-16 19:59:35 +08:00
Plachta f735bc3d80 upload files 2023-02-16 19:54:14 +08:00
Plachta 018a0aa0d8 upload files 2023-02-16 19:46:23 +08:00
Plachta 6e8d3255a1 upload files 2023-02-16 19:40:47 +08:00
Plachta e12928fac9 upload files 2023-02-16 18:50:02 +08:00
Plachta f2287032e8 upload files 2023-02-16 18:33:06 +08:00
Plachta a1b6eb54e4 upload files 2023-02-16 18:29:06 +08:00
Plachta bacf4dbdff upload files 2023-02-16 18:01:33 +08:00
Plachta e160e1532f upload files 2023-02-16 17:57:41 +08:00
Plachta 3a6bf5adbb upload files 2023-02-16 17:41:23 +08:00
Plachta eed75db80d upload files 2023-02-16 17:09:40 +08:00
Plachta c9b8a0a446 upload files 2023-02-16 16:39:34 +08:00
Plachta 5031a30f1e upload files 2023-02-16 16:13:36 +08:00
Plachta 611965cb43 upload files 2023-02-16 15:58:30 +08:00
Plachta c7074471b1 upload files 2023-02-16 15:58:00 +08:00
Plachta 4e99bbd3c9 upload files 2023-02-16 15:57:28 +08:00
Plachta d60c12e9e5 upload files 2023-02-16 15:56:03 +08:00
Plachta 8d0698261c upload files 2023-02-15 17:30:17 +08:00
Plachta 6986709683 upload files 2023-02-15 17:18:24 +08:00
Plachta 71721fb7fa upload files 2023-02-15 16:45:35 +08:00
Plachta f72d72da5d upload files 2023-02-15 16:44:01 +08:00
Plachta 6246f7718d upload files 2023-02-15 16:42:50 +08:00
46 changed files with 1793 additions and 1303 deletions
+162
@@ -0,0 +1,162 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
-3
@@ -1,3 +0,0 @@
# Default ignored files
/shelf/
/workspace.xml
-12
@@ -1,12 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<module type="PYTHON_MODULE" version="4">
<component name="NewModuleRootManager">
<content url="file://$MODULE_DIR$" />
<orderEntry type="jdk" jdkName="Python 3.7 (VITS)" jdkType="Python SDK" />
<orderEntry type="sourceFolder" forTests="false" />
</component>
<component name="PyDocumentationSettings">
<option name="format" value="PLAIN" />
<option name="myDocStringFormat" value="Plain" />
</component>
</module>
-154
@@ -1,154 +0,0 @@
<component name="InspectionProjectProfileManager">
<profile version="1.0">
<option name="myName" value="Project Default" />
<inspection_tool class="PyPackageRequirementsInspection" enabled="true" level="WARNING" enabled_by_default="true">
<option name="ignoredPackages">
<value>
<list size="132">
<item index="0" class="java.lang.String" itemvalue="ccxt" />
<item index="1" class="java.lang.String" itemvalue="lz4" />
<item index="2" class="java.lang.String" itemvalue="pre-commit" />
<item index="3" class="java.lang.String" itemvalue="elegantrl" />
<item index="4" class="java.lang.String" itemvalue="setuptools" />
<item index="5" class="java.lang.String" itemvalue="ray" />
<item index="6" class="java.lang.String" itemvalue="gputil" />
<item index="7" class="java.lang.String" itemvalue="google-pasta" />
<item index="8" class="java.lang.String" itemvalue="tensorflow-estimator" />
<item index="9" class="java.lang.String" itemvalue="scikit-learn" />
<item index="10" class="java.lang.String" itemvalue="tabulate" />
<item index="11" class="java.lang.String" itemvalue="multitasking" />
<item index="12" class="java.lang.String" itemvalue="pickleshare" />
<item index="13" class="java.lang.String" itemvalue="pyasn1-modules" />
<item index="14" class="java.lang.String" itemvalue="ipython-genutils" />
<item index="15" class="java.lang.String" itemvalue="Pygments" />
<item index="16" class="java.lang.String" itemvalue="mccabe" />
<item index="17" class="java.lang.String" itemvalue="astunparse" />
<item index="18" class="java.lang.String" itemvalue="lxml" />
<item index="19" class="java.lang.String" itemvalue="Werkzeug" />
<item index="20" class="java.lang.String" itemvalue="tensorboard-data-server" />
<item index="21" class="java.lang.String" itemvalue="jupyter-client" />
<item index="22" class="java.lang.String" itemvalue="pexpect" />
<item index="23" class="java.lang.String" itemvalue="click" />
<item index="24" class="java.lang.String" itemvalue="ipykernel" />
<item index="25" class="java.lang.String" itemvalue="pandas-datareader" />
<item index="26" class="java.lang.String" itemvalue="psutil" />
<item index="27" class="java.lang.String" itemvalue="jedi" />
<item index="28" class="java.lang.String" itemvalue="regex" />
<item index="29" class="java.lang.String" itemvalue="tensorboard" />
<item index="30" class="java.lang.String" itemvalue="platformdirs" />
<item index="31" class="java.lang.String" itemvalue="matplotlib" />
<item index="32" class="java.lang.String" itemvalue="idna" />
<item index="33" class="java.lang.String" itemvalue="rsa" />
<item index="34" class="java.lang.String" itemvalue="decorator" />
<item index="35" class="java.lang.String" itemvalue="numpy" />
<item index="36" class="java.lang.String" itemvalue="pyasn1" />
<item index="37" class="java.lang.String" itemvalue="requests" />
<item index="38" class="java.lang.String" itemvalue="tensorflow" />
<item index="39" class="java.lang.String" itemvalue="tensorboard-plugin-wit" />
<item index="40" class="java.lang.String" itemvalue="Deprecated" />
<item index="41" class="java.lang.String" itemvalue="nest-asyncio" />
<item index="42" class="java.lang.String" itemvalue="prompt-toolkit" />
<item index="43" class="java.lang.String" itemvalue="keras-tuner" />
<item index="44" class="java.lang.String" itemvalue="scipy" />
<item index="45" class="java.lang.String" itemvalue="dataclasses" />
<item index="46" class="java.lang.String" itemvalue="tornado" />
<item index="47" class="java.lang.String" itemvalue="google-auth-oauthlib" />
<item index="48" class="java.lang.String" itemvalue="black" />
<item index="49" class="java.lang.String" itemvalue="toml" />
<item index="50" class="java.lang.String" itemvalue="Quandl" />
<item index="51" class="java.lang.String" itemvalue="pandas" />
<item index="52" class="java.lang.String" itemvalue="termcolor" />
<item index="53" class="java.lang.String" itemvalue="pylint" />
<item index="54" class="java.lang.String" itemvalue="typing_extensions" />
<item index="55" class="java.lang.String" itemvalue="cachetools" />
<item index="56" class="java.lang.String" itemvalue="debugpy" />
<item index="57" class="java.lang.String" itemvalue="isort" />
<item index="58" class="java.lang.String" itemvalue="pytz" />
<item index="59" class="java.lang.String" itemvalue="inflection" />
<item index="60" class="java.lang.String" itemvalue="Pillow" />
<item index="61" class="java.lang.String" itemvalue="traitlets" />
<item index="62" class="java.lang.String" itemvalue="absl-py" />
<item index="63" class="java.lang.String" itemvalue="protobuf" />
<item index="64" class="java.lang.String" itemvalue="joblib" />
<item index="65" class="java.lang.String" itemvalue="threadpoolctl" />
<item index="66" class="java.lang.String" itemvalue="opt-einsum" />
<item index="67" class="java.lang.String" itemvalue="python-dateutil" />
<item index="68" class="java.lang.String" itemvalue="gpflow" />
<item index="69" class="java.lang.String" itemvalue="astroid" />
<item index="70" class="java.lang.String" itemvalue="cycler" />
<item index="71" class="java.lang.String" itemvalue="gast" />
<item index="72" class="java.lang.String" itemvalue="kt-legacy" />
<item index="73" class="java.lang.String" itemvalue="appdirs" />
<item index="74" class="java.lang.String" itemvalue="tensorflow-probability" />
<item index="75" class="java.lang.String" itemvalue="pip" />
<item index="76" class="java.lang.String" itemvalue="pyzmq" />
<item index="77" class="java.lang.String" itemvalue="certifi" />
<item index="78" class="java.lang.String" itemvalue="oauthlib" />
<item index="79" class="java.lang.String" itemvalue="pyparsing" />
<item index="80" class="java.lang.String" itemvalue="Markdown" />
<item index="81" class="java.lang.String" itemvalue="h5py" />
<item index="82" class="java.lang.String" itemvalue="wrapt" />
<item index="83" class="java.lang.String" itemvalue="kiwisolver" />
<item index="84" class="java.lang.String" itemvalue="empyrical" />
<item index="85" class="java.lang.String" itemvalue="backcall" />
<item index="86" class="java.lang.String" itemvalue="charset-normalizer" />
<item index="87" class="java.lang.String" itemvalue="multipledispatch" />
<item index="88" class="java.lang.String" itemvalue="pathspec" />
<item index="89" class="java.lang.String" itemvalue="jupyter-core" />
<item index="90" class="java.lang.String" itemvalue="matplotlib-inline" />
<item index="91" class="java.lang.String" itemvalue="ptyprocess" />
<item index="92" class="java.lang.String" itemvalue="more-itertools" />
<item index="93" class="java.lang.String" itemvalue="mypy-extensions" />
<item index="94" class="java.lang.String" itemvalue="cloudpickle" />
<item index="95" class="java.lang.String" itemvalue="wcwidth" />
<item index="96" class="java.lang.String" itemvalue="requests-oauthlib" />
<item index="97" class="java.lang.String" itemvalue="Keras-Preprocessing" />
<item index="98" class="java.lang.String" itemvalue="yfinance" />
<item index="99" class="java.lang.String" itemvalue="tomli" />
<item index="100" class="java.lang.String" itemvalue="urllib3" />
<item index="101" class="java.lang.String" itemvalue="six" />
<item index="102" class="java.lang.String" itemvalue="parso" />
<item index="103" class="java.lang.String" itemvalue="wheel" />
<item index="104" class="java.lang.String" itemvalue="ipython" />
<item index="105" class="java.lang.String" itemvalue="packaging" />
<item index="106" class="java.lang.String" itemvalue="lazy-object-proxy" />
<item index="107" class="java.lang.String" itemvalue="grpcio" />
<item index="108" class="java.lang.String" itemvalue="dm-tree" />
<item index="109" class="java.lang.String" itemvalue="google-auth" />
<item index="110" class="java.lang.String" itemvalue="seaborn" />
<item index="111" class="java.lang.String" itemvalue="thop" />
<item index="112" class="java.lang.String" itemvalue="torch" />
<item index="113" class="java.lang.String" itemvalue="torchvision" />
<item index="114" class="java.lang.String" itemvalue="d2l" />
<item index="115" class="java.lang.String" itemvalue="keyboard" />
<item index="116" class="java.lang.String" itemvalue="transformers" />
<item index="117" class="java.lang.String" itemvalue="phonemizer" />
<item index="118" class="java.lang.String" itemvalue="Unidecode" />
<item index="119" class="java.lang.String" itemvalue="nltk" />
<item index="120" class="java.lang.String" itemvalue="pinecone-client" />
<item index="121" class="java.lang.String" itemvalue="sentence-transformers" />
<item index="122" class="java.lang.String" itemvalue="whisper" />
<item index="123" class="java.lang.String" itemvalue="datasets" />
<item index="124" class="java.lang.String" itemvalue="pyaudio" />
<item index="125" class="java.lang.String" itemvalue="torchsummary" />
<item index="126" class="java.lang.String" itemvalue="openjtalk" />
<item index="127" class="java.lang.String" itemvalue="hydra-core" />
<item index="128" class="java.lang.String" itemvalue="museval" />
<item index="129" class="java.lang.String" itemvalue="mypy" />
<item index="130" class="java.lang.String" itemvalue="hydra-colorlog" />
<item index="131" class="java.lang.String" itemvalue="flake8" />
</list>
</value>
</option>
</inspection_tool>
<inspection_tool class="PyUnresolvedReferencesInspection" enabled="true" level="WARNING" enabled_by_default="true">
<option name="ignoredIdentifiers">
<list>
<option value="sentiment_classification.model_predictions.audio_path" />
<option value="sentiment_classification.model_predictions.sample_rate" />
<option value="sentiment_classification.model_predictions.num_samples" />
</list>
</option>
</inspection_tool>
</profile>
</component>
-6
@@ -1,6 +0,0 @@
<component name="InspectionProjectProfileManager">
<settings>
<option name="USE_PROJECT_PROFILE" value="false" />
<version value="1.0" />
</settings>
</component>
-4
@@ -1,4 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="ProjectRootManager" version="2" project-jdk-name="Python 3.7 (VITS)" project-jdk-type="Python SDK" />
</project>
-8
@@ -1,8 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="ProjectModuleManager">
<modules>
<module fileurl="file://$PROJECT_DIR$/.idea/VITS_voice_conversion.iml" filepath="$PROJECT_DIR$/.idea/VITS_voice_conversion.iml" />
</modules>
</component>
</project>
-6
@@ -1,6 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="VcsDirectoryMappings">
<mapping directory="$PROJECT_DIR$" vcs="Git" />
</component>
</project>
+42
@@ -0,0 +1,42 @@
The pipeline of this repository supports multiple ways of uploading voice samples; simply choose one or several of them depending on the samples you have.
1. Short audio clips packed in a `.zip` file and organized by character name; the archive structure should look like this:
```
Your-zip-file.zip
├───Character_name_1
├ ├───xxx.wav
├ ├───...
├ ├───yyy.mp3
├ └───zzz.wav
├───Character_name_2
├ ├───xxx.wav
├ ├───...
├ ├───yyy.mp3
├ └───zzz.wav
├───...
└───Character_name_n
├───xxx.wav
├───...
├───yyy.mp3
└───zzz.wav
```
Note that neither the format nor the names of the audio files matter, as long as they are audio files.
Quality requirement: longer than 2 seconds, shorter than 10 seconds, with as little background noise as possible.
Quantity requirement: at least 10 clips per character, preferably 20+ per character.
2. Long audio files named after the character; each audio must contain a single speaker only, and background sound will be removed automatically. Naming format: `{CharacterName}_{random_number}.wav`
(e.g. `Diana_234135.wav`, `MinatoAqua_234252.wav`); must be `.wav` files and no longer than 20 minutes (otherwise you will run out of memory).
3. Long video files named after the character; each video must contain a single speaker only, and background sound will be removed automatically. Naming format: `{CharacterName}_{random_number}.mp4`
(e.g. `Taffy_332452.mp4`, `Dingzhen_957315.mp4`); must be `.mp4` files and no longer than 20 minutes (otherwise you will run out of memory).
Note: in the file name, `CharacterName` must consist of English characters only; `random_number` distinguishes multiple files of the same character and must be included; it can be any integer between 0 and 999999.
4. A `.txt` file containing multiple lines of `{CharacterName}|{video_url}`, formatted as follows:
```
Char1|https://xyz.com/video1/
Char2|https://xyz.com/video2/
Char2|https://xyz.com/video3/
Char3|https://xyz.com/video4/
```
Each video must contain a single speaker only; background sound will be removed automatically. Currently only videos from bilibili are supported; URLs from other sites have not been tested.
If you have questions about the formats, you can find sample data for all formats [here](https://drive.google.com/file/d/132l97zjanpoPY4daLgqXoM7HKXPRbS84/view?usp=sharing).
+47
@@ -0,0 +1,47 @@
The pipeline of this repo supports multiple voice data uploading options; you can choose one or more of them depending on the data you have.
1. Short audios packed in a single `.zip` file, whose file structure should be as shown below:
```
Your-zip-file.zip
├───Character_name_1
├ ├───xxx.wav
├ ├───...
├ ├───yyy.mp3
├ └───zzz.wav
├───Character_name_2
├ ├───xxx.wav
├ ├───...
├ ├───yyy.mp3
├ └───zzz.wav
├───...
└───Character_name_n
├───xxx.wav
├───...
├───yyy.mp3
└───zzz.wav
```
Note that the format of the audio files does not matter, as long as they are audio files.
Quality requirement: >=2s, <=10s, contain as little background sound as possible.
Quantity requirement: at least 10 per character, 20+ per character is recommended.
2. Long audio files named after the character, each containing the voice of a single character only. Background sound is
acceptable since it will be removed automatically. File name format: `{CharacterName}_{random_number}.wav`
(e.g. `Diana_234135.wav`, `MinatoAqua_234252.wav`); must be `.wav` files.
3. Long video files named after the character, each containing the voice of a single character only. Background sound is
acceptable since it will be removed automatically. File name format: `{CharacterName}_{random_number}.mp4`
(e.g. `Taffy_332452.mp4`, `Dingzhen_957315.mp4`); must be `.mp4` files.
Note: `CharacterName` must consist of English characters only; `random_number` distinguishes multiple files of the same character
and is compulsory. It can be any integer between 0 and 999999.
4. A `.txt` file containing multiple lines of `{CharacterName}|{video_url}`, formatted as follows:
```
Char1|https://xyz.com/video1/
Char2|https://xyz.com/video2/
Char2|https://xyz.com/video3/
Char3|https://xyz.com/video4/
```
One video should contain a single speaker only. Currently only video links from bilibili are supported; other websites are yet to be tested.
Have questions about the data format? Find data samples for all formats [here](https://drive.google.com/file/d/132l97zjanpoPY4daLgqXoM7HKXPRbS84/view?usp=sharing).
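Because the naming and list formats above are easy to get wrong, a quick local check before uploading can save a wasted run. The following is a minimal sketch (not part of the repository's pipeline; the `./raw_audio` folder and `video_list.txt` path are placeholders) that checks long audio/video file names against the `{CharacterName}_{random_number}` convention and the lines of a `{CharacterName}|{video_url}` list:
```
import re
from pathlib import Path

# {CharacterName}_{random_number}.wav / .mp4: English letters, then an integer
# between 0 and 999999 (per the rules above).
NAME_RE = re.compile(r"^[A-Za-z]+_\d{1,6}\.(wav|mp4)$")

def check_media_names(folder):
    """Report long audio/video files whose names break the convention."""
    for p in Path(folder).iterdir():
        if p.suffix.lower() in {".wav", ".mp4"} and not NAME_RE.match(p.name):
            print(f"bad name: {p.name}")

def check_video_list(txt_path):
    """Report lines that are not in CharacterName|video_url form."""
    for i, line in enumerate(Path(txt_path).read_text(encoding="utf-8").splitlines(), 1):
        line = line.strip()
        if not line:
            continue
        parts = line.split("|")
        if len(parts) != 2 or not parts[0] or not parts[1].startswith("http"):
            print(f"line {i} is malformed: {line}")

if __name__ == "__main__":
    check_media_names("./raw_audio")      # placeholder folder
    check_video_list("./video_list.txt")  # placeholder file
```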
+118
@@ -0,0 +1,118 @@
# Train locally
### Build environment
0. Make sure you have installed `Python==3.8`, CMake & C/C++ compilers, ffmpeg;
1. Clone this repository;
2. Run `pip install -r requirements.txt`;
3. Install the GPU version of PyTorch (make sure you have CUDA 11.6 or 11.7 installed):
```
# CUDA 11.6
pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
# CUDA 11.7
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
```
4. Install the necessary libraries for handling video data:
```
pip install imageio==2.4.1
pip install moviepy
```
5. Build monotonic align (necessary for training)
```
cd monotonic_align
mkdir monotonic_align
python setup.py build_ext --inplace
cd ..
```
6. Download auxiliary data for training
```
mkdir pretrained_models
# download data for fine-tuning
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/sampled_audio4ft_v2.zip
unzip sampled_audio4ft_v2.zip
# create necessary directories
mkdir video_data
mkdir raw_audio
mkdir denoised_audio
mkdir custom_character_voice
mkdir segmented_character_voice
```
7. Download a pretrained model; the available options are:
```
CJE: Trilingual (Chinese, Japanese, English)
CJ: Bilingual (Chinese, Japanese)
C: Chinese only
```
### Linux
To download `CJE` model, run the following:
```
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/D_trilingual.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/pretrained_models/G_trilingual.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer/resolve/main/configs/uma_trilingual.json -O ./configs/finetune_speaker.json
```
To download `CJ` model, run the following:
```
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/D_0-p.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/G_0-p.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai/resolve/main/model/config.json -O ./configs/finetune_speaker.json
```
To download `C` model, run the following:
```
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/D_0.pth -O ./pretrained_models/D_0.pth
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/G_0.pth -O ./pretrained_models/G_0.pth
wget https://huggingface.co/datasets/Plachta/sampled_audio4ft/resolve/main/VITS-Chinese/config.json -O ./configs/finetune_speaker.json
```
### Windows
Manually download `G_0.pth`, `D_0.pth`, `finetune_speaker.json` from the URLs in one of the options described above.
Rename all `G` models to `G_0.pth`, `D` models to `D_0.pth`, config files (`.json`) to `finetune_speaker.json`.
Put `G_0.pth` and `D_0.pth` under the `pretrained_models` directory;
Put `finetune_speaker.json` under the `configs` directory.
#### Please note that downloading a new model will overwrite the previously downloaded one.
8. Put your voice data under the corresponding directories; see [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD) for details on the different uploading options.
### Short audios
1. Prepare your data according to [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD) as a single `.zip` file;
2. Put your file under directory `./custom_character_voice/`;
3. run `unzip ./custom_character_voice/custom_character_voice.zip -d ./custom_character_voice/`
### Long audios
1. Name your audio files according to [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD);
2. Put your renamed audio files under directory `./raw_audio/`
### Videos
1. Name your video files according to [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD);
2. Put your renamed video files under directory `./video_data/`
9. Process all audio data.
```
python scripts/video2audio.py
python scripts/denoise_audio.py
python scripts/long_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/short_audio_transcribe.py --languages "{PRETRAINED_MODEL}" --whisper_size large
python scripts/resample.py
```
Replace `"{PRETRAINED_MODEL}"` with one of `{CJ, CJE, C}` according to your previous model choice.
Make sure you have at least 12GB of GPU memory. If not, change the `whisper_size` argument to `medium` or `small`.
10. Process all text data.
If you choose to add auxiliary data, run `python preprocess_v2.py --add_auxiliary_data True --languages "{PRETRAINED_MODEL}"`
If not, run `python preprocess_v2.py --languages "{PRETRAINED_MODEL}"`
Be sure to replace `"{PRETRAINED_MODEL}"` with one of `{CJ, CJE, C}` according to your previous model choice.
11. Start Training.
Run `python finetune_speaker_v2.py -m ./OUTPUT_MODEL --max_epochs "{Maximum_epochs}" --drop_speaker_embed True`
Be sure to replace `{Maximum_epochs}` with your desired number of epochs. Empirically, 100 or more is recommended.
To continue training from a previous checkpoint, change the training command to: `python finetune_speaker_v2.py -m ./OUTPUT_MODEL --max_epochs "{Maximum_epochs}" --drop_speaker_embed False --cont True`. Before doing this, make sure you have the previous `G_latest.pth` and `D_latest.pth` under the `./OUTPUT_MODEL/` directory.
To view training progress, open a new terminal and `cd` to the project root directory, run `tensorboard --logdir=./OUTPUT_MODEL`, then visit `localhost:6006` with your web browser.
12. After training is completed, you can use your model by running:
`python VC_inference.py --model_dir ./OUTPUT_MODEL/G_latest.pth --share True`
13. To clear all audio data, run:
### Linux
```
rm -rf ./custom_character_voice/* ./video_data/* ./raw_audio/* ./denoised_audio/* ./segmented_character_voice/* ./separated/* long_character_anno.txt short_character_anno.txt
```
### Windows
```
del /Q /S .\custom_character_voice\* .\video_data\* .\raw_audio\* .\denoised_audio\* .\segmented_character_voice\* .\separated\* long_character_anno.txt short_character_anno.txt
```
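Before step 9 (audio processing), it can also help to confirm that the unzipped short-audio data actually matches the requirements from [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD): per-character folders, at least 10 clips, 2-10 s each. The snippet below is a sanity-check sketch, not part of the repository; it assumes `soundfile` is installed and only inspects `.wav` files under `./custom_character_voice/`:
```
from pathlib import Path
import soundfile as sf  # assumed to be installed; used only to read durations

ROOT = Path("./custom_character_voice")

for char_dir in sorted(p for p in ROOT.iterdir() if p.is_dir()):
    wavs = list(char_dir.glob("*.wav"))
    too_short = too_long = 0
    for wav in wavs:
        try:
            duration = sf.info(str(wav)).duration  # length in seconds
        except RuntimeError:
            print(f"unreadable file: {wav}")
            continue
        too_short += duration < 2.0
        too_long += duration > 10.0
    print(f"{char_dir.name}: {len(wavs)} wav files "
          f"({too_short} shorter than 2s, {too_long} longer than 10s)")
    if len(wavs) < 10:
        print(f"  warning: fewer than 10 clips for {char_dir.name}")
```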
+50 -28
@@ -1,40 +1,62 @@
[中文文档请点击这里](https://github.com/SongtingLiu/VITS_voice_conversion/blob/main/README_CN.md)
# VITS Voice Conversion
This repo will guide you to add your voice into an existing VITS TTS model
to make it a high-quality voice converter to all existing character voices in the model.
[中文文档请点击这里](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/README_ZH.md)
# VITS Fast Fine-tuning
This repo will guide you to add your own character voices, or even your own voice, into existing VITS TTS model
to make it able to do the following tasks in less than 1 hour:
1. Many-to-many voice conversion between any characters you added & preset characters in the model.
2. English, Japanese & Chinese Text-to-Speech synthesis with the characters you added & preset characters
Welcome to play around with the base models!
Chinese & English & Japanese: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer) Author: Me
Chinese & Japanese: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai) Author: [SayaSS](https://github.com/SayaSS)
Chinese only (no running Hugging Face Space) Author: [Wwwwhy230825](https://github.com/Wwwwhy230825)
Welcome to play around with the base model, a Trilingual Anime VITS!
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer)
### Currently Supported Tasks:
- [x] Convert user's voice to characters listed [here](https://github.com/SongtingLiu/VITS_voice_conversion/blob/main/configs/finetune_speaker.json)
- [x] Chinese, English, Japanese TTS with user's voice
- [ ] Chinese, English, Japanese TTS with custom characters...
- [x] Clone character voice from 10+ short audios
- [x] Clone character voice from long audio(s) >= 3 minutes (each audio should contain a single speaker only)
- [x] Clone character voice from video(s) >= 3 minutes (each video should contain a single speaker only)
- [x] Clone character voice from BILIBILI video links (each video should contain a single speaker only)
### Currently Supported Characters for TTS & VC:
- [x] Umamusume Pretty Derby
- [x] Sanoba Witch
- [x] Genshin Impact
- [ ] Custom characters...
- [x] Any character you wish as long as you have their voices!
(Note that voice conversion can only be conducted between any two speakers in the model)
## Fine-tuning
It's recommended to perform fine-tuning on [Google Colab](https://colab.research.google.com/drive/1omMhfYKrAAQ7a6zOCsyqpla-wU-QyfZn?usp=sharing)
because the original VITS has some dependencies that are difficult to configure.
See [LOCAL.md](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/LOCAL.md) for local training guide.
Alternatively, you can perform fine-tuning on [Google Colab](https://colab.research.google.com/drive/1pn1xnFfdLK63gVXDwV4zCXfVeo8c-I-0?usp=sharing)
### How long does it take?
1. Install dependencies (2 min)
2. Record at least 10 clips of your own voice (5 min)
3. Fine-tune (30 min)
## Inference or Usage
### How long does it take?
1. Install dependencies (3 min)
2. Choose pretrained model to start. The detailed differences between them are described in [Colab Notebook](https://colab.research.google.com/drive/1pn1xnFfdLK63gVXDwV4zCXfVeo8c-I-0?usp=sharing)
3. Upload the voice samples of the characters you wish to add; see [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA_EN.MD) for detailed uploading options.
4. Start fine-tuning. The time taken varies from 20 minutes to 2 hours, depending on the number of voices you uploaded.
1. Install Python if you haven't done so (Python >= 3.7)
2. Clone this repo:
`git clone https://github.com/SongtingLiu/VITS_voice_conversion.git`
3. Install dependencies
`pip install -r requirements_infer.txt`
4. run VC_inference.py
`python VC_inference.py`
## Inference or Usage (currently supports Windows only)
0. Remember to download your fine-tuned model!
1. Download the latest release
2. Put your model & config file into the folder `inference`, which are named `G_latest.pth` and `finetune_speaker.json`, respectively.
3. The file structure should be as follows:
```
inference
├───inference.exe
├───...
├───finetune_speaker.json
└───G_latest.pth
```
4. Run `inference.exe`; the browser should pop up automatically.
5. Note: you must install `ffmpeg` to enable the voice conversion feature.
## Use in MoeGoe
0. Prepare downloaded model & config file, which are named `G_latest.pth` and `moegoe_config.json`, respectively.
1. Follow the instructions on the [MoeGoe](https://github.com/CjangCjengh/MoeGoe) page to install it, configure the paths, and use it.
## Looking for help?
If you have any questions, please feel free to open an [issue](https://github.com/Plachtaa/VITS-fast-fine-tuning/issues/new) or join our [Discord](https://discord.gg/TcrjDFvm5A) server.
-39
@@ -1,39 +0,0 @@
# VITS Voice Conversion
This repository guides you through fine-tuning your own voice into an existing VITS model, so that a single model can perform high-quality conversion from the user's voice to hundreds of character voices.
Welcome to try out the base model used for fine-tuning, a trilingual (Chinese, Japanese, English) TTS (text-to-speech) model!
[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer)
### Currently supported tasks:
- [x] Convert the user's voice to [these characters](https://github.com/SongtingLiu/VITS_voice_conversion/blob/main/configs/finetune_speaker.json)
- [ ] Trilingual (Chinese, Japanese, English) TTS with custom characters (TODO)
### Characters currently supported for voice conversion and trilingual TTS
- [x] Umamusume Pretty Derby (implemented characters only)
- [x] Sanoba Witch (Yuzusoft, 5 characters)
- [x] Genshin Impact (implemented characters only)
- [ ] Any character (TODO)
## Fine-tuning
It is recommended to perform fine-tuning on [Google Colab](https://colab.research.google.com/drive/1omMhfYKrAAQ7a6zOCsyqpla-wU-QyfZn?usp=sharing)
because some of VITS's multilingual dependencies are quite difficult to configure.
### How long does it take on Google Colab?
1. Install dependencies (2 min)
2. Record your own voice, at least 20 short clips of 3-4 seconds each (5 min)
3. Fine-tune (30 min)
After fine-tuning, you can download the fine-tuned model directly and run it locally later (no GPU required).
## Run and infer locally
## 本地运行和推理
1. Install Python if you haven't done so (Python >= 3.7)
2. Clone this repo:
`git clone https://github.com/SongtingLiu/VITS_voice_conversion.git`
3. Install dependencies
`pip install -r requirements_infer.txt`
4. run VC_inference.py
`python VC_inference.py`
+66
@@ -0,0 +1,66 @@
For the English documentation, please click [here](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/README.md)
# VITS Fast Fine-tuning
This repository guides you through adding custom characters (or even your own voice) to a pretrained VITS model; after less than 1 hour of fine-tuning, the model can:
1. Perform voice conversion between any two characters contained in the model
2. Perform trilingual (Chinese, Japanese, English) text-to-speech synthesis with the character voices you added.
The base models used in this project cover common anime male/female voice-acting timbres (from the Genshin Impact dataset) as well as common real-world male/female voices (from the VCTK dataset); they support Chinese, Japanese and English, so they adapt quickly to new voices during fine-tuning.
Welcome to try out the base models used for fine-tuning!
Chinese, Japanese & English: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Plachta/VITS-Umamusume-voice-synthesizer) Author: me
Chinese & Japanese: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/sayashi/vits-uma-genshin-honkai) Author: [SayaSS](https://github.com/SayaSS)
Chinese only: (no Hugging Face demo) Author: [Wwwwhy230825](https://github.com/Wwwwhy230825)
### Currently supported tasks:
- [x] Clone a character voice from 10+ short audio clips
- [x] Clone a character voice from long audio(s) of 3+ minutes (each audio file must contain a single speaker only)
- [x] Clone a character voice from video(s) of 3+ minutes (each video must contain a single speaker only)
- [x] Clone a character voice from bilibili video links (each video must contain a single speaker only)
### Characters currently supported for voice conversion and trilingual TTS
- [x] Any character, as long as you have voice samples for them!
(Note: voice conversion can only be performed between any two speakers that exist in the model)
## Fine-tuning
If you want to train on your local machine, see [LOCAL.md](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/LOCAL.md).
Alternatively, you can perform fine-tuning on [Google Colab](https://colab.research.google.com/drive/1pn1xnFfdLK63gVXDwV4zCXfVeo8c-I-0?usp=sharing).
### How long does it take?
1. Install dependencies (10 min on Google Colab)
2. Choose a pretrained model to start from; the detailed differences between them are described in the [Colab notebook](https://colab.research.google.com/drive/1pn1xnFfdLK63gVXDwV4zCXfVeo8c-I-0?usp=sharing).
3. Upload the voice samples of the characters you wish to add; see [DATA.MD](https://github.com/Plachtaa/VITS-fast-fine-tuning/blob/main/DATA.MD) for detailed uploading options.
4. Start fine-tuning. Depending on the fine-tuning options and the amount of data, this may take anywhere from 20 minutes to 2 hours.
After fine-tuning, you can download the fine-tuned model directly and run it locally later (no GPU required).
## Run and infer locally
0. Remember to download your fine-tuned model and config file!
1. Download the latest release (on the right side of the GitHub page)
2. Put the downloaded model and config file into the `inference` folder, named `G_latest.pth` and `finetune_speaker.json` respectively.
3. Once everything is ready, the file structure should look like this:
```
inference
├───inference.exe
├───...
├───finetune_speaker.json
└───G_latest.pth
```
4. Run `inference.exe`; the browser should pop up automatically. Note that its path must not contain Chinese characters or spaces.
5. Please note that the voice conversion feature requires `ffmpeg` to be installed.
## Use in MoeGoe
0. MoeGoe and other similar VITS inference UIs use a slightly different config format; the files you need to download are the model `G_latest.pth` and the config file `moegoe_config.json`.
1. Follow the instructions on the [MoeGoe](https://github.com/CjangCjengh/MoeGoe) page to configure the paths, and it is ready to use.
2. When entering sentences in MoeGoe, wrap them with the corresponding language tags for synthesis to work ([JA] for Japanese, [ZH] for Chinese, [EN] for English), for example:
[JA]こんにちわ。[JA]
[ZH]你好![ZH]
[EN]Hello![EN]
## Getting help
If you run into any problems, feel free to open an issue [here](https://github.com/Plachtaa/VITS-fast-fine-tuning/issues/new) or join our Discord server for help: [Discord](https://discord.gg/TcrjDFvm5A).
+82 -16
@@ -3,13 +3,57 @@ import numpy as np
import torch
from torch import no_grad, LongTensor
import argparse
import commons
from mel_processing import spectrogram_torch
import utils
from models_infer import SynthesizerTrn
from models import SynthesizerTrn
import gradio as gr
import librosa
import webbrowser
from text import text_to_sequence, _clean_text
device = "cuda:0" if torch.cuda.is_available() else "cpu"
import logging
logging.getLogger("PIL").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
logging.getLogger("markdown_it").setLevel(logging.WARNING)
logging.getLogger("httpx").setLevel(logging.WARNING)
logging.getLogger("asyncio").setLevel(logging.WARNING)
language_marks = {
"Japanese": "",
"日本語": "[JA]",
"简体中文": "[ZH]",
"English": "[EN]",
"Mix": "",
}
lang = ['日本語', '简体中文', 'English', 'Mix']
def get_text(text, hps, is_symbol):
text_norm = text_to_sequence(text, hps.symbols, [] if is_symbol else hps.data.text_cleaners)
if hps.data.add_blank:
text_norm = commons.intersperse(text_norm, 0)
text_norm = LongTensor(text_norm)
return text_norm
def create_tts_fn(model, hps, speaker_ids):
def tts_fn(text, speaker, language, speed):
if language is not None:
text = language_marks[language] + text + language_marks[language]
speaker_id = speaker_ids[speaker]
stn_tst = get_text(text, hps, False)
with no_grad():
x_tst = stn_tst.unsqueeze(0).to(device)
x_tst_lengths = LongTensor([stn_tst.size(0)]).to(device)
sid = LongTensor([speaker_id]).to(device)
audio = model.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667, noise_scale_w=0.8,
length_scale=1.0 / speed)[0][0, 0].data.cpu().float().numpy()
del stn_tst, x_tst, x_tst_lengths, sid
return "Success", (hps.data.sampling_rate, audio)
return tts_fn
def create_vc_fn(model, hps, speaker_ids):
def vc_fn(original_speaker, target_speaker, record_audio, upload_audio):
@@ -42,6 +86,8 @@ def create_vc_fn(model, hps, speaker_ids):
return "Success", (hps.data.sampling_rate, audio)
return vc_fn
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", default="./G_latest.pth", help="directory to your fine-tuned model")
@@ -63,23 +109,43 @@ if __name__ == "__main__":
_ = utils.load_checkpoint(args.model_dir, net_g, None)
speaker_ids = hps.speakers
speakers = list(hps.speakers.keys())
tts_fn = create_tts_fn(net_g, hps, speaker_ids)
vc_fn = create_vc_fn(net_g, hps, speaker_ids)
app = gr.Blocks()
with app:
gr.Markdown("""
录制或上传声音,并选择要转换的音色。User代表的音色是你自己。
""")
with gr.Column():
record_audio = gr.Audio(label="record your voice", source="microphone")
upload_audio = gr.Audio(label="or upload audio here", source="upload")
source_speaker = gr.Dropdown(choices=speakers, value="User", label="source speaker")
target_speaker = gr.Dropdown(choices=speakers, value=speakers[0], label="target speaker")
with gr.Column():
message_box = gr.Textbox(label="Message")
converted_audio = gr.Audio(label='converted audio')
btn = gr.Button("Convert!")
btn.click(vc_fn, inputs=[source_speaker, target_speaker, record_audio, upload_audio],
outputs=[message_box, converted_audio])
with gr.Tab("Text-to-Speech"):
with gr.Row():
with gr.Column():
textbox = gr.TextArea(label="Text",
placeholder="Type your sentence here",
value="こんにちわ。", elem_id=f"tts-input")
# select character
char_dropdown = gr.Dropdown(choices=speakers, value=speakers[0], label='character')
language_dropdown = gr.Dropdown(choices=lang, value=lang[0], label='language')
duration_slider = gr.Slider(minimum=0.1, maximum=5, value=1, step=0.1,
label='速度 Speed')
with gr.Column():
text_output = gr.Textbox(label="Message")
audio_output = gr.Audio(label="Output Audio", elem_id="tts-audio")
btn = gr.Button("Generate!")
btn.click(tts_fn,
inputs=[textbox, char_dropdown, language_dropdown, duration_slider,],
outputs=[text_output, audio_output])
with gr.Tab("Voice Conversion"):
gr.Markdown("""
录制或上传声音,并选择要转换的音色。
""")
with gr.Column():
record_audio = gr.Audio(label="record your voice", source="microphone")
upload_audio = gr.Audio(label="or upload audio here", source="upload")
source_speaker = gr.Dropdown(choices=speakers, value=speakers[0], label="source speaker")
target_speaker = gr.Dropdown(choices=speakers, value=speakers[0], label="target speaker")
with gr.Column():
message_box = gr.Textbox(label="Message")
converted_audio = gr.Audio(label='converted audio')
btn = gr.Button("Convert!")
btn.click(vc_fn, inputs=[source_speaker, target_speaker, record_audio, upload_audio],
outputs=[message_box, converted_audio])
webbrowser.open("http://127.0.0.1:7860")
app.launch(share=args.share)
+132
@@ -0,0 +1,132 @@
import argparse
import io
import json
from json import JSONDecodeError
from pathlib import Path
from urllib.parse import parse_qs
import soundfile as sf
import torch
import uvicorn
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from hypy_utils.logging_utils import setup_logger
from starlette.middleware.cors import CORSMiddleware
from torch import no_grad, LongTensor
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
log = setup_logger()
app = FastAPI()
device = "cuda:0" if torch.cuda.is_available() else "cpu"
language_marks = {
"日本語": "[JA]",
"简体中文": "[ZH]",
"English": "[EN]",
"Mix": "",
}
# Allow all CORS
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
def get_text(text: str, is_symbol: bool):
text_norm = text_to_sequence(text, hps.symbols, [] if is_symbol else hps.data.text_cleaners)
if hps.data.add_blank:
text_norm = commons.intersperse(text_norm, 0)
text_norm = LongTensor(text_norm)
return text_norm
def tts_fn(text: str, speaker: str, language: str, speed: float):
if language is not None:
text = language_marks[language] + text + language_marks[language]
speaker_id = speaker_ids[speaker]
stn_tst = get_text(text, False)
with no_grad():
x_tst = stn_tst.unsqueeze(0).to(device)
x_tst_lengths = LongTensor([stn_tst.size(0)]).to(device)
sid = LongTensor([speaker_id]).to(device)
audio = model.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667, noise_scale_w=0.8,
length_scale=1.0 / speed)[0][0, 0].data.cpu().float().numpy()
del stn_tst, x_tst, x_tst_lengths, sid
return audio
@app.get("/tts/options")
async def get_options():
return {"speakers": list(speaker_ids.keys()), "languages": list(language_marks.keys())}
@app.post("/tts")
async def generate(request: Request):
body = (await request.body()).decode()
# Try parse json
if body.startswith('{'):
try:
data = json.loads(body)
except JSONDecodeError as e:
raise HTTPException(status_code=400, detail="Invalid JSON format")
# Try parse x-www-form-urlencoded
else:
data = parse_qs(body)
data = {k: v[0] for k, v in data.items()}
log.info(data)
text = data.get('text').strip().replace("\n", " ")
speaker = data.get('speaker')
language = data.get('language', '日本語')
speed = data.get('speed', 1.0)
if len(text) > 200:
raise HTTPException(status_code=400, detail="TL;DR")
if not text or not speaker or language not in language_marks:
raise HTTPException(status_code=400, detail="Invalid speaker or language (please check /tts/options)")
audio = tts_fn(text, speaker, language, speed)
audio_io = io.BytesIO()
# sf.write(audio_io, audio, hps.data.sampling_rate, format='OGG')
# Since Safari doesn't support ogg, use mp3 instead
sf.write(audio_io, audio, hps.data.sampling_rate, format='MP3')
audio_io.seek(0)
return StreamingResponse(audio_io, media_type='audio/mpeg',
headers={'Content-Disposition': 'attachment; filename="output.mp3"'})
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("-d", default="./OUTPUT_MODEL",
help="directory to your fine-tuned model (contains G_latest.pth and config.json)")
args = parser.parse_args()
d_config = Path(args.d) / "config.json"
d_model = Path(args.d) / "G_latest.pth"
hps = utils.get_hparams_from_file(d_config)
model = SynthesizerTrn(
len(hps.symbols),
hps.data.filter_length // 2 + 1,
hps.train.segment_size // hps.data.hop_length,
n_speakers=hps.data.n_speakers,
**hps.model).to(device)
_ = model.eval()
utils.load_checkpoint(d_model, model, None)
speaker_ids = hps.speakers
uvicorn.run(app, host='0.0.0.0', port=27519)
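A minimal client sketch for the inference API above, assuming the server is running locally on its default port 27519; the requests package, the output filename, and the speaker choice are illustrative, and the speaker should be one of the names returned by /tts/options.

import requests

BASE = "http://127.0.0.1:27519"  # host/port from the uvicorn.run call above

# Discover the available speakers and languages first
options = requests.get(f"{BASE}/tts/options").json()
speaker = options["speakers"][0]  # placeholder: any speaker returned by the endpoint

# /tts accepts a JSON body (or x-www-form-urlencoded) with text, speaker, language and speed;
# text longer than 200 characters is rejected by the handler above
payload = {"text": "こんにちは", "speaker": speaker, "language": "日本語", "speed": 1.0}
resp = requests.post(f"{BASE}/tts", json=payload)
resp.raise_for_status()

# The server streams back MP3 audio
with open("output.mp3", "wb") as f:
    f.write(resp.content)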
+106
View File
@@ -0,0 +1,106 @@
"""该模块用于生成VITS文件
使用方法
python cmd_inference.py -m 模型路径 -c 配置文件路径 -o 输出文件路径 -l 输入的语言 -t 输入文本 -s 合成目标说话人名称
可选参数
-ns 感情变化程度
-nsw 音素发音长度
-ls 整体语速
-on 输出文件的名称
"""
from pathlib import Path
import utils
from models import SynthesizerTrn
import torch
from torch import no_grad, LongTensor
import librosa
from text import text_to_sequence, _clean_text
import commons
import scipy.io.wavfile as wavf
import os
device = "cuda:0" if torch.cuda.is_available() else "cpu"
language_marks = {
"Japanese": "",
"日本語": "[JA]",
"简体中文": "[ZH]",
"English": "[EN]",
"Mix": "",
}
def get_text(text, hps, is_symbol):
text_norm = text_to_sequence(text, hps.symbols, [] if is_symbol else hps.data.text_cleaners)
if hps.data.add_blank:
text_norm = commons.intersperse(text_norm, 0)
text_norm = LongTensor(text_norm)
return text_norm
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description='vits inference')
# Required arguments
parser.add_argument('-m', '--model_path', type=str, default="logs/44k/G_0.pth", help='model path')
parser.add_argument('-c', '--config_path', type=str, default="configs/config.json", help='config file path')
parser.add_argument('-o', '--output_path', type=str, default="output/vits", help='output directory')
parser.add_argument('-l', '--language', type=str, default="日本語", help='input language')
parser.add_argument('-t', '--text', type=str, help='input text')
parser.add_argument('-s', '--spk', type=str, help='target speaker name')
# Optional arguments
parser.add_argument('-on', '--output_name', type=str, default="output", help='output file name')
parser.add_argument('-ns', '--noise_scale', type=float, default=0.667, help='degree of emotional variation (noise scale)')
parser.add_argument('-nsw', '--noise_scale_w', type=float, default=0.6, help='phoneme duration variation (noise scale w)')
parser.add_argument('-ls', '--length_scale', type=float, default=1, help='overall speed (length scale)')
args = parser.parse_args()
model_path = args.model_path
config_path = args.config_path
output_dir = Path(args.output_path)
output_dir.mkdir(parents=True, exist_ok=True)
language = args.language
text = args.text
spk = args.spk
noise_scale = args.noise_scale
noise_scale_w = args.noise_scale_w
length = args.length_scale
output_name = args.output_name
hps = utils.get_hparams_from_file(config_path)
net_g = SynthesizerTrn(
len(hps.symbols),
hps.data.filter_length // 2 + 1,
hps.train.segment_size // hps.data.hop_length,
n_speakers=hps.data.n_speakers,
**hps.model).to(device)
_ = net_g.eval()
_ = utils.load_checkpoint(model_path, net_g, None)
speaker_ids = hps.speakers
if language is not None:
text = language_marks[language] + text + language_marks[language]
speaker_id = speaker_ids[spk]
stn_tst = get_text(text, hps, False)
with no_grad():
x_tst = stn_tst.unsqueeze(0).to(device)
x_tst_lengths = LongTensor([stn_tst.size(0)]).to(device)
sid = LongTensor([speaker_id]).to(device)
audio = net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=noise_scale, noise_scale_w=noise_scale_w,
length_scale=1.0 / length)[0][0, 0].data.cpu().float().numpy()
del stn_tst, x_tst, x_tst_lengths, sid
wavf.write(str(output_dir) + "/" + output_name + ".wav", hps.data.sampling_rate, audio)
-204
View File
@@ -1,204 +0,0 @@
{
"train": {
"log_interval": 100,
"eval_interval": 1000,
"seed": 1234,
"epochs": 10000,
"learning_rate": 2e-4,
"betas": [0.8, 0.99],
"eps": 1e-9,
"batch_size": 16,
"fp16_run": true,
"lr_decay": 0.999875,
"segment_size": 8192,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0
},
"data": {
"training_files":"final_annotation_train.txt",
"validation_files":"final_annotation_val.txt",
"text_cleaners":["cjke_cleaners2"],
"max_wav_value": 32768.0,
"sampling_rate": 22050,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"n_mel_channels": 80,
"mel_fmin": 0.0,
"mel_fmax": null,
"add_blank": true,
"n_speakers": 1001,
"cleaned_text": true
},
"model": {
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 6,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [3,7,11],
"resblock_dilation_sizes": [[1,3,5], [1,3,5], [1,3,5]],
"upsample_rates": [8,8,2,2],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [16,16,4,4],
"n_layers_q": 3,
"use_spectral_norm": false,
"gin_channels": 256
},
"symbols": ["_", ",", ".", "!", "?", "-", "~", "\u2026", "N", "Q", "a", "b", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "s", "t", "u", "v", "w", "x", "y", "z", "\u0251", "\u00e6", "\u0283", "\u0291", "\u00e7", "\u026f", "\u026a", "\u0254", "\u025b", "\u0279", "\u00f0", "\u0259", "\u026b", "\u0265", "\u0278", "\u028a", "\u027e", "\u0292", "\u03b8", "\u03b2", "\u014b", "\u0266", "\u207c", "\u02b0", "`", "^", "#", "*", "=", "\u02c8", "\u02cc", "\u2192", "\u2193", "\u2191", " "],
"speakers": {"特别周 Special Week (Umamusume Pretty Derby)": 0,
"无声铃鹿 Silence Suzuka (Umamusume Pretty Derby)": 1,
"东海帝王 Tokai Teio (Umamusume Pretty Derby)": 2,
"丸善斯基 Maruzensky (Umamusume Pretty Derby)": 3,
"富士奇迹 Fuji Kiseki (Umamusume Pretty Derby)": 4,
"小栗帽 Oguri Cap (Umamusume Pretty Derby)": 5,
"黄金船 Gold Ship (Umamusume Pretty Derby)": 6,
"伏特加 Vodka (Umamusume Pretty Derby)": 7,
"大和赤骥 Daiwa Scarlet (Umamusume Pretty Derby)": 8,
"大树快车 Taiki Shuttle (Umamusume Pretty Derby)": 9,
"草上飞 Grass Wonder (Umamusume Pretty Derby)": 10,
"菱亚马逊 Hishi Amazon (Umamusume Pretty Derby)": 11,
"目白麦昆 Mejiro Mcqueen (Umamusume Pretty Derby)": 12,
"神鹰 El Condor Pasa (Umamusume Pretty Derby)": 13,
"好歌剧 T.M. Opera O (Umamusume Pretty Derby)": 14,
"成田白仁 Narita Brian (Umamusume Pretty Derby)": 15,
"鲁道夫象征 Symboli Rudolf (Umamusume Pretty Derby)": 16,
"气槽 Air Groove (Umamusume Pretty Derby)": 17,
"爱丽数码 Agnes Digital (Umamusume Pretty Derby)": 18,
"青云天空 Seiun Sky (Umamusume Pretty Derby)": 19,
"玉藻十字 Tamamo Cross (Umamusume Pretty Derby)": 20,
"美妙姿势 Fine Motion (Umamusume Pretty Derby)": 21,
"琵琶晨光 Biwa Hayahide (Umamusume Pretty Derby)": 22,
"重炮 Mayano Topgun (Umamusume Pretty Derby)": 23,
"曼城茶座 Manhattan Cafe (Umamusume Pretty Derby)": 24,
"美普波旁 Mihono Bourbon (Umamusume Pretty Derby)": 25,
"目白雷恩 Mejiro Ryan (Umamusume Pretty Derby)": 26,
"雪之美人 Yukino Bijin (Umamusume Pretty Derby)": 28,
"米浴 Rice Shower (Umamusume Pretty Derby)": 29,
"艾尼斯风神 Ines Fujin (Umamusume Pretty Derby)": 30,
"爱丽速子 Agnes Tachyon (Umamusume Pretty Derby)": 31,
"爱慕织姬 Admire Vega (Umamusume Pretty Derby)": 32,
"稻荷一 Inari One (Umamusume Pretty Derby)": 33,
"胜利奖券 Winning Ticket (Umamusume Pretty Derby)": 34,
"空中神宫 Air Shakur (Umamusume Pretty Derby)": 35,
"荣进闪耀 Eishin Flash (Umamusume Pretty Derby)": 36,
"真机伶 Curren Chan (Umamusume Pretty Derby)": 37,
"川上公主 Kawakami Princess (Umamusume Pretty Derby)": 38,
"黄金城市 Gold City (Umamusume Pretty Derby)": 39,
"樱花进王 Sakura Bakushin O (Umamusume Pretty Derby)": 40,
"采珠 Seeking the Pearl (Umamusume Pretty Derby)": 41,
"新光风 Shinko Windy (Umamusume Pretty Derby)": 42,
"东商变革 Sweep Tosho (Umamusume Pretty Derby)": 43,
"超级小溪 Super Creek (Umamusume Pretty Derby)": 44,
"醒目飞鹰 Smart Falcon (Umamusume Pretty Derby)": 45,
"荒漠英雄 Zenno Rob Roy (Umamusume Pretty Derby)": 46,
"东瀛佐敦 Tosen Jordan (Umamusume Pretty Derby)": 47,
"中山庆典 Nakayama Festa (Umamusume Pretty Derby)": 48,
"成田大进 Narita Taishin (Umamusume Pretty Derby)": 49,
"西野花 Nishino Flower (Umamusume Pretty Derby)": 50,
"春乌拉拉 Haru Urara (Umamusume Pretty Derby)": 51,
"青竹回忆 Bamboo Memory (Umamusume Pretty Derby)": 52,
"待兼福来 Matikane Fukukitaru (Umamusume Pretty Derby)": 55,
"名将怒涛 Meisho Doto (Umamusume Pretty Derby)": 57,
"目白多伯 Mejiro Dober (Umamusume Pretty Derby)": 58,
"优秀素质 Nice Nature (Umamusume Pretty Derby)": 59,
"帝王光环 King Halo (Umamusume Pretty Derby)": 60,
"待兼诗歌剧 Matikane Tannhauser (Umamusume Pretty Derby)": 61,
"生野狄杜斯 Ikuno Dictus (Umamusume Pretty Derby)": 62,
"目白善信 Mejiro Palmer (Umamusume Pretty Derby)": 63,
"大拓太阳神 Daitaku Helios (Umamusume Pretty Derby)": 64,
"双涡轮 Twin Turbo (Umamusume Pretty Derby)": 65,
"里见光钻 Satono Diamond (Umamusume Pretty Derby)": 66,
"北部玄驹 Kitasan Black (Umamusume Pretty Derby)": 67,
"樱花千代王 Sakura Chiyono O (Umamusume Pretty Derby)": 68,
"天狼星象征 Sirius Symboli (Umamusume Pretty Derby)": 69,
"目白阿尔丹 Mejiro Ardan (Umamusume Pretty Derby)": 70,
"八重无敌 Yaeno Muteki (Umamusume Pretty Derby)": 71,
"鹤丸刚志 Tsurumaru Tsuyoshi (Umamusume Pretty Derby)": 72,
"目白光明 Mejiro Bright (Umamusume Pretty Derby)": 73,
"樱花桂冠 Sakura Laurel (Umamusume Pretty Derby)": 74,
"成田路 Narita Top Road (Umamusume Pretty Derby)": 75,
"也文摄辉 Yamanin Zephyr (Umamusume Pretty Derby)": 76,
"真弓快车 Aston Machan (Umamusume Pretty Derby)": 80,
"骏川手纲 Hayakawa Tazuna (Umamusume Pretty Derby)": 81,
"小林历奇 Kopano Rickey (Umamusume Pretty Derby)": 83,
"奇锐骏 Wonder Acute (Umamusume Pretty Derby)": 85,
"秋川理事长 President Akikawa (Umamusume Pretty Derby)": 86,
"綾地 寧々 Ayachi Nene (Sanoba Witch)": 87,
"因幡 めぐる Inaba Meguru (Sanoba Witch)": 88,
"椎葉 紬 Shiiba Tsumugi (Sanoba Witch)": 89,
"仮屋 和奏 Kariya Wakama (Sanoba Witch)": 90,
"戸隠 憧子 Togakushi Touko (Sanoba Witch)": 91,
"九条裟罗 Kujou Sara (Genshin Impact)": 92,
"芭芭拉 Barbara (Genshin Impact)": 93,
"派蒙 Paimon (Genshin Impact)": 94,
"荒泷一斗 Arataki Itto (Genshin Impact)": 96,
"早柚 Sayu (Genshin Impact)": 97,
"香菱 Xiangling (Genshin Impact)": 98,
"神里绫华 Kamisato Ayaka (Genshin Impact)": 99,
"重云 Chongyun (Genshin Impact)": 100,
"流浪者 Wanderer (Genshin Impact)": 102,
"优菈 Eula (Genshin Impact)": 103,
"凝光 Ningguang (Genshin Impact)": 105,
"钟离 Zhongli (Genshin Impact)": 106,
"雷电将军 Raiden Shogun (Genshin Impact)": 107,
"枫原万叶 Kaedehara Kazuha (Genshin Impact)": 108,
"赛诺 Cyno (Genshin Impact)": 109,
"诺艾尔 Noelle (Genshin Impact)": 112,
"八重神子 Yae Miko (Genshin Impact)": 113,
"凯亚 Kaeya (Genshin Impact)": 114,
"魈 Xiao (Genshin Impact)": 115,
"托马 Thoma (Genshin Impact)": 116,
"可莉 Klee (Genshin Impact)": 117,
"迪卢克 Diluc (Genshin Impact)": 120,
"夜兰 Yelan (Genshin Impact)": 121,
"鹿野院平藏 Shikanoin Heizou (Genshin Impact)": 123,
"辛焱 Xinyan (Genshin Impact)": 124,
"丽莎 Lisa (Genshin Impact)": 125,
"云堇 Yun Jin (Genshin Impact)": 126,
"坎蒂丝 Candace (Genshin Impact)": 127,
"罗莎莉亚 Rosaria (Genshin Impact)": 128,
"北斗 Beidou (Genshin Impact)": 129,
"珊瑚宫心海 Sangonomiya Kokomi (Genshin Impact)": 132,
"烟绯 Yanfei (Genshin Impact)": 133,
"久岐忍 Kuki Shinobu (Genshin Impact)": 136,
"宵宫 Yoimiya (Genshin Impact)": 139,
"安柏 Amber (Genshin Impact)": 143,
"迪奥娜 Diona (Genshin Impact)": 144,
"班尼特 Bennett (Genshin Impact)": 146,
"雷泽 Razor (Genshin Impact)": 147,
"阿贝多 Albedo (Genshin Impact)": 151,
"温迪 Venti (Genshin Impact)": 152,
"空 Player Male (Genshin Impact)": 153,
"神里绫人 Kamisato Ayato (Genshin Impact)": 154,
"琴 Jean (Genshin Impact)": 155,
"艾尔海森 Alhaitham (Genshin Impact)": 156,
"莫娜 Mona (Genshin Impact)": 157,
"妮露 Nilou (Genshin Impact)": 159,
"胡桃 Hu Tao (Genshin Impact)": 160,
"甘雨 Ganyu (Genshin Impact)": 161,
"纳西妲 Nahida (Genshin Impact)": 162,
"刻晴 Keqing (Genshin Impact)": 165,
"荧 Player Female (Genshin Impact)": 169,
"埃洛伊 Aloy (Genshin Impact)": 179,
"柯莱 Collei (Genshin Impact)": 182,
"多莉 Dori (Genshin Impact)": 184,
"提纳里 Tighnari (Genshin Impact)": 186,
"砂糖 Sucrose (Genshin Impact)": 188,
"行秋 Xingqiu (Genshin Impact)": 190,
"奥兹 Oz (Genshin Impact)": 193,
"五郎 Gorou (Genshin Impact)": 198,
"达达利亚 Tartalia (Genshin Impact)": 202,
"七七 Qiqi (Genshin Impact)": 207,
"申鹤 Shenhe (Genshin Impact)": 217,
"莱依拉 Layla (Genshin Impact)": 228,
"菲谢尔 Fishl (Genshin Impact)": 230,
"User": 999
}
}
+172
View File
@@ -0,0 +1,172 @@
{
"train": {
"log_interval": 10,
"eval_interval": 100,
"seed": 1234,
"epochs": 10000,
"learning_rate": 0.0002,
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"batch_size": 16,
"fp16_run": true,
"lr_decay": 0.999875,
"segment_size": 8192,
"init_lr_ratio": 1,
"warmup_epochs": 0,
"c_mel": 45,
"c_kl": 1.0
},
"data": {
"training_files": "final_annotation_train.txt",
"validation_files": "final_annotation_val.txt",
"text_cleaners": [
"chinese_cleaners"
],
"max_wav_value": 32768.0,
"sampling_rate": 22050,
"filter_length": 1024,
"hop_length": 256,
"win_length": 1024,
"n_mel_channels": 80,
"mel_fmin": 0.0,
"mel_fmax": null,
"add_blank": true,
"n_speakers": 2,
"cleaned_text": true
},
"model": {
"inter_channels": 192,
"hidden_channels": 192,
"filter_channels": 768,
"n_heads": 2,
"n_layers": 6,
"kernel_size": 3,
"p_dropout": 0.1,
"resblock": "1",
"resblock_kernel_sizes": [
3,
7,
11
],
"resblock_dilation_sizes": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates": [
8,
8,
2,
2
],
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [
16,
16,
4,
4
],
"n_layers_q": 3,
"use_spectral_norm": false,
"gin_channels": 256
},
"symbols": [
"_",
"\uff1b",
"\uff1a",
"\uff0c",
"\u3002",
"\uff01",
"\uff1f",
"-",
"\u201c",
"\u201d",
"\u300a",
"\u300b",
"\u3001",
"\uff08",
"\uff09",
"\u2026",
"\u2014",
" ",
"A",
"B",
"C",
"D",
"E",
"F",
"G",
"H",
"I",
"J",
"K",
"L",
"M",
"N",
"O",
"P",
"Q",
"R",
"S",
"T",
"U",
"V",
"W",
"X",
"Y",
"Z",
"a",
"b",
"c",
"d",
"e",
"f",
"g",
"h",
"i",
"j",
"k",
"l",
"m",
"n",
"o",
"p",
"q",
"r",
"s",
"t",
"u",
"v",
"w",
"x",
"y",
"z",
"1",
"2",
"3",
"4",
"5",
"0",
"\uff22",
"\uff30"
],
"speakers": {
"dingzhen": 0,
"taffy": 1
}
}
+17 -147
View File
@@ -10,146 +10,6 @@ import commons
from mel_processing import spectrogram_torch
from utils import load_wav_to_torch, load_filepaths_and_text
from text import text_to_sequence, cleaned_text_to_sequence
class TextAudioLoader(torch.utils.data.Dataset):
"""
1) loads audio, text pairs
2) normalizes text and converts them to sequences of integers
3) computes spectrograms from audio files.
"""
def __init__(self, audiopaths_and_text, hparams):
self.audiopaths_and_text = load_filepaths_and_text(audiopaths_and_text)
self.text_cleaners = hparams.text_cleaners
self.max_wav_value = hparams.max_wav_value
self.sampling_rate = hparams.sampling_rate
self.filter_length = hparams.filter_length
self.hop_length = hparams.hop_length
self.win_length = hparams.win_length
self.sampling_rate = hparams.sampling_rate
self.cleaned_text = getattr(hparams, "cleaned_text", False)
self.add_blank = hparams.add_blank
self.min_text_len = getattr(hparams, "min_text_len", 1)
self.max_text_len = getattr(hparams, "max_text_len", 190)
random.seed(1234)
random.shuffle(self.audiopaths_and_text)
self._filter()
def _filter(self):
"""
Filter text & store spec lengths
"""
# Store spectrogram lengths for Bucketing
# wav_length ~= file_size / (wav_channels * Bytes per dim) = file_size / (1 * 2)
# spec_length = wav_length // hop_length
audiopaths_and_text_new = []
lengths = []
for audiopath, text in self.audiopaths_and_text:
if self.min_text_len <= len(text) and len(text) <= self.max_text_len:
audiopaths_and_text_new.append([audiopath, text])
lengths.append(os.path.getsize(audiopath) // (2 * self.hop_length))
self.audiopaths_and_text = audiopaths_and_text_new
self.lengths = lengths
def get_audio_text_pair(self, audiopath_and_text):
# separate filename and text
audiopath, text = audiopath_and_text[0], audiopath_and_text[1]
text = self.get_text(text)
spec, wav = self.get_audio(audiopath)
return (text, spec, wav)
def get_audio(self, filename):
audio, sampling_rate = load_wav_to_torch(filename)
if sampling_rate != self.sampling_rate:
raise ValueError("{} {} SR doesn't match target {} SR".format(
sampling_rate, self.sampling_rate))
audio_norm = audio / self.max_wav_value
audio_norm = audio_norm.unsqueeze(0)
spec_filename = filename.replace(".wav", ".spec.pt")
if os.path.exists(spec_filename):
spec = torch.load(spec_filename)
else:
spec = spectrogram_torch(audio_norm, self.filter_length,
self.sampling_rate, self.hop_length, self.win_length,
center=False)
spec = torch.squeeze(spec, 0)
torch.save(spec, spec_filename)
return spec, audio_norm
def get_text(self, text):
if self.cleaned_text:
text_norm = cleaned_text_to_sequence(text)
else:
text_norm = text_to_sequence(text, self.text_cleaners)
if self.add_blank:
text_norm = commons.intersperse(text_norm, 0)
text_norm = torch.LongTensor(text_norm)
return text_norm
def __getitem__(self, index):
return self.get_audio_text_pair(self.audiopaths_and_text[index])
def __len__(self):
return len(self.audiopaths_and_text)
class TextAudioCollate():
""" Zero-pads model inputs and targets
"""
def __init__(self, return_ids=False):
self.return_ids = return_ids
def __call__(self, batch):
"""Collate's training batch from normalized text and aduio
PARAMS
------
batch: [text_normalized, spec_normalized, wav_normalized]
"""
# Right zero-pad all one-hot text sequences to max input length
_, ids_sorted_decreasing = torch.sort(
torch.LongTensor([x[1].size(1) for x in batch]),
dim=0, descending=True)
max_text_len = max([len(x[0]) for x in batch])
max_spec_len = max([x[1].size(1) for x in batch])
max_wav_len = max([x[2].size(1) for x in batch])
text_lengths = torch.LongTensor(len(batch))
spec_lengths = torch.LongTensor(len(batch))
wav_lengths = torch.LongTensor(len(batch))
text_padded = torch.LongTensor(len(batch), max_text_len)
spec_padded = torch.FloatTensor(len(batch), batch[0][1].size(0), max_spec_len)
wav_padded = torch.FloatTensor(len(batch), 1, max_wav_len)
text_padded.zero_()
spec_padded.zero_()
wav_padded.zero_()
for i in range(len(ids_sorted_decreasing)):
row = batch[ids_sorted_decreasing[i]]
text = row[0]
text_padded[i, :text.size(0)] = text
text_lengths[i] = text.size(0)
spec = row[1]
spec_padded[i, :, :spec.size(1)] = spec
spec_lengths[i] = spec.size(1)
wav = row[2]
wav_padded[i, :, :wav.size(1)] = wav
wav_lengths[i] = wav.size(1)
if self.return_ids:
return text_padded, text_lengths, spec_padded, spec_lengths, wav_padded, wav_lengths, ids_sorted_decreasing
return text_padded, text_lengths, spec_padded, spec_lengths, wav_padded, wav_lengths
"""Multi speaker version"""
@@ -160,7 +20,7 @@ class TextAudioSpeakerLoader(torch.utils.data.Dataset):
3) computes spectrograms from audio files.
"""
def __init__(self, audiopaths_sid_text, hparams):
def __init__(self, audiopaths_sid_text, hparams, symbols):
self.audiopaths_sid_text = load_filepaths_and_text(audiopaths_sid_text)
self.text_cleaners = hparams.text_cleaners
self.max_wav_value = hparams.max_wav_value
@@ -175,6 +35,7 @@ class TextAudioSpeakerLoader(torch.utils.data.Dataset):
self.add_blank = hparams.add_blank
self.min_text_len = getattr(hparams, "min_text_len", 1)
self.max_text_len = getattr(hparams, "max_text_len", 190)
self.symbols = symbols
random.seed(1234)
random.shuffle(self.audiopaths_sid_text)
@@ -232,7 +93,7 @@ class TextAudioSpeakerLoader(torch.utils.data.Dataset):
def get_text(self, text):
if self.cleaned_text:
text_norm = cleaned_text_to_sequence(text)
text_norm = cleaned_text_to_sequence(text, self.symbols)
else:
text_norm = text_to_sequence(text, self.text_cleaners)
if self.add_blank:
@@ -334,10 +195,19 @@ class DistributedBucketSampler(torch.utils.data.distributed.DistributedSampler):
if idx_bucket != -1:
buckets[idx_bucket].append(i)
for i in range(len(buckets) - 1, 0, -1):
if len(buckets[i]) == 0:
buckets.pop(i)
self.boundaries.pop(i + 1)
try:
for i in range(len(buckets) - 1, 0, -1):
if len(buckets[i]) == 0:
buckets.pop(i)
self.boundaries.pop(i + 1)
assert all(len(bucket) > 0 for bucket in buckets)
# When one bucket is not traversed
except Exception as e:
print('Bucket warning ', e)
for i in range(len(buckets) - 1, -1, -1):
if len(buckets[i]) == 0:
buckets.pop(i)
self.boundaries.pop(i + 1)
num_samples_per_bucket = []
for i in range(len(buckets)):
@@ -403,4 +273,4 @@ class DistributedBucketSampler(torch.utils.data.distributed.DistributedSampler):
return -1
def __len__(self):
return self.num_samples // self.batch_size
return self.num_samples // self.batch_size
-23
View File
@@ -1,23 +0,0 @@
import os
import torch
import torchaudio
audio_dir = "./user_voice/"
wavfiles = []
for filename in list(os.walk(audio_dir))[0][2]:
if filename.endswith(".wav"):
wavfiles.append(filename)
# denoise with demucs
for i, wavfile in enumerate(wavfiles):
os.system(f"demucs --two-stems=vocals {audio_dir}{wavfile}")
# read & store the denoised vocals back
for wavfile in wavfiles:
i = wavfile.strip(".wav")
wav, sr = torchaudio.load(f"./separated/htdemucs/{i}/vocals.wav", frame_offset=0, num_frames=-1, normalize=True, channels_first=True)
# merge two channels into one
wav = wav.mean(dim=0).unsqueeze(0)
if sr != 22050:
wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=22050)(wav)
torchaudio.save(f"./user_voice/{i}.wav", wav, 22050, channels_first=True)
-3
View File
@@ -1,3 +0,0 @@
from google.colab import files
files.download("./OUTPUT_MODEL/G_latest.pth")
files.download("./OUTPUT_MODEL/config.json")
+73 -20
View File
@@ -65,11 +65,12 @@ def run(rank, n_gpus, hps):
writer = SummaryWriter(log_dir=hps.model_dir)
writer_eval = SummaryWriter(log_dir=os.path.join(hps.model_dir, "eval"))
dist.init_process_group(backend='nccl', init_method='env://', world_size=n_gpus, rank=rank)
# Use the gloo backend on Windows for PyTorch
dist.init_process_group(backend= 'gloo' if os.name == 'nt' else 'nccl', init_method='env://', world_size=n_gpus, rank=rank)
torch.manual_seed(hps.train.seed)
torch.cuda.set_device(rank)
train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps.data)
train_dataset = TextAudioSpeakerLoader(hps.data.training_files, hps.data, symbols)
train_sampler = DistributedBucketSampler(
train_dataset,
hps.train.batch_size,
@@ -78,12 +79,12 @@ def run(rank, n_gpus, hps):
rank=rank,
shuffle=True)
collate_fn = TextAudioSpeakerCollate()
train_loader = DataLoader(train_dataset, num_workers=0, shuffle=False, pin_memory=True,
train_loader = DataLoader(train_dataset, num_workers=2, shuffle=False, pin_memory=True,
collate_fn=collate_fn, batch_sampler=train_sampler)
# train_loader = DataLoader(train_dataset, batch_size=hps.train.batch_size, num_workers=0, shuffle=False, pin_memory=True,
# train_loader = DataLoader(train_dataset, batch_size=hps.train.batch_size, num_workers=2, shuffle=False, pin_memory=True,
# collate_fn=collate_fn)
if rank == 0:
eval_dataset = TextAudioSpeakerLoader(hps.data.validation_files, hps.data)
eval_dataset = TextAudioSpeakerLoader(hps.data.validation_files, hps.data, symbols)
eval_loader = DataLoader(eval_dataset, num_workers=0, shuffle=False,
batch_size=hps.train.batch_size, pin_memory=True,
drop_last=False, collate_fn=collate_fn)
@@ -97,10 +98,30 @@ def run(rank, n_gpus, hps):
net_d = MultiPeriodDiscriminator(hps.model.use_spectral_norm).cuda(rank)
# load existing model
_, _, _, _ = utils.load_checkpoint("./pretrained_models/G_trilingual.pth", net_g, None)
_, _, _, _ = utils.load_checkpoint("./pretrained_models/D_trilingual.pth", net_d, None)
epoch_str = 1
global_step = 0
if hps.cont:
try:
_, _, _, epoch_str = utils.load_checkpoint(utils.latest_checkpoint_path(hps.model_dir, "G_latest.pth"), net_g, None)
_, _, _, epoch_str = utils.load_checkpoint(utils.latest_checkpoint_path(hps.model_dir, "D_latest.pth"), net_d, None)
global_step = (epoch_str - 1) * len(train_loader)
except:
print("Failed to find latest checkpoint, loading G_0.pth...")
if hps.train_with_pretrained_model:
print("Train with pretrained model...")
_, _, _, epoch_str = utils.load_checkpoint("./pretrained_models/G_0.pth", net_g, None)
_, _, _, epoch_str = utils.load_checkpoint("./pretrained_models/D_0.pth", net_d, None)
else:
print("Train without pretrained model...")
epoch_str = 1
global_step = 0
else:
if hps.train_with_pretrained_model:
print("Train with pretrained model...")
_, _, _, epoch_str = utils.load_checkpoint("./pretrained_models/G_0.pth", net_g, None)
_, _, _, epoch_str = utils.load_checkpoint("./pretrained_models/D_0.pth", net_d, None)
else:
print("Train without pretrained model...")
epoch_str = 1
global_step = 0
# freeze all other layers except speaker embedding
for p in net_g.parameters():
p.requires_grad = True
@@ -239,18 +260,50 @@ def train_and_evaluate(rank, epoch, hps, nets, optims, schedulers, scaler, loade
if global_step % hps.train.eval_interval == 0:
evaluate(hps, net_g, eval_loader, writer_eval)
utils.save_checkpoint(net_g, None, hps.train.learning_rate, epoch, os.path.join(hps.model_dir, "G_{}.pth".format(global_step)))
utils.save_checkpoint(net_g, None, hps.train.learning_rate, epoch,
os.path.join(hps.model_dir, "G_latest.pth".format(global_step)))
# utils.save_checkpoint(net_d, optim_d, hps.train.learning_rate, epoch, os.path.join(hps.model_dir, "D_{}.pth".format(global_step)))
old_g=os.path.join(hps.model_dir, "G_{}.pth".format(global_step-4000))
# old_d=os.path.join(hps.model_dir, "D_{}.pth".format(global_step-400))
if os.path.exists(old_g):
os.remove(old_g)
# if os.path.exists(old_d):
# os.remove(old_d)
os.path.join(hps.model_dir, "G_latest.pth"))
utils.save_checkpoint(net_d, None, hps.train.learning_rate, epoch,
os.path.join(hps.model_dir, "D_latest.pth"))
# save to google drive
if os.path.exists("/content/drive/MyDrive/"):
utils.save_checkpoint(net_g, None, hps.train.learning_rate, epoch,
os.path.join("/content/drive/MyDrive/", "G_latest.pth"))
utils.save_checkpoint(net_d, None, hps.train.learning_rate, epoch,
os.path.join("/content/drive/MyDrive/", "D_latest.pth"))
if hps.preserved > 0:
utils.save_checkpoint(net_g, None, hps.train.learning_rate, epoch,
os.path.join(hps.model_dir, "G_{}.pth".format(global_step)))
utils.save_checkpoint(net_d, None, hps.train.learning_rate, epoch,
os.path.join(hps.model_dir, "D_{}.pth".format(global_step)))
old_g = utils.oldest_checkpoint_path(hps.model_dir, "G_[0-9]*.pth",
preserved=hps.preserved) # Preserve 4 (default) historical checkpoints.
old_d = utils.oldest_checkpoint_path(hps.model_dir, "D_[0-9]*.pth", preserved=hps.preserved)
if os.path.exists(old_g):
print(f"remove {old_g}")
os.remove(old_g)
if os.path.exists(old_d):
print(f"remove {old_d}")
os.remove(old_d)
if os.path.exists("/content/drive/MyDrive/"):
utils.save_checkpoint(net_g, None, hps.train.learning_rate, epoch,
os.path.join("/content/drive/MyDrive/", "G_{}.pth".format(global_step)))
utils.save_checkpoint(net_d, None, hps.train.learning_rate, epoch,
os.path.join("/content/drive/MyDrive/", "D_{}.pth".format(global_step)))
old_g = utils.oldest_checkpoint_path("/content/drive/MyDrive/", "G_[0-9]*.pth",
preserved=hps.preserved) # Preserve 4 (default) historical checkpoints.
old_d = utils.oldest_checkpoint_path("/content/drive/MyDrive/", "D_[0-9]*.pth", preserved=hps.preserved)
if os.path.exists(old_g):
print(f"remove {old_g}")
os.remove(old_g)
if os.path.exists(old_d):
print(f"remove {old_d}")
os.remove(old_d)
global_step += 1
if global_step == hps.n_steps + 1:
if epoch > hps.max_epochs:
print("Maximum epoch reached, closing training...")
exit()
if rank == 0:
@@ -316,4 +369,4 @@ def evaluate(hps, generator, eval_loader, writer_eval):
if __name__ == "__main__":
main()
main()
+3 -3
View File
@@ -64,7 +64,7 @@ def spectrogram_torch(y, n_fft, sampling_rate, hop_size, win_size, center=False)
y = y.squeeze(1)
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
center=center, pad_mode='reflect', normalized=False, onesided=True)
center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
return spec
@@ -101,8 +101,8 @@ def mel_spectrogram_torch(y, n_fft, num_mels, sampling_rate, hop_size, win_size,
y = torch.nn.functional.pad(y.unsqueeze(1), (int((n_fft-hop_size)/2), int((n_fft-hop_size)/2)), mode='reflect')
y = y.squeeze(1)
spec = torch.stft(y, n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
center=center, pad_mode='reflect', normalized=False, onesided=True)
spec = torch.stft(y.float(), n_fft, hop_length=hop_size, win_length=win_size, window=hann_window[wnsize_dtype_device],
center=center, pad_mode='reflect', normalized=False, onesided=True, return_complex=False)
spec = torch.sqrt(spec.pow(2).sum(-1) + 1e-6)
+5 -7
View File
@@ -1,16 +1,15 @@
import copy
import math
import torch
from torch import nn
from torch.nn import Conv1d, ConvTranspose1d, Conv2d
from torch.nn import functional as F
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
import attentions
import commons
import modules
import attentions
import monotonic_align
from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
from commons import init_weights, get_padding
@@ -386,7 +385,6 @@ class MultiPeriodDiscriminator(torch.nn.Module):
return y_d_rs, y_d_gs, fmap_rs, fmap_gs
class SynthesizerTrn(nn.Module):
"""
Synthesizer for Training
@@ -453,7 +451,7 @@ class SynthesizerTrn(nn.Module):
else:
self.dp = DurationPredictor(hidden_channels, 256, 3, 0.5, gin_channels=gin_channels)
if n_speakers > 1:
if n_speakers >= 1:
self.emb_g = nn.Embedding(n_speakers, gin_channels)
def forward(self, x, x_lengths, y, y_lengths, sid=None):
-402
View File
@@ -1,402 +0,0 @@
import math
import torch
from torch import nn
from torch.nn import functional as F
import commons
import modules
import attentions
from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
from commons import init_weights, get_padding
class StochasticDurationPredictor(nn.Module):
def __init__(self, in_channels, filter_channels, kernel_size, p_dropout, n_flows=4, gin_channels=0):
super().__init__()
filter_channels = in_channels  # it needs to be removed in a future version.
self.in_channels = in_channels
self.filter_channels = filter_channels
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.n_flows = n_flows
self.gin_channels = gin_channels
self.log_flow = modules.Log()
self.flows = nn.ModuleList()
self.flows.append(modules.ElementwiseAffine(2))
for i in range(n_flows):
self.flows.append(modules.ConvFlow(2, filter_channels, kernel_size, n_layers=3))
self.flows.append(modules.Flip())
self.post_pre = nn.Conv1d(1, filter_channels, 1)
self.post_proj = nn.Conv1d(filter_channels, filter_channels, 1)
self.post_convs = modules.DDSConv(filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout)
self.post_flows = nn.ModuleList()
self.post_flows.append(modules.ElementwiseAffine(2))
for i in range(4):
self.post_flows.append(modules.ConvFlow(2, filter_channels, kernel_size, n_layers=3))
self.post_flows.append(modules.Flip())
self.pre = nn.Conv1d(in_channels, filter_channels, 1)
self.proj = nn.Conv1d(filter_channels, filter_channels, 1)
self.convs = modules.DDSConv(filter_channels, kernel_size, n_layers=3, p_dropout=p_dropout)
if gin_channels != 0:
self.cond = nn.Conv1d(gin_channels, filter_channels, 1)
def forward(self, x, x_mask, w=None, g=None, reverse=False, noise_scale=1.0):
x = torch.detach(x)
x = self.pre(x)
if g is not None:
g = torch.detach(g)
x = x + self.cond(g)
x = self.convs(x, x_mask)
x = self.proj(x) * x_mask
if not reverse:
flows = self.flows
assert w is not None
logdet_tot_q = 0
h_w = self.post_pre(w)
h_w = self.post_convs(h_w, x_mask)
h_w = self.post_proj(h_w) * x_mask
e_q = torch.randn(w.size(0), 2, w.size(2)).to(device=x.device, dtype=x.dtype) * x_mask
z_q = e_q
for flow in self.post_flows:
z_q, logdet_q = flow(z_q, x_mask, g=(x + h_w))
logdet_tot_q += logdet_q
z_u, z1 = torch.split(z_q, [1, 1], 1)
u = torch.sigmoid(z_u) * x_mask
z0 = (w - u) * x_mask
logdet_tot_q += torch.sum((F.logsigmoid(z_u) + F.logsigmoid(-z_u)) * x_mask, [1,2])
logq = torch.sum(-0.5 * (math.log(2*math.pi) + (e_q**2)) * x_mask, [1,2]) - logdet_tot_q
logdet_tot = 0
z0, logdet = self.log_flow(z0, x_mask)
logdet_tot += logdet
z = torch.cat([z0, z1], 1)
for flow in flows:
z, logdet = flow(z, x_mask, g=x, reverse=reverse)
logdet_tot = logdet_tot + logdet
nll = torch.sum(0.5 * (math.log(2*math.pi) + (z**2)) * x_mask, [1,2]) - logdet_tot
return nll + logq # [b]
else:
flows = list(reversed(self.flows))
flows = flows[:-2] + [flows[-1]] # remove a useless vflow
z = torch.randn(x.size(0), 2, x.size(2)).to(device=x.device, dtype=x.dtype) * noise_scale
for flow in flows:
z = flow(z, x_mask, g=x, reverse=reverse)
z0, z1 = torch.split(z, [1, 1], 1)
logw = z0
return logw
class DurationPredictor(nn.Module):
def __init__(self, in_channels, filter_channels, kernel_size, p_dropout, gin_channels=0):
super().__init__()
self.in_channels = in_channels
self.filter_channels = filter_channels
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.gin_channels = gin_channels
self.drop = nn.Dropout(p_dropout)
self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size, padding=kernel_size//2)
self.norm_1 = modules.LayerNorm(filter_channels)
self.conv_2 = nn.Conv1d(filter_channels, filter_channels, kernel_size, padding=kernel_size//2)
self.norm_2 = modules.LayerNorm(filter_channels)
self.proj = nn.Conv1d(filter_channels, 1, 1)
if gin_channels != 0:
self.cond = nn.Conv1d(gin_channels, in_channels, 1)
def forward(self, x, x_mask, g=None):
x = torch.detach(x)
if g is not None:
g = torch.detach(g)
x = x + self.cond(g)
x = self.conv_1(x * x_mask)
x = torch.relu(x)
x = self.norm_1(x)
x = self.drop(x)
x = self.conv_2(x * x_mask)
x = torch.relu(x)
x = self.norm_2(x)
x = self.drop(x)
x = self.proj(x * x_mask)
return x * x_mask
class TextEncoder(nn.Module):
def __init__(self,
n_vocab,
out_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout):
super().__init__()
self.n_vocab = n_vocab
self.out_channels = out_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.emb = nn.Embedding(n_vocab, hidden_channels)
nn.init.normal_(self.emb.weight, 0.0, hidden_channels**-0.5)
self.encoder = attentions.Encoder(
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout)
self.proj= nn.Conv1d(hidden_channels, out_channels * 2, 1)
def forward(self, x, x_lengths):
x = self.emb(x) * math.sqrt(self.hidden_channels) # [b, t, h]
x = torch.transpose(x, 1, -1) # [b, h, t]
x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
x = self.encoder(x * x_mask, x_mask)
stats = self.proj(x) * x_mask
m, logs = torch.split(stats, self.out_channels, dim=1)
return x, m, logs, x_mask
class ResidualCouplingBlock(nn.Module):
def __init__(self,
channels,
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
n_flows=4,
gin_channels=0):
super().__init__()
self.channels = channels
self.hidden_channels = hidden_channels
self.kernel_size = kernel_size
self.dilation_rate = dilation_rate
self.n_layers = n_layers
self.n_flows = n_flows
self.gin_channels = gin_channels
self.flows = nn.ModuleList()
for i in range(n_flows):
self.flows.append(modules.ResidualCouplingLayer(channels, hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels, mean_only=True))
self.flows.append(modules.Flip())
def forward(self, x, x_mask, g=None, reverse=False):
if not reverse:
for flow in self.flows:
x, _ = flow(x, x_mask, g=g, reverse=reverse)
else:
for flow in reversed(self.flows):
x = flow(x, x_mask, g=g, reverse=reverse)
return x
class PosteriorEncoder(nn.Module):
def __init__(self,
in_channels,
out_channels,
hidden_channels,
kernel_size,
dilation_rate,
n_layers,
gin_channels=0):
super().__init__()
self.in_channels = in_channels
self.out_channels = out_channels
self.hidden_channels = hidden_channels
self.kernel_size = kernel_size
self.dilation_rate = dilation_rate
self.n_layers = n_layers
self.gin_channels = gin_channels
self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
self.enc = modules.WN(hidden_channels, kernel_size, dilation_rate, n_layers, gin_channels=gin_channels)
self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
def forward(self, x, x_lengths, g=None):
x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(x.dtype)
x = self.pre(x) * x_mask
x = self.enc(x, x_mask, g=g)
stats = self.proj(x) * x_mask
m, logs = torch.split(stats, self.out_channels, dim=1)
z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
return z, m, logs, x_mask
class Generator(torch.nn.Module):
def __init__(self, initial_channel, resblock, resblock_kernel_sizes, resblock_dilation_sizes, upsample_rates, upsample_initial_channel, upsample_kernel_sizes, gin_channels=0):
super(Generator, self).__init__()
self.num_kernels = len(resblock_kernel_sizes)
self.num_upsamples = len(upsample_rates)
self.conv_pre = Conv1d(initial_channel, upsample_initial_channel, 7, 1, padding=3)
resblock = modules.ResBlock1 if resblock == '1' else modules.ResBlock2
self.ups = nn.ModuleList()
for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
self.ups.append(weight_norm(
ConvTranspose1d(upsample_initial_channel//(2**i), upsample_initial_channel//(2**(i+1)),
k, u, padding=(k-u)//2)))
self.resblocks = nn.ModuleList()
for i in range(len(self.ups)):
ch = upsample_initial_channel//(2**(i+1))
for j, (k, d) in enumerate(zip(resblock_kernel_sizes, resblock_dilation_sizes)):
self.resblocks.append(resblock(ch, k, d))
self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
self.ups.apply(init_weights)
if gin_channels != 0:
self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
def forward(self, x, g=None):
x = self.conv_pre(x)
if g is not None:
x = x + self.cond(g)
for i in range(self.num_upsamples):
x = F.leaky_relu(x, modules.LRELU_SLOPE)
x = self.ups[i](x)
xs = None
for j in range(self.num_kernels):
if xs is None:
xs = self.resblocks[i*self.num_kernels+j](x)
else:
xs += self.resblocks[i*self.num_kernels+j](x)
x = xs / self.num_kernels
x = F.leaky_relu(x)
x = self.conv_post(x)
x = torch.tanh(x)
return x
def remove_weight_norm(self):
print('Removing weight norm...')
for l in self.ups:
remove_weight_norm(l)
for l in self.resblocks:
l.remove_weight_norm()
class SynthesizerTrn(nn.Module):
"""
Synthesizer for Training
"""
def __init__(self,
n_vocab,
spec_channels,
segment_size,
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout,
resblock,
resblock_kernel_sizes,
resblock_dilation_sizes,
upsample_rates,
upsample_initial_channel,
upsample_kernel_sizes,
n_speakers=0,
gin_channels=0,
use_sdp=True,
**kwargs):
super().__init__()
self.n_vocab = n_vocab
self.spec_channels = spec_channels
self.inter_channels = inter_channels
self.hidden_channels = hidden_channels
self.filter_channels = filter_channels
self.n_heads = n_heads
self.n_layers = n_layers
self.kernel_size = kernel_size
self.p_dropout = p_dropout
self.resblock = resblock
self.resblock_kernel_sizes = resblock_kernel_sizes
self.resblock_dilation_sizes = resblock_dilation_sizes
self.upsample_rates = upsample_rates
self.upsample_initial_channel = upsample_initial_channel
self.upsample_kernel_sizes = upsample_kernel_sizes
self.segment_size = segment_size
self.n_speakers = n_speakers
self.gin_channels = gin_channels
self.use_sdp = use_sdp
self.enc_p = TextEncoder(n_vocab,
inter_channels,
hidden_channels,
filter_channels,
n_heads,
n_layers,
kernel_size,
p_dropout)
self.dec = Generator(inter_channels, resblock, resblock_kernel_sizes, resblock_dilation_sizes, upsample_rates, upsample_initial_channel, upsample_kernel_sizes, gin_channels=gin_channels)
self.enc_q = PosteriorEncoder(spec_channels, inter_channels, hidden_channels, 5, 1, 16, gin_channels=gin_channels)
self.flow = ResidualCouplingBlock(inter_channels, hidden_channels, 5, 1, 4, gin_channels=gin_channels)
if use_sdp:
self.dp = StochasticDurationPredictor(hidden_channels, 192, 3, 0.5, 4, gin_channels=gin_channels)
else:
self.dp = DurationPredictor(hidden_channels, 256, 3, 0.5, gin_channels=gin_channels)
if n_speakers > 1:
self.emb_g = nn.Embedding(n_speakers, gin_channels)
def infer(self, x, x_lengths, sid=None, noise_scale=1, length_scale=1, noise_scale_w=1., max_len=None):
x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths)
if self.n_speakers > 0:
g = self.emb_g(sid).unsqueeze(-1) # [b, h, 1]
else:
g = None
if self.use_sdp:
logw = self.dp(x, x_mask, g=g, reverse=True, noise_scale=noise_scale_w)
else:
logw = self.dp(x, x_mask, g=g)
w = torch.exp(logw) * x_mask * length_scale
w_ceil = torch.ceil(w)
y_lengths = torch.clamp_min(torch.sum(w_ceil, [1, 2]), 1).long()
y_mask = torch.unsqueeze(commons.sequence_mask(y_lengths, None), 1).to(x_mask.dtype)
attn_mask = torch.unsqueeze(x_mask, 2) * torch.unsqueeze(y_mask, -1)
attn = commons.generate_path(w_ceil, attn_mask)
m_p = torch.matmul(attn.squeeze(1), m_p.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']
logs_p = torch.matmul(attn.squeeze(1), logs_p.transpose(1, 2)).transpose(1, 2) # [b, t', t], [b, t, d] -> [b, d, t']
z_p = m_p + torch.randn_like(m_p) * torch.exp(logs_p) * noise_scale
z = self.flow(z_p, y_mask, g=g, reverse=True)
o = self.dec((z * y_mask)[:,:,:max_len], g=g)
return o, attn, y_mask, (z, z_p, m_p, logs_p)
def voice_conversion(self, y, y_lengths, sid_src, sid_tgt):
assert self.n_speakers > 0, "n_speakers have to be larger than 0."
g_src = self.emb_g(sid_src).unsqueeze(-1)
g_tgt = self.emb_g(sid_tgt).unsqueeze(-1)
z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g_src)
z_p = self.flow(z, y_mask, g=g_src)
z_hat = self.flow(z_p, y_mask, g=g_tgt, reverse=True)
o_hat = self.dec(z_hat * y_mask, g=g_tgt)
return o_hat, y_mask, (z, z_p, z_hat)
+1 -1
View File
@@ -69,7 +69,7 @@ class ConvReluNorm(nn.Module):
class DDSConv(nn.Module):
"""
Dialted and Depth-Separable Convolution
Dilated and Depth-Separable Convolution
"""
def __init__(self, channels, kernel_size, n_layers, p_dropout=0.):
super().__init__()
-28
View File
@@ -1,28 +0,0 @@
import os
MIN_VOICE_NUM = 10
if __name__ == "__main__":
# load sampled_audio4ft
with open("sampled_audio4ft.txt", 'r', encoding='utf-8') as f:
old_annos = f.readlines()
num_old_voices = len(old_annos)
# load user text
with open("./user_voice/user_voice.txt.cleaned", 'r', encoding='utf-8') as f:
user_annos = f.readlines()
# check how many voices are recorded
wavfiles = [file for file in list(os.walk("./user_voice"))[0][2] if file.endswith(".wav")]
num_user_voices = len(wavfiles)
if num_user_voices < MIN_VOICE_NUM:
raise Exception(f"You need to record at least {MIN_VOICE_NUM} voices for fine-tuning!")
# user voices need to occupy 1/4 of the total dataset
duplicate = num_old_voices // num_user_voices // 3
# find corresponding existing annotation lines
actual_user_annos = ["./user_voice/" + line for line in user_annos if line.split("|")[0] in wavfiles]
final_annos = old_annos + actual_user_annos * duplicate
# save annotation file
with open("final_annotation_train.txt", 'w', encoding='utf-8') as f:
for line in final_annos:
f.write(line)
# save annotation file for validation
with open("final_annotation_val.txt", 'w', encoding='utf-8') as f:
for line in actual_user_annos:
f.write(line)
+154
View File
@@ -0,0 +1,154 @@
import os
import argparse
import json
import sys
sys.setrecursionlimit(500000)  # Avoid "RecursionError: maximum recursion depth exceeded while calling a Python object"; adjust the limit as needed.
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--add_auxiliary_data", type=bool, help="Whether to add extra data as fine-tuning helper")
parser.add_argument("--languages", default="CJE")
args = parser.parse_args()
if args.languages == "CJE":
langs = ["[ZH]", "[JA]", "[EN]"]
elif args.languages == "CJ":
langs = ["[ZH]", "[JA]"]
elif args.languages == "C":
langs = ["[ZH]"]
new_annos = []
# Source 1: transcribed short audios
if os.path.exists("short_character_anno.txt"):
with open("short_character_anno.txt", 'r', encoding='utf-8') as f:
short_character_anno = f.readlines()
new_annos += short_character_anno
# Source 2: transcribed long audio segments
if os.path.exists("./long_character_anno.txt"):
with open("./long_character_anno.txt", 'r', encoding='utf-8') as f:
long_character_anno = f.readlines()
new_annos += long_character_anno
# Get all speaker names
speakers = []
for line in new_annos:
path, speaker, text = line.split("|")
if speaker not in speakers:
speakers.append(speaker)
assert (len(speakers) != 0), "No audio file found. Please check your uploaded file structure."
# Source 3 (Optional): sampled audios as extra training helpers
if args.add_auxiliary_data:
with open("./sampled_audio4ft.txt", 'r', encoding='utf-8') as f:
old_annos = f.readlines()
# filter old_annos according to supported languages
filtered_old_annos = []
for line in old_annos:
for lang in langs:
if lang in line:
filtered_old_annos.append(line)
old_annos = filtered_old_annos
for line in old_annos:
path, speaker, text = line.split("|")
if speaker not in speakers:
speakers.append(speaker)
num_old_voices = len(old_annos)
num_new_voices = len(new_annos)
# STEP 1: balance number of new & old voices
cc_duplicate = num_old_voices // num_new_voices
if cc_duplicate == 0:
cc_duplicate = 1
# STEP 2: modify config file
with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f:
hps = json.load(f)
# assign ids to new speakers
speaker2id = {}
for i, speaker in enumerate(speakers):
speaker2id[speaker] = i
# modify n_speakers
hps['data']["n_speakers"] = len(speakers)
# overwrite speaker names
hps['speakers'] = speaker2id
hps['train']['log_interval'] = 10
hps['train']['eval_interval'] = 100
hps['train']['batch_size'] = 16
hps['data']['training_files'] = "final_annotation_train.txt"
hps['data']['validation_files'] = "final_annotation_val.txt"
# save modified config
with open("./configs/modified_finetune_speaker.json", 'w', encoding='utf-8') as f:
json.dump(hps, f, indent=2)
# STEP 3: clean annotations, replace speaker names with assigned speaker IDs
import text
cleaned_new_annos = []
for i, line in enumerate(new_annos):
path, speaker, txt = line.split("|")
if len(txt) > 150:
continue
cleaned_text = text._clean_text(txt, hps['data']['text_cleaners'])
cleaned_text += "\n" if not cleaned_text.endswith("\n") else ""
cleaned_new_annos.append(path + "|" + str(speaker2id[speaker]) + "|" + cleaned_text)
cleaned_old_annos = []
for i, line in enumerate(old_annos):
path, speaker, txt = line.split("|")
if len(txt) > 150:
continue
cleaned_text = text._clean_text(txt, hps['data']['text_cleaners'])
cleaned_text += "\n" if not cleaned_text.endswith("\n") else ""
cleaned_old_annos.append(path + "|" + str(speaker2id[speaker]) + "|" + cleaned_text)
# merge with old annotation
final_annos = cleaned_old_annos + cc_duplicate * cleaned_new_annos
# save annotation file
with open("./final_annotation_train.txt", 'w', encoding='utf-8') as f:
for line in final_annos:
f.write(line)
# save annotation file for validation
with open("./final_annotation_val.txt", 'w', encoding='utf-8') as f:
for line in cleaned_new_annos:
f.write(line)
print("finished")
else:
# Do not add extra helper data
# STEP 1: modify config file
with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f:
hps = json.load(f)
# assign ids to new speakers
speaker2id = {}
for i, speaker in enumerate(speakers):
speaker2id[speaker] = i
# modify n_speakers
hps['data']["n_speakers"] = len(speakers)
# overwrite speaker names
hps['speakers'] = speaker2id
hps['train']['log_interval'] = 10
hps['train']['eval_interval'] = 100
hps['train']['batch_size'] = 16
hps['data']['training_files'] = "final_annotation_train.txt"
hps['data']['validation_files'] = "final_annotation_val.txt"
# save modified config
with open("./configs/modified_finetune_speaker.json", 'w', encoding='utf-8') as f:
json.dump(hps, f, indent=2)
# STEP 2: clean annotations, replace speaker names with assigned speaker IDs
import text
cleaned_new_annos = []
for i, line in enumerate(new_annos):
path, speaker, txt = line.split("|")
if len(txt) > 150:
continue
cleaned_text = text._clean_text(txt, hps['data']['text_cleaners']).replace("[ZH]", "")
cleaned_text += "\n" if not cleaned_text.endswith("\n") else ""
cleaned_new_annos.append(path + "|" + str(speaker2id[speaker]) + "|" + cleaned_text)
final_annos = cleaned_new_annos
# save annotation file
with open("./final_annotation_train.txt", 'w', encoding='utf-8') as f:
for line in final_annos:
f.write(line)
# save annotation file for validation
with open("./final_annotation_val.txt", 'w', encoding='utf-8') as f:
for line in cleaned_new_annos:
f.write(line)
print("finished")
+10 -7
View File
@@ -1,13 +1,15 @@
Cython
librosa
numpy
Cython==0.29.21
librosa==0.9.2
matplotlib==3.3.1
scikit-learn==1.0.2
scipy
numpy==1.21.6
tensorboard
torch --extra-index-url https://download.pytorch.org/whl/cu116
torchvision --extra-index-url https://download.pytorch.org/whl/cu116
torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
torch
torchvision
torchaudio
unidecode
pyopenjtalk
pyopenjtalk-prebuilt
jamo
pypinyin
jieba
@@ -20,4 +22,5 @@ indic_transliteration==2.3.37
num_thai==0.0.5
opencc==1.1.1
demucs
git+https://github.com/openai/whisper.git
gradio
-8
View File
@@ -1,8 +0,0 @@
Cython
librosa
numpy
scipy
torch
torchaudio
unidecode
gradio
+22
View File
@@ -0,0 +1,22 @@
import os
import json
import torchaudio
raw_audio_dir = "./raw_audio/"
denoise_audio_dir = "./denoised_audio/"
filelist = list(os.walk(raw_audio_dir))[0][2]
# 2023/4/21: Get the target sampling rate
with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f:
hps = json.load(f)
target_sr = hps['data']['sampling_rate']
for file in filelist:
if file.endswith(".wav"):
os.system(f"demucs --two-stems=vocals {raw_audio_dir}{file}")
for file in filelist:
file = file.replace(".wav", "")
wav, sr = torchaudio.load(f"./separated/htdemucs/{file}/vocals.wav", frame_offset=0, num_frames=-1, normalize=True,
channels_first=True)
# merge two channels into one
wav = wav.mean(dim=0).unsqueeze(0)
if sr != target_sr:
wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(wav)
torchaudio.save(denoise_audio_dir + file + ".wav", wav, target_sr, channels_first=True)
+4
View File
@@ -0,0 +1,4 @@
from google.colab import files
files.download("./G_latest.pth")
files.download("./finetune_speaker.json")
files.download("./moegoe_config.json")
+37
View File
@@ -0,0 +1,37 @@
import os
import random
import shutil
from concurrent.futures import ThreadPoolExecutor
from google.colab import files
basepath = os.getcwd()
uploaded = files.upload()  # upload files
for filename in uploaded.keys():
assert (filename.endswith(".txt")), "speaker-videolink info could only be .txt file!"
shutil.move(os.path.join(basepath, filename), os.path.join("./speaker_links.txt"))
def generate_infos():
infos = []
with open("./speaker_links.txt", 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in lines:
line = line.replace("\n", "").replace(" ", "")
if line == "":
continue
speaker, link = line.split("|")
filename = speaker + "_" + str(random.randint(0, 1000000))
infos.append({"link": link, "filename": filename})
return infos
def download_video(info):
link = info["link"]
filename = info["filename"]
os.system(f"youtube-dl -f 0 {link} -o ./video_data/{filename}.mp4 --no-check-certificate")
if __name__ == "__main__":
infos = generate_infos()
with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
executor.map(download_video, infos)
+75
View File
@@ -0,0 +1,75 @@
from moviepy.editor import AudioFileClip
import whisper
import os
import json
import torchaudio
import librosa
import torch
import argparse
parent_dir = "./denoised_audio/"
filelist = list(os.walk(parent_dir))[0][2]
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--languages", default="CJE")
parser.add_argument("--whisper_size", default="medium")
args = parser.parse_args()
if args.languages == "CJE":
lang2token = {
'zh': "[ZH]",
'ja': "[JA]",
"en": "[EN]",
}
elif args.languages == "CJ":
lang2token = {
'zh': "[ZH]",
'ja': "[JA]",
}
elif args.languages == "C":
lang2token = {
'zh': "[ZH]",
}
assert(torch.cuda.is_available()), "Please enable GPU in order to run Whisper!"
with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f:
hps = json.load(f)
target_sr = hps['data']['sampling_rate']
model = whisper.load_model(args.whisper_size)
speaker_annos = []
for file in filelist:
print(f"transcribing {parent_dir + file}...\n")
options = dict(beam_size=5, best_of=5)
transcribe_options = dict(task="transcribe", **options)
result = model.transcribe(parent_dir + file, word_timestamps=True, **transcribe_options)
segments = result["segments"]
# result = model.transcribe(parent_dir + file)
lang = result['language']
if result['language'] not in list(lang2token.keys()):
print(f"{lang} not supported, ignoring...\n")
continue
# segment audio based on segment results
basename = file[:-len(".wav")] if file.endswith(".wav") else file  # str.rstrip would also strip trailing '.', 'w', 'a', 'v' characters
character_name = basename.split("_")[0]
code = basename.split("_")[1]
if not os.path.exists("./segmented_character_voice/" + character_name):
os.mkdir("./segmented_character_voice/" + character_name)
wav, sr = torchaudio.load(parent_dir + file, frame_offset=0, num_frames=-1, normalize=True,
channels_first=True)
for i, seg in enumerate(result['segments']):
start_time = seg['start']
end_time = seg['end']
text = seg['text']
text = lang2token[lang] + text.replace("\n", "") + lang2token[lang]
text = text + "\n"
wav_seg = wav[:, int(start_time*sr):int(end_time*sr)]
wav_seg_name = f"{character_name}_{code}_{i}.wav"
savepth = "./segmented_character_voice/" + character_name + "/" + wav_seg_name
speaker_annos.append(savepth + "|" + character_name + "|" + text)
print(f"Transcribed segment: {speaker_annos[-1]}")
# trimmed_wav_seg = librosa.effects.trim(wav_seg.squeeze().numpy())
# trimmed_wav_seg = torch.tensor(trimmed_wav_seg[0]).unsqueeze(0)
torchaudio.save(savepth, wav_seg, target_sr, channels_first=True)
if len(speaker_annos) == 0:
print("Warning: no long audios & videos found, this IS expected if you have only uploaded short audios")
print("this IS NOT expected if you have uploaded any long audios, videos or video links. Please check your file structure or make sure your audio/video language is supported.")
with open("./long_character_anno.txt", 'w', encoding='utf-8') as f:
for line in speaker_annos:
f.write(line)
+37
View File
@@ -0,0 +1,37 @@
import torch
import argparse
import json
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", type=str, default="./OUTPUT_MODEL/G_latest.pth")
parser.add_argument("--config_dir", type=str, default="./configs/modified_finetune_speaker.json")
args = parser.parse_args()
model_sd = torch.load(args.model_dir, map_location='cpu')
with open(args.config_dir, 'r', encoding='utf-8') as f:
hps = json.load(f)
valid_speakers = list(hps['speakers'].keys())
if hps['data']['n_speakers'] > len(valid_speakers):
new_emb_g = torch.zeros([len(valid_speakers), 256])
old_emb_g = model_sd['model']['emb_g.weight']
for i, speaker in enumerate(valid_speakers):
new_emb_g[i, :] = old_emb_g[hps['speakers'][speaker], :]
hps['speakers'][speaker] = i
hps['data']['n_speakers'] = len(valid_speakers)
model_sd['model']['emb_g.weight'] = new_emb_g
with open("./finetune_speaker.json", 'w', encoding='utf-8') as f:
json.dump(hps, f, indent=2)
torch.save(model_sd, "./G_latest.pth")
else:
with open("./finetune_speaker.json", 'w', encoding='utf-8') as f:
json.dump(hps, f, indent=2)
torch.save(model_sd, "./G_latest.pth")
# save another config file copy in MoeGoe format
hps['speakers'] = valid_speakers
with open("./moegoe_config.json", 'w', encoding='utf-8') as f:
json.dump(hps, f, indent=2)
+20
View File
@@ -0,0 +1,20 @@
import os
import json
import argparse
import torchaudio
def main():
with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f:
hps = json.load(f)
target_sr = hps['data']['sampling_rate']
filelist = list(os.walk("./sampled_audio4ft"))[0][2]
if target_sr != 22050:
for wavfile in filelist:
wav, sr = torchaudio.load("./sampled_audio4ft" + "/" + wavfile, frame_offset=0, num_frames=-1,
normalize=True, channels_first=True)
wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(wav)
torchaudio.save("./sampled_audio4ft" + "/" + wavfile, wav, target_sr, channels_first=True)
if __name__ == "__main__":
main()
+121
View File
@@ -0,0 +1,121 @@
import whisper
import os
import json
import torchaudio
import argparse
import torch
lang2token = {
'zh': "[ZH]",
'ja': "[JA]",
"en": "[EN]",
}
def transcribe_one(audio_path):
# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio(audio_path)
audio = whisper.pad_or_trim(audio)
# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
lang = max(probs, key=probs.get)
# decode the audio
options = whisper.DecodingOptions(beam_size=5)
result = whisper.decode(model, mel, options)
# print the recognized text
print(result.text)
return lang, result.text
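# Note on the language pick above: detect_language() returns per-language probabilities,
# and max(probs, key=probs.get) selects the code with the highest value, e.g. (made-up numbers):
# probs = {"zh": 0.91, "ja": 0.06, "en": 0.03}
# max(probs, key=probs.get)  # -> "zh"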
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--languages", default="CJE")
parser.add_argument("--whisper_size", default="medium")
args = parser.parse_args()
if args.languages == "CJE":
lang2token = {
'zh': "[ZH]",
'ja': "[JA]",
"en": "[EN]",
}
elif args.languages == "CJ":
lang2token = {
'zh': "[ZH]",
'ja': "[JA]",
}
elif args.languages == "C":
lang2token = {
'zh': "[ZH]",
}
assert (torch.cuda.is_available()), "Please enable GPU in order to run Whisper!"
model = whisper.load_model(args.whisper_size)
parent_dir = "./custom_character_voice/"
speaker_names = list(os.walk(parent_dir))[0][1]
speaker_annos = []
total_files = sum([len(files) for r, d, files in os.walk(parent_dir)])
# resample audios
# 2023/4/21: Get the target sampling rate
with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f:
hps = json.load(f)
target_sr = hps['data']['sampling_rate']
processed_files = 0
for speaker in speaker_names:
for i, wavfile in enumerate(list(os.walk(parent_dir + speaker))[0][2]):
# try to load file as audio
if wavfile.startswith("processed_"):
continue
try:
wav, sr = torchaudio.load(parent_dir + speaker + "/" + wavfile, frame_offset=0, num_frames=-1, normalize=True,
channels_first=True)
wav = wav.mean(dim=0).unsqueeze(0)
if sr != target_sr:
wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=target_sr)(wav)
if wav.shape[1] / target_sr > 20: # duration in seconds of the (possibly resampled) audio
print(f"{wavfile} too long, ignoring\n")
continue
save_path = parent_dir + speaker + "/" + f"processed_{i}.wav"
torchaudio.save(save_path, wav, target_sr, channels_first=True)
# transcribe text
lang, text = transcribe_one(save_path)
if lang not in list(lang2token.keys()):
print(f"{lang} not supported, ignoring\n")
continue
text = lang2token[lang] + text + lang2token[lang] + "\n"
speaker_annos.append(save_path + "|" + speaker + "|" + text)
processed_files += 1
print(f"Processed: {processed_files}/{total_files}")
except Exception:
# skip files that fail to load or transcribe
continue
# # clean annotation
# import argparse
# import text
# from utils import load_filepaths_and_text
# for i, line in enumerate(speaker_annos):
# path, sid, txt = line.split("|")
# cleaned_text = text._clean_text(txt, ["cjke_cleaners2"])
# cleaned_text += "\n" if not cleaned_text.endswith("\n") else ""
# speaker_annos[i] = path + "|" + sid + "|" + cleaned_text
# write into annotation
if len(speaker_annos) == 0:
print("Warning: no short audios found, this IS expected if you have only uploaded long audios, videos or video links.")
print("this IS NOT expected if you have uploaded a zip file of short audios. Please check your file structure or make sure your audio language is supported.")
with open("short_character_anno.txt", 'w', encoding='utf-8') as f:
for line in speaker_annos:
f.write(line)
# import json
# # generate new config
# with open("./configs/finetune_speaker.json", 'r', encoding='utf-8') as f:
# hps = json.load(f)
# # modify n_speakers
# hps['data']["n_speakers"] = 1000 + len(speaker2id)
# # add speaker names
# for speaker in speaker_names:
# hps['speakers'][speaker] = speaker2id[speaker]
# # save modified config
# with open("./configs/modified_finetune_speaker.json", 'w', encoding='utf-8') as f:
# json.dump(hps, f, indent=2)
# print("finished")
+27
@@ -0,0 +1,27 @@
import os
from concurrent.futures import ThreadPoolExecutor
from moviepy.editor import AudioFileClip
video_dir = "./video_data/"
audio_dir = "./raw_audio/"
filelist = list(os.walk(video_dir))[0][2]
def generate_infos():
videos = []
for file in filelist:
if file.endswith(".mp4"):
videos.append(file)
return videos
def clip_file(file):
my_audio_clip = AudioFileClip(video_dir + file)
my_audio_clip.write_audiofile(audio_dir + os.path.splitext(file)[0] + ".wav")
if __name__ == "__main__":
infos = generate_infos()
with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
executor.map(clip_file, infos)
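A quick illustration of why os.path.splitext is used above rather than str.rstrip: rstrip removes a set of trailing characters, not a literal suffix, so a strip set containing the extension's characters can eat part of the file stem.
import os
print("12345324.mp4".rstrip(".mp4"))             # "1234532" -- the stem's trailing 4 is removed too
print(os.path.splitext("12345324.mp4")[0])       # "12345324"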
+28
@@ -0,0 +1,28 @@
from google.colab import files
import shutil
import os
import argparse
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--type", type=str, required=True, help="type of file to upload")
args = parser.parse_args()
file_type = args.type
basepath = os.getcwd()
uploaded = files.upload() # upload files through the browser
assert(file_type in ['zip', 'audio', 'video'])
if file_type == "zip":
upload_path = "./custom_character_voice/"
for filename in uploaded.keys():
# move the uploaded file to the target location
shutil.move(os.path.join(basepath, filename), os.path.join(upload_path, "custom_character_voice.zip"))
elif file_type == "audio":
upload_path = "./raw_audio/"
for filename in uploaded.keys():
# move the uploaded file to the target location
shutil.move(os.path.join(basepath, filename), os.path.join(upload_path, filename))
elif file_type == "video":
upload_path = "./video_data/"
for filename in uploaded.keys():
# move the uploaded file to the target location
shutil.move(os.path.join(basepath, filename), os.path.join(upload_path, filename))
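For clarity on the loops above: files.upload() in Colab returns a dict mapping each uploaded file name to its raw bytes, so iterating over uploaded.keys() yields the names to move; a made-up example of its shape:
uploaded_example = {"custom_character_voice.zip": b"PK\x03\x04..."}   # hypothetical return value
for filename in uploaded_example.keys():
    print(filename)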
+3 -2
@@ -30,14 +30,15 @@ def text_to_sequence(text, symbols, cleaner_names):
return sequence
def cleaned_text_to_sequence(cleaned_text):
def cleaned_text_to_sequence(cleaned_text, symbols):
'''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
Args:
text: string to convert to a sequence
Returns:
List of integers corresponding to the symbols in the text
'''
sequence = [_symbol_to_id[symbol] for symbol in cleaned_text if symbol in _symbol_to_id.keys()]
symbol_to_id = {s: i for i, s in enumerate(symbols)}
sequence = [symbol_to_id[symbol] for symbol in cleaned_text if symbol in symbol_to_id.keys()]
return sequence
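A small sketch of the change above: the symbol-to-id map is now built per call from whatever symbol list the active config supplies, instead of a module-level table; with a made-up symbol set:
symbols = ["_", "a", "b", "c"]                       # hypothetical symbol set from a config
symbol_to_id = {s: i for i, s in enumerate(symbols)}
print([symbol_to_id[s] for s in "cab" if s in symbol_to_id])   # [3, 1, 2]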
+1
@@ -31,6 +31,7 @@ def korean_cleaners(text):
def chinese_cleaners(text):
'''Pipeline for Chinese text'''
text = text.replace("[ZH]", "")
text = number_to_chinese(text)
text = chinese_to_bopomofo(text)
text = latin_to_bopomofo(text)
-45
@@ -1,45 +0,0 @@
0.wav|999|[ZH]所以,人的内在拥有对于人的幸福才是最关键的。[ZH]
1.wav|999|[ZH]正因为在大多数情形下人的自身内在相当贫乏,[ZH]
2.wav|999|[ZH]所以,那些再也用不着与生活的匮乏作斗争的人,[ZH]
3.wav|999|[ZH]他们之中的大多数从根本上还是感觉闷闷不乐。[ZH]
4.wav|999|[ZH]情形就跟那些还在生活的困苦中搏斗的人一般无异。[ZH]
5.wav|999|[ZH]他们内在空虚、感觉意识呆滞、思想匮乏,[ZH]
6.wav|999|[ZH]这些就驱使他们投入社交人群中。[ZH]
7.wav|999|[ZH]组成那些社交圈子的人也正是他们这一类的人。[ZH]
8.wav|999|[ZH]“因为相同羽毛的鸟聚在一块”。[ZH]
9.wav|999|[ZH]他们聚在一块追逐消遣、娱乐。[ZH]
10.wav|999|[ZH]他们以放纵感官的欢娱、极尽声色的享受开始,[ZH]
11.wav|999|[ZH]以荒唐、无度而告终。[ZH]
12.wav|999|[ZH]众多刚刚踏入生活的纨绔子弟穷奢极欲,[ZH]
13.wav|999|[ZH]在令人难以置信的极短时间内就把大部分家财挥霍殆尽。[ZH]
14.wav|999|[ZH]这种做派,其根源确实不是别的,正是无聊[ZH]
18.wav|999|[ZH]它源自上述的精神贫乏和空虚。[ZH]
16.wav|999|[ZH]一个外在富有、但内在贫乏的富家子弟来到这个世界,[ZH]
17.wav|999|[ZH]会徒劳地用外在的财富去补偿内在的不足;[ZH]
18.wav|999|[ZH]他渴望从外部得到一切,[ZH]
19.wav|999|[ZH]这情形就好比试图以少女的汗水去强健自己体魄的老朽之人。[ZH]
20.wav|999|[ZH]人自身内在的贫乏由此导致了外在财富的贫乏。[ZH]
21.wav|999|[ZH]至于另外两项人生好处的重要性,[ZH]
22.wav|999|[ZH]不需要我特别强调。[ZH]
23.wav|999|[ZH]财产的价值在当今是人所公认的,[ZH]
24.wav|999|[ZH]用不着为其宣传介绍。[ZH]
25.wav|999|[ZH]比起第二项的好处,[ZH]
26.wav|999|[ZH]第三项的好处具有一种相当飘渺的成分,[ZH]
27.wav|999|[ZH]因为名誉、名望、地位等[ZH]
28.wav|999|[ZH]全由他人的意见构成。[ZH]
29.wav|999|[ZH]每人都可以争取得到名誉,[ZH]
30.wav|999|[ZH]亦即清白的名声;[ZH]
31.wav|999|[ZH]但社会地位,则只有月盼国家政府的人才能染指;[ZH]
32.wav|999|[ZH]至于显赫的名望就只有极少数人才会得到。[ZH]
33.wav|999|[ZH]在所有这些当中,[ZH]
34.wav|999|[ZH]名誉是弥足珍贵的;[ZH]
35.wav|999|[ZH]显赫的名望则是人所希望得到的价值至昂的东西,[ZH]
36.wav|999|[ZH]那是天之骄子才能得到的金羊毛。[ZH]
37.wav|999|[ZH]另一方面,[ZH]
38.wav|999|[ZH]只有傻瓜才会把社会地位放置在财产之前。[ZH]
39.wav|999|[ZH]另外,人拥有的财产、物品和名誉、声望,[ZH]
40.wav|999|[ZH]是处于一种所谓的互为影响、促进的关系。[ZH]
41.wav|999|[ZH]彼得尼斯说过:“一个人所拥有的财产决定了这个人在他人眼中的价值”。[ZH]
42.wav|999|[ZH]如果这句话是正确的话,[ZH]
43.wav|999|[ZH]那么,反过来,他人对自己的良好评价,[ZH]
44.wav|999|[ZH]能以各种形式帮助自己获取财产。[ZH]
-45
@@ -1,45 +0,0 @@
0.wav|999|swo↓↑i↓↑, ɹ`ən↑ t⁼ə neɪ↓ts⁼aɪ↓ jʊŋ→joʊ↓↑ t⁼weɪ↓ɥ↑ ɹ`ən↑ t⁼ə ʃiŋ↓fu↑ tsʰaɪ↑ s`ɹ`↓ ts⁼weɪ↓ k⁼wan→tʃ⁼jɛn↓ t⁼ə.
1.wav|999|ts`⁼əŋ↓ in→weɪ↓ ts⁼aɪ↓ t⁼a↓t⁼wo→s`u↓ tʃʰiŋ↑ʃiŋ↑ ʃja↓ɹ`ən↑ t⁼ə ts⁼ɹ↓s`ən→ neɪ↓ts⁼aɪ↓ ʃiɑŋ→t⁼ɑŋ→ pʰin↑fa↑,
2.wav|999|swo↓↑i↓↑, na↓ʃiɛ→ ts⁼aɪ↓iɛ↓↑ jʊŋ↓p⁼u↓ts`⁼ə ɥ↓↑ s`əŋ→xwo↑ t⁼ə kʰweɪ↓fa↑ ts⁼wo↓ t⁼oʊ↓ts`⁼əŋ→ t⁼ə ɹ`ən↑,
3.wav|999|tʰa→mən ts`⁼ɹ`→ts`⁼ʊŋ→ t⁼ə t⁼a↓t⁼wo→s`u↓ tsʰʊŋ↑k⁼ən→p⁼ən↓↑s`ɑŋ↓ xaɪ↑s`ɹ`↓ k⁼an↓↑tʃ⁼ɥɛ↑ mən↓mən↓p⁼u↓lə↓.
4.wav|999|tʃʰiŋ↑ʃiŋ↑ tʃ⁼joʊ↓ k⁼ən→ na↓ʃiɛ→ xaɪ↑ ts⁼aɪ↓ s`əŋ→xwo↑ t⁼ə kʰwən↓kʰu↓↑ ts`⁼ʊŋ→ p⁼wo↑t⁼oʊ↓ t⁼ə ɹ`ən↑ i↓p⁼an→ u↑i↓.
5.wav|999|tʰa→mən neɪ↓ts⁼aɪ↓ kʰʊŋ→ʃɥ→, k⁼an↓↑tʃ⁼ɥɛ↑ i↓s`ɹ`↑ t⁼aɪ→ts`⁼ɹ`↓, sɹ→ʃiɑŋ↓↑ kʰweɪ↓fa↑,
6.wav|999|ts`⁼ə↓ʃiɛ→ tʃ⁼joʊ↓ tʃʰɥ→s`ɹ`↓↑ tʰa→mən tʰoʊ↑ɹ`u↓ s`ə↓tʃ⁼iɑʊ→ ɹ`ən↑tʃʰɥn↑ ts`⁼ʊŋ→.
7.wav|999|ts⁼u↓↑ts`ʰəŋ↑ na↓ʃiɛ→ s`ə↓tʃ⁼iɑʊ→tʃʰɥæn→ts⁼ɹ t⁼ə ɹ`ən↑ iɛ↓↑ ts`⁼əŋ↓s`ɹ`↓ tʰa→mən ts`⁼ə↓ i→leɪ↓ t⁼ə ɹ`ən↑.
8.wav|999|“ in→weɪ↓ ʃiɑŋ→tʰʊŋ↑ ɥ↓↑mɑʊ↑ t⁼ə niɑʊ↓↑ tʃ⁼ɥ↓ ts⁼aɪ↓ i→kʰwaɪ↓”.
9.wav|999|tʰa→mən tʃ⁼ɥ↓ts⁼aɪ↓ i→kʰwaɪ↓ ts`⁼weɪ→ts`⁼u↑ ʃiɑʊ→tʃʰjɛn↓↑, ɥ↑lə↓.
10.wav|999|tʰa→mən i↓↑ fɑŋ↓ts⁼ʊŋ↓ k⁼an↓↑k⁼wan→ t⁼ə xwan→ɥ↑, tʃ⁼i↑tʃ⁼in↓↑ s`əŋ→sə↓ t⁼ə ʃiɑŋ↓↑s`oʊ↓ kʰaɪ→s`ɹ`↓↑,
11.wav|999|i↓↑ xuɑŋ→tʰɑŋ↑, u↑t⁼u↓ əɹ`↑ k⁼ɑʊ↓ts`⁼ʊŋ→.
12.wav|999|ts`⁼ʊŋ↓t⁼wo→ k⁼ɑŋ→k⁼ɑŋ→ tʰa↓ɹ`u↓ s`əŋ→xwo↑ t⁼ə wan↑kʰu↓ts⁼ɹ↓↑t⁼i↓ tʃʰjʊŋ↑s`ə→tʃ⁼i↑ɥ↓,
13.wav|999|ts⁼aɪ↓ liŋ↓ɹ`ən↑ nan↑i↓↑ts`⁼ɹ`↓ʃin↓ t⁼ə tʃ⁼i↑ t⁼wan↓↑s`ɹ`↑tʃ⁼jɛn→ neɪ↓ tʃ⁼joʊ↓ p⁼a↓↑ t⁼a↓p⁼u↓fən↓ tʃ⁼ja→tsʰaɪ↑ xweɪ→xwo↓ t⁼aɪ↓tʃ⁼in↓.
14.wav|999|ts`⁼ə↓ts`⁼ʊŋ↓↑ ts⁼wo↓pʰaɪ↓, tʃʰi↑ k⁼ən→ɥæn↑ tʃʰɥɛ↓s`ɹ`↑ p⁼u↑s`ɹ`↓ p⁼iɛ↑t⁼ə, ts`⁼əŋ↓s`ɹ`↓ u↑liɑʊ↑.
18.wav|999|tʰa→ ɥæn↑ts⁼ɹ↓ s`ɑŋ↓s`u↓ t⁼ə tʃ⁼iŋ→s`ən↑ pʰin↑fa↑ xə↑ kʰʊŋ→ʃɥ→.
16.wav|999|i↑k⁼ə↓ waɪ↓ ts⁼aɪ↓ fu↓joʊ↓↑, t⁼an↓ neɪ↓ts⁼aɪ↓ pʰin↑fa↑ t⁼ə fu↓tʃ⁼ja→ts⁼ɹ↓↑t⁼i↓ laɪ↑t⁼ɑʊ↓ ts`⁼ə↓k⁼ə↓ s`ɹ`↓tʃ⁼iɛ↓,
17.wav|999|xweɪ↓ tʰu↑lɑʊ↑t⁼i↓ jʊŋ↓waɪ↓ ts⁼aɪ↓ t⁼ə tsʰaɪ↑fu↓ tʃʰɥ↓ p⁼u↓↑ts`ʰɑŋ↑ neɪ↓ts⁼aɪ↓ t⁼ə p⁼u↓ts⁼u↑,
18.wav|999|tʰa→ kʰə↓↑uɑŋ↓ tsʰʊŋ↑ waɪ↓p⁼u↓ t⁼ə↑t⁼ɑʊ↓ i→tʃʰiɛ↓,
19.wav|999|ts`⁼ə↓ tʃʰiŋ↑ʃiŋ↑ tʃ⁼joʊ↓ xɑʊ↓↑p⁼i↓↑ s`ɹ`↓tʰu↑ i↓↑ s`ɑʊ↓nɥ↓↑ t⁼ə xan↓s`weɪ↓↑ tʃʰɥ↓ tʃʰiɑŋ↑tʃ⁼jɛn↓ ts⁼ɹ↓tʃ⁼i↓↑ tʰi↓↑pʰwo↓ t⁼ə lɑʊ↓↑ʃjoʊ↓↑ ts`⁼ɹ`→ ɹ`ən↑.
20.wav|999|ɹ`ən↑ ts⁼ɹ↓s`ən→ neɪ↓ts⁼aɪ↓ t⁼ə pʰin↑fa↑ joʊ↑tsʰɹ↓↑ t⁼ɑʊ↓↑ts`⁼ɹ`↓ lə waɪ↓ ts⁼aɪ↓ tsʰaɪ↑fu↓ t⁼ə pʰin↑fa↑.
21.wav|999|ts`⁼ɹ`↓ɥ↑ liŋ↓waɪ↓ liɑŋ↓↑ʃiɑŋ↓ ɹ`ən↑s`əŋ→ xɑʊ↓↑ts`ʰu↓ t⁼ə ts`⁼ʊŋ↓iɑʊ↓ʃiŋ↓,
22.wav|999|p⁼u↓ ʃɥ→iɑʊ↓ wo↓↑ tʰə↓p⁼iɛ↑tʃʰiɑŋ↑t⁼iɑʊ↓.
23.wav|999|tsʰaɪ↑ts`ʰan↓↑ t⁼ə tʃ⁼ja↓ts`⁼ɹ`↑ ts⁼aɪ↓ t⁼ɑŋ→tʃ⁼in→ s`ɹ`↓ ɹ`ən↑ swo↓↑ k⁼ʊŋ→ɹ`ən↓ t⁼ə,
24.wav|999|jʊŋ↓p⁼u↓ts`⁼ə weɪ↓ tʃʰi↑ ʃɥæn→ts`ʰwan↑ tʃ⁼iɛ↓s`ɑʊ↓.
25.wav|999|p⁼i↓↑tʃʰi↓↑ t⁼i↓əɹ`↓ʃiɑŋ↓ t⁼ə xɑʊ↓↑ts`ʰu↓,
26.wav|999|t⁼i↓san→ʃiɑŋ↓ t⁼ə xɑʊ↓↑ts`ʰu↓ tʃ⁼ɥ↓joʊ↓↑ i→ts`⁼ʊŋ↓↑ ʃiɑŋ→t⁼ɑŋ→ pʰiɑʊ→miɑʊ↓↑ t⁼ə ts`ʰəŋ↑fən↓,
27.wav|999|in→weɪ↓ miŋ↑ɥ↓, miŋ↑uɑŋ↓, t⁼i↓weɪ↓ t⁼əŋ↓↑.
28.wav|999|tʃʰɥæn↑ joʊ↑ tʰa→ɹ`ən↑ t⁼ə i↓tʃ⁼jɛn↓ k⁼oʊ↓ts`ʰəŋ↑.
29.wav|999|meɪ↓↑ɹ`ən↑ t⁼oʊ→ kʰə↓↑i↓↑ ts`⁼əŋ→tʃʰɥ↓↑ t⁼ə↑t⁼ɑʊ↓ miŋ↑ɥ↓,
30.wav|999|i↓ tʃ⁼i↑ tʃʰiŋ→p⁼aɪ↑ t⁼ə miŋ↑s`əŋ→,
31.wav|999|t⁼an↓ s`ə↓xweɪ↓ t⁼i↓weɪ↓, ts⁼ə↑ ts`⁼ɹ`↓↑joʊ↓↑ ɥɛ↓ pʰan↓ k⁼wo↑tʃ⁼ja→ ts`⁼əŋ↓fu↓↑ t⁼ə ɹ`ən↑tsʰaɪ↑ nəŋ↑ ɹ`an↓↑ts`⁼ɹ`↓↑,
32.wav|999|ts`⁼ɹ`↓ɥ↑ ʃjɛn↓↑xə↓ t⁼ə miŋ↑uɑŋ↓ tʃ⁼joʊ↓ ts`⁼ɹ`↓↑joʊ↓↑ tʃ⁼i↑s`ɑʊ↓↑s`u↓ ɹ`ən↑tsʰaɪ↑ xweɪ↓ t⁼ə↑t⁼ɑʊ↓.
33.wav|999|ts⁼aɪ↓ swo↓↑joʊ↓↑ ts`⁼ə↓ʃiɛ→ t⁼ɑŋ→ts`⁼ʊŋ→,
34.wav|999|miŋ↑ɥ↓ s`ɹ`↓ mi↑ts⁼u↑ts`⁼ən→k⁼weɪ↓ t⁼ə,
35.wav|999|ʃjɛn↓↑xə↓ t⁼ə miŋ↑uɑŋ↓ ts⁼ə↑ s`ɹ`↓ ɹ`ən↑ swo↓↑ ʃi→uɑŋ↓ t⁼ə↑t⁼ɑʊ↓ t⁼ə tʃ⁼ja↓ts`⁼ɹ`↑ ts`⁼ɹ`↓ɑŋ↑ t⁼ə t⁼ʊŋ→ʃi→,
36.wav|999|na↓ s`ɹ`↓ tʰjɛn→ts`⁼ɹ`→tʃ⁼iɑʊ→ts⁼ɹ tsʰaɪ↑nəŋ↑ t⁼ə↑t⁼ɑʊ↓ t⁼ə tʃ⁼in→ iɑŋ↑mɑʊ↑.
37.wav|999|liŋ↓i↓fɑŋ→mjɛn↓,
38.wav|999|ts`⁼ɹ`↓↑joʊ↓↑ s`a↓↑k⁼wa→ tsʰaɪ↑ xweɪ↓ p⁼a↓↑ s`ə↓xweɪ↓ t⁼i↓weɪ↓ fɑŋ↓ts`⁼ɹ`↓ ts⁼aɪ↓ tsʰaɪ↑ts`ʰan↓↑ ts`⁼ɹ`→tʃʰjɛn↑.
39.wav|999|liŋ↓waɪ↓, ɹ`ən↑ jʊŋ→joʊ↓↑ t⁼ə tsʰaɪ↑ts`ʰan↓↑, u↓pʰin↓↑ xə↑ miŋ↑ɥ↓, s`əŋ→uɑŋ↓,
40.wav|999|s`ɹ`↓ ts`ʰu↓↑ɥ↑ i→ts`⁼ʊŋ↓↑ swo↓↑weɪ↓ t⁼ə xu↓weɪ↓ iŋ↓↑ʃiɑŋ↓↑, tsʰu↓tʃ⁼in↓ t⁼ə k⁼wan→ʃi↓.
41.wav|999|p⁼i↓↑t⁼ə↑ ni↑sɹ→ s`wo→ k⁼wo↓,“ i↑k⁼ə↓ ɹ`ən↑ swo↓↑ jʊŋ→joʊ↓↑ t⁼ə tsʰaɪ↑ts`ʰan↓↑ tʃ⁼ɥɛ↑t⁼iŋ↓ lə ts`⁼ə↓k⁼ə↓ ɹ`ən↑ ts⁼aɪ↓ tʰa→ɹ`ən↑ jɛn↓↑ts`⁼ʊŋ→ t⁼ə tʃ⁼ja↓ts`⁼ɹ`↑”.
42.wav|999|ɹ`u↑k⁼wo↓↑ ts`⁼ə↓tʃ⁼ɥ↓ xwa↓ s`ɹ`↓ ts`⁼əŋ↓tʃʰɥɛ↓ t⁼əxwa↓,
43.wav|999|na↓mə, fan↓↑k⁼wo↓laɪ↑, tʰa→ɹ`ən↑ t⁼weɪ↓ ts⁼ɹ↓tʃ⁼i↓↑ t⁼ə liɑŋ↑xɑʊ↓↑ pʰiŋ↑tʃ⁼ja↓,
44.wav|999|nəŋ↑i↓↑ k⁼ə↓ts`⁼ʊŋ↓↑ ʃiŋ↑s`ɹ`↓ p⁼ɑŋ→ts`⁼u↓ ts⁼ɹ↓tʃ⁼i↓↑ xwo↓tʃʰɥ↓↑ tsʰaɪ↑ts`ʰan↓↑.
-72
@@ -1,72 +0,0 @@
import numpy as np
import torch
import torchaudio
import gradio as gr
import os
anno_lines = []
with open("./user_voice/user_voice.txt", 'r', encoding='utf-8') as f:
for line in f.readlines():
anno_lines.append(line.strip("\n"))
text_index = 0
def display_text(index):
index = int(index)
global text_index
text_index = index
return f"{text_index}: " + anno_lines[index].split("|")[2].strip("[ZH]")
def display_prev_text():
global text_index
if text_index != 0:
text_index -= 1
return f"{text_index}: " + anno_lines[text_index].split("|")[2].strip("[ZH]")
def display_next_text():
global text_index
if text_index != len(anno_lines)-1:
text_index += 1
return f"{text_index}: " + anno_lines[text_index].split("|")[2].strip("[ZH]")
def save_audio(audio):
global text_index
if audio:
sr, wav = audio
wav = torch.tensor(wav).type(torch.float32) / max(wav.max(), -wav.min())
wav = wav.unsqueeze(0) if len(wav.shape) == 1 else wav
if sr != 22050:
res_wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=22050)(wav)
else:
res_wav = wav
torchaudio.save(f"./user_voice/{str(text_index)}.wav", res_wav, 22050, channels_first=True)
return f"Audio saved to ./user_voice/{str(text_index)}.wav successfully!"
else:
return "Error: Please record your audio!"
if __name__ == "__main__":
app = gr.Blocks()
with app:
with gr.Row():
text = gr.Textbox(value="0: " + anno_lines[0].split("|")[2].strip("[ZH]"), label="Please read the text here")
with gr.Row():
audio_to_collect = gr.Audio(source="microphone")
with gr.Row():
with gr.Column():
prev_btn = gr.Button(value="Previous")
with gr.Column():
next_btn = gr.Button(value="Next")
with gr.Row():
index_dropdown = gr.Dropdown(choices=[str(i) for i in range(len(anno_lines))], value="0",
label="No. of text", interactive=True)
with gr.Row():
with gr.Column():
save_btn = gr.Button(value="Save Audio")
with gr.Column():
audio_save_message = gr.Textbox(label="Message")
index_dropdown.change(display_text, inputs=index_dropdown, outputs=text)
prev_btn.click(display_prev_text, inputs=None, outputs=text)
next_btn.click(display_next_text, inputs=None, outputs=text)
save_btn.click(save_audio, inputs=audio_to_collect, outputs=audio_save_message)
app.launch()
+178 -10
@@ -8,6 +8,7 @@ import subprocess
import numpy as np
from scipy.io.wavfile import read
import torch
import regex as re
MATPLOTLIB_FLAG = False
@@ -15,7 +16,136 @@ logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logger = logging
def load_checkpoint(checkpoint_path, model, optimizer=None):
zh_pattern = re.compile(r'[\u4e00-\u9fa5]')
en_pattern = re.compile(r'[a-zA-Z]')
jp_pattern = re.compile(r'[\u3040-\u30ff\u31f0-\u31ff]')
kr_pattern = re.compile(r'[\uac00-\ud7af\u1100-\u11ff\u3130-\u318f\ua960-\ua97f]')
num_pattern=re.compile(r'[0-9]')
comma=r"(?<=[.。!??;;,、:'\"‘“”’()()《》「」~——])" #向前匹配但固定长度
tags={'ZH':'[ZH]','EN':'[EN]','JP':'[JA]','KR':'[KR]'}
def tag_cjke(text):
'''Tag Chinese/English/Japanese/Korean text. Chinese and Japanese cannot be told apart by regex alone, so split into sentences first, separate Chinese from Japanese per sentence, then tag; this covers most cases.'''
sentences = re.split(r"([.。!??;;,、:'\"‘“”’()()【】《》「」~——]+ *(?![0-9]))", text) # split into sentences, excluding decimal points
sentences.append("")
sentences = ["".join(i) for i in zip(sentences[0::2],sentences[1::2])]
# print(sentences)
prev_lang=None
tagged_text = ""
for s in sentences:
# skip sentences that are entirely punctuation
nu = re.sub(r'[\s\p{P}]+', '', s, flags=re.U).strip()
if len(nu)==0:
continue
s = re.sub(r'[()()《》「」【】‘“”’]+', '', s)
jp=re.findall(jp_pattern, s)
# if this sentence contains Japanese characters, treat it as Japanese
if len(jp)>0:
prev_lang,tagged_jke=tag_jke(s,prev_lang)
tagged_text +=tagged_jke
else:
prev_lang,tagged_cke=tag_cke(s,prev_lang)
tagged_text +=tagged_cke
return tagged_text
def tag_jke(text,prev_sentence=None):
'''Tag English/Japanese/Korean text'''
# initialize tagging state
tagged_text = ""
prev_lang = None
tagged=0
# walk through the text
for char in text:
# determine which language the current character belongs to
if jp_pattern.match(char):
lang = "JP"
elif zh_pattern.match(char):
lang = "JP"
elif kr_pattern.match(char):
lang = "KR"
elif en_pattern.match(char):
lang = "EN"
# elif num_pattern.match(char):
# lang = prev_sentence
else:
lang = None
tagged_text += char
continue
# if the current language differs from the previous one, insert tags
if lang != prev_lang:
tagged=1
if prev_lang==None: # start of the sentence
tagged_text =tags[lang]+tagged_text
else:
tagged_text =tagged_text+tags[prev_lang]+tags[lang]
# update the previous-language marker
prev_lang = lang
# append the current character to the tagged text
tagged_text += char
# close the tag for the last language
if prev_lang:
tagged_text += tags[prev_lang]
if not tagged:
prev_lang=prev_sentence
tagged_text =tags[prev_lang]+tagged_text+tags[prev_lang]
return prev_lang,tagged_text
def tag_cke(text,prev_sentence=None):
'''Tag Chinese/English/Korean text'''
# initialize tagging state
tagged_text = ""
prev_lang = None
# whether everything was skipped without tagging
tagged=0
# walk through the text
for char in text:
# determine which language the current character belongs to
if zh_pattern.match(char):
lang = "ZH"
elif kr_pattern.match(char):
lang = "KR"
elif en_pattern.match(char):
lang = "EN"
# elif num_pattern.match(char):
# lang = prev_sentence
else:
# skip any other character
lang = None
tagged_text += char
continue
# if the current language differs from the previous one, insert tags
if lang != prev_lang:
tagged=1
if prev_lang==None: # start of the sentence
tagged_text =tags[lang]+tagged_text
else:
tagged_text =tagged_text+tags[prev_lang]+tags[lang]
# update the previous-language marker
prev_lang = lang
# append the current character to the tagged text
tagged_text += char
# close the tag for the last language
if prev_lang:
tagged_text += tags[prev_lang]
# if nothing was tagged, inherit the previous sentence's tag
if tagged==0:
prev_lang=prev_sentence
tagged_text =tags[prev_lang]+tagged_text+tags[prev_lang]
return prev_lang,tagged_text
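# Illustration of the tagging flow above (traced by hand on a made-up input): a mixed
# Chinese/English string with no kana is routed through tag_cke, e.g.:
# print(tag_cjke("你好hello"))  # -> "[ZH]你好[ZH][EN]hello[EN]"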
def load_checkpoint(checkpoint_path, model, optimizer=None, drop_speaker_emb=False):
assert os.path.isfile(checkpoint_path)
checkpoint_dict = torch.load(checkpoint_path, map_location='cpu')
iteration = checkpoint_dict['iteration']
@@ -31,8 +161,10 @@ def load_checkpoint(checkpoint_path, model, optimizer=None):
for k, v in state_dict.items():
try:
if k == 'emb_g.weight':
if drop_speaker_emb:
new_state_dict[k] = v
continue
v[:saved_state_dict[k].shape[0], :] = saved_state_dict[k]
# v[999, :] = saved_state_dict[k][154, :]
new_state_dict[k] = v
else:
new_state_dict[k] = saved_state_dict[k]
@@ -72,14 +204,29 @@ def summarize(writer, global_step, scalars={}, histograms={}, images={}, audios=
writer.add_audio(k, v, global_step, audio_sampling_rate)
def latest_checkpoint_path(dir_path, regex="G_*.pth"):
def extract_digits(f):
digits = "".join(filter(str.isdigit, f))
return int(digits) if digits else -1
def latest_checkpoint_path(dir_path, regex="G_[0-9]*.pth"):
f_list = glob.glob(os.path.join(dir_path, regex))
f_list.sort(key=lambda f: int("".join(filter(str.isdigit, f))))
f_list.sort(key=lambda f: extract_digits(f))
x = f_list[-1]
print(x)
print(f"latest_checkpoint_path:{x}")
return x
def oldest_checkpoint_path(dir_path, regex="G_[0-9]*.pth", preserved=4):
f_list = glob.glob(os.path.join(dir_path, regex))
f_list.sort(key=lambda f: extract_digits(f))
if len(f_list) > preserved:
x = f_list[0]
print(f"oldest_checkpoint_path:{x}")
return x
return ""
def plot_spectrogram_to_numpy(spectrogram):
global MATPLOTLIB_FLAG
if not MATPLOTLIB_FLAG:
@@ -146,14 +293,31 @@ def load_filepaths_and_text(filename, split="|"):
return filepaths_and_text
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ('yes', 'true', 't', 'y', '1'):
return True
elif v.lower() in ('no', 'false', 'f', 'n', '0'):
return False
else:
raise argparse.ArgumentTypeError('Boolean value expected.')
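# Examples of accepted values (per the branches above):
# str2bool("yes") -> True, str2bool("0") -> False, str2bool("maybe") -> ArgumentTypeError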
def get_hparams(init=True):
parser = argparse.ArgumentParser()
parser.add_argument('-c', '--config', type=str, default="./configs/finetune_speaker.json",
parser.add_argument('-c', '--config', type=str, default="./configs/modified_finetune_speaker.json",
help='JSON file for configuration')
parser.add_argument('-m', '--model', type=str, default="pretrained_models",
help='Model name')
parser.add_argument('-n', '--n_steps', type=int, default="2000",
help='finetune steps')
parser.add_argument('-n', '--max_epochs', type=int, default=50,
help='finetune epochs')
parser.add_argument('--cont', type=str2bool, default=False, help='whether to continue training on the latest checkpoint')
parser.add_argument('--drop_speaker_embed', type=str2bool, default=False, help='whether to drop existing characters')
parser.add_argument('--train_with_pretrained_model', type=str2bool, default=True,
help='whether to train with pretrained model')
parser.add_argument('--preserved', type=int, default=4,
help='Number of preserved models')
args = parser.parse_args()
model_dir = os.path.join("./", args.model)
@@ -175,7 +339,11 @@ def get_hparams(init=True):
hparams = HParams(**config)
hparams.model_dir = model_dir
hparams.n_steps = args.n_steps
hparams.max_epochs = args.max_epochs
hparams.cont = args.cont
hparams.drop_speaker_embed = args.drop_speaker_embed
hparams.train_with_pretrained_model = args.train_with_pretrained_model
hparams.preserved = args.preserved
return hparams
@@ -227,7 +395,7 @@ def get_logger(model_dir, filename="train.log"):
formatter = logging.Formatter("%(asctime)s\t%(name)s\t%(levelname)s\t%(message)s")
if not os.path.exists(model_dir):
os.makedirs(model_dir)
h = logging.FileHandler(os.path.join(model_dir, filename))
h = logging.FileHandler(os.path.join(model_dir, filename),encoding="utf-8")
h.setLevel(logging.DEBUG)
h.setFormatter(formatter)
logger.addHandler(h)