Otoing Tips
Otoing is a pain and takes a long time, even if you're quick at it. You won't be able to do it all in one sitting, and attempting that will probably only lead to misery. I usually take a week or two to do one pitch working for about an hour or two a day (I usually don't oto every day though). The CV_CVC, CVC_CV, and CV otos are the largest (I think in that order), and will take a few hours each to complete. The other files don't take as long, probably under an hour for each. Plan accordingly. Don't give up.
As usual, if you haven't already watched CZ's tutorial video on otoing, please do so. I also would recommend leaving the video open while otoing, so you can refer back as you need to while you oto.
CZ's otoing tutorial with SetParam: https://youtu.be/t_epqCMN7bQ
I'm not going to go into every different type of oto, because that would take far too long and would just repeat the work of others. I'll only cover some general things that may be useful to keep in mind when otoing VCCV. Anyone who's spent a lot of time otoing voicebanks will probably find this information redundant.
This is SetParam, which you should have seen in CZ's videos (at least) by now. Some people just use the UTAU program itself to oto. I haven't done that myself, but the principles should be the same. There are 5 parameters: the left blank (L), the overlap (Ovl), the preutterance (Pre), the consonant (Con), and the right blank (R). The oto consists of whatever falls between the left and right blanks, so use them to block out anything you don't want in the oto. In the above picture the oto is 'ngz', so the end of the vowel (1) is to the left of the left blank, and the consonant for the next word (k) is to the right of the right blank. Anything between the left blank and Con will be said once, and anything between Con and R (the right blank) will be looped. In this case, 'ngz' is an ending, so all we want is for 'ngz' to be said, and then silence to loop afterwards.
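For reference, these five parameters end up as plain numbers in the voicebank's oto.ini file. Here is a minimal sketch of reading one line, assuming the commonly documented layout `file.wav=alias,offset,consonant,cutoff,preutterance,overlap` with values in milliseconds. The names `OtoEntry` and `parse_oto_line` are my own, and the sample line is made up for illustration, not taken from a real voicebank.

```python
from dataclasses import dataclass


@dataclass
class OtoEntry:
    wav: str             # recording the entry points at
    alias: str           # name of the oto, e.g. "ngz"
    offset: float        # left blank (L)
    consonant: float     # Con
    cutoff: float        # right blank (R)
    preutterance: float  # Pre
    overlap: float       # Ovl


def parse_oto_line(line: str) -> OtoEntry:
    """Split one oto.ini line into its alias and five numeric parameters."""
    wav, _, rest = line.strip().partition("=")
    fields = rest.split(",")
    if len(fields) != 6:
        raise ValueError(f"expected alias plus 5 values, got: {line!r}")
    alias = fields[0]
    # Treat an empty field as 0 (an assumption of this sketch).
    numbers = [float(f) if f.strip() else 0.0 for f in fields[1:]]
    return OtoEntry(wav, alias, *numbers)


# Made-up example line.
entry = parse_oto_line("sample.wav=ngz,1200,150,-400,60,30")
print(entry.alias, entry.preutterance, entry.overlap)
```

Note the order in the file is not the same as the left-to-right order you see in SetParam, which is why it helps to name the fields.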
To be honest, I don't have an all-encompassing understanding of how the overlap and preutterance work. All I know is that this region is what gets crossfaded into the previous envelope. UTAU does some calculations to figure out how much overlap there should be, and how those calculations work is pretty opaque. Just follow what CZ says about keeping the overlap value at half of the preutterance; this basically games UTAU into always crossfading the current sound at the end of the last one. If you really want to know more, you can check this out. Basically, follow CZ's recommendations on where to place the preutterance.
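The "overlap is half of the preutterance" rule is simple enough to write down as arithmetic. This is only a sketch of that rule of thumb, not of what UTAU computes internally; the function names and the 1 ms tolerance are my own assumptions.

```python
def overlap_from_preutterance(preutterance_ms: float) -> float:
    """Return the overlap suggested by the half-of-preutterance rule."""
    return preutterance_ms / 2


def follows_half_rule(preutterance_ms: float, overlap_ms: float,
                      tolerance_ms: float = 1.0) -> bool:
    """True if the overlap is within tolerance of half the preutterance."""
    target = overlap_from_preutterance(preutterance_ms)
    return abs(overlap_ms - target) <= tolerance_ms


# A preutterance of 60 ms suggests an overlap of 30 ms.
print(overlap_from_preutterance(60))
# An overlap of 45 ms against a 60 ms preutterance breaks the rule.
print(follows_half_rule(60, 45))
```

Something like `follows_half_rule` could be handy for spot-checking a finished oto for entries where the overlap drifted while you were dragging things around.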
The Consonants
This may be unnecessary, but I found it helpful to know what the various consonants look like.
I won't show images of all of the consonants, but there are some special ones that you should be aware of in order to oto correctly. CZ also shows images of all of the consonants at the end of her video. I tend to sort them into types and think of them that way. There are roughly 4 types whose members look similar to each other; I'll say which is which when I get to them. All images come from otoing my own voicebank (so these are my pronunciations; other people's may look slightly different).
The first consonant you should know is 'k'. In the reclist, it's typically used to mark the start of the next word and to keep it distinct from the last consonant that was said, so it shows up in a great many recordings. The exception is when a word ends with a k: in that case the next word starts with a t (this is to prevent slurring the consonant in recording). If you happen to listen to your voicebank without any otoing, you'll probably hear your word endings being pronounced as 'k's. A 'k' is a short consonant typically showing about 2 to 3 spikes in the spectrum (the colorful part). It also generally has a wispy part after the spikes, similar to an 'h'.
There are a few other "short" consonants besides 'k', and they look pretty similar: b, d, t, and g. Their images are shown below in that order.
The next type is pretty easy to recognize. These are the "loud" consonants, which are basically large angry pockets of sound: ch, zh, j, s, z, sh, sk, etc. There are quite a few of these.
Next are the "wispy" consonants, which contain a lot of air. These are things like h, v, and f. P is also included, but 'p's tend to have an extra wavy waveform from air being blown into the microphone ('f's sometimes have this too). They're shown below in the order h, v, f, and p.
The last type can be the most difficult to see and oto. I tend to call them "zipper" consonants, because they generally look like the vowel is being "unzipped." These can be small, and sometimes they are completely indistinguishable from the vowels, so just do your best to approximate where they are. It's OK if you accidentally grab a part of the vowel. Vowels blend a little better than you might think, so unless you grab a ton, you probably won't be able to hear the vowel. Shown below in order are w, r, l, and y. Y's are the most obvious; the other 3 can be really small and hard to see. Use your best judgement.
To conclude the section on consonants, there's really only one more thing to say, and that is that 'n's and 'm's look identical. They pretty much always look like flat lines on the spectrum.
Sample Otos
Again, I'm not going to tell you how to oto every single type of file there is; CZ already does that in her tutorial. There are some common patterns to learn when otoing the different oto types, and once you learn those, otoing will become relatively quick. These are just a few examples of those patterns. Generally, just pay attention to what the alias is and set the parameters to match.
This pattern pertains to anything with an underscore. An example of this sort of alias would be _A or _ba. Basically, you want the part between the end of the consonant and the start of the vowel to sit between the left blank and the preutterance. These should be very small. This sort of oto is found in _CV and V.
I call these "floating" otos. They can happen right before the next-to-last word or at the end of the last word. Their aliases typically look like ab, ab-, bd, or bd-. This is a pretty common oto. Between L and Pre should be whatever is in the alias before the consonant (nothing if it's nothing, a vowel if it's a vowel), and between Pre and Con should be the consonant. Between Con and R should be silence. R should come right before the consonant of the next word (unless it's an ending like bd-, in which case there is no next word). This type is found in CC-, CV_CVC, VCC,
VCC has this oto, but there are some important things to pay attention to. For an alias like 1ngz, notice that there is a vowel in the alias, so part of the vowel should be included in the oto (left). For ngz, however, there is no vowel, so make sure the left blank is blocking out any part of the vowel (it's usually oversized in my experience) (right picture). The Pre in this case comes at the end of the first consonant and before the second; CZ goes over that. Like I said before, pay attention to what's in the alias: put that between the left blank and Pre, and put the "main consonant" after the Pre.
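If you want to sanity-check a long list of VCC or floating aliases, the "does the alias start with a vowel?" question is easy to script. This is just a sketch: `EXAMPLE_VOWELS` only holds the vowel symbols that show up in this guide's examples, so you'd need to swap in the full vowel list from your reclist, and the function names and hint wording are my own.

```python
# Assumption: only the vowel symbols used in this guide's examples.
# Replace with the full vowel list for your reclist.
EXAMPLE_VOWELS = frozenset({"1", "a", "A"})


def starts_with_vowel(alias: str, vowels=EXAMPLE_VOWELS) -> bool:
    """True if the first symbol of the alias is a vowel."""
    return alias[:1] in vowels


def left_blank_hint(alias: str, vowels=EXAMPLE_VOWELS) -> str:
    """Say what should sit right after the left blank for this alias."""
    if starts_with_vowel(alias, vowels):
        return f"{alias}: keep the end of the vowel after L"
    return f"{alias}: move L past the vowel so none of it is included"


for alias in ["1ngz", "ngz", "ab-", "bd-"]:
    print(left_blank_hint(alias))
```

It won't place anything for you, but it can flag which otos need the vowel blocked out before you start dragging the left blank around.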
That's all I'm going to say about otoing. If you'd like to take a look at tips for making a VCCV UST, you can click here.