There have been attempts to record very large databases of natural speech: either large amounts of data in the same basic style [14], or large amounts of data collected in different situations and thus in varied styles [15]. Such recordings are non-trivial tasks, and a substantial amount of acoustic normalization is required before they can reasonably be used for sub-word unit selection. If the databases are not appropriately normalized, either joins will be very obvious when units are selected from different parts of the database, or selection will be limited by differences in recording conditions, speaking style, etc.
Recording all styles naturally would take a very long time, while prescribing styles is also very hard. After one voice talent delivered a 120-utterance shouting database, it was hard for them to speak normally for the following two days.
Current unit selection techniques clearly work very well for limited styles, and for particular applications this may be sufficient. However, unit selection in its current state does not give us the flexibility we have in a human voice.