It's derived from the SEC 10-K dataset of Kogan, Levin, Routledge, Sagi, and Smith (2009): www.ark.cs.cmu.edu/10K/