To prepare the 20M dataset, restructure the data pipeline to write examples directly to disk.
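
As a rough illustration of what writing examples directly to disk can look like, here is a minimal sketch that streams examples into sharded JSONL files instead of collecting them in memory first. The names `generate_examples` and `write_shards`, the JSONL format, and the shard naming scheme are all assumptions made for this example, not the pipeline's actual code.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator, List


def generate_examples(n: int) -> Iterator[dict]:
    """Hypothetical example source: yields one example at a time."""
    for i in range(n):
        yield {"id": i, "text": f"example {i}"}


def write_shards(examples: Iterable[dict], out_dir: Path, shard_size: int) -> List[Path]:
    """Stream examples into JSONL shards of at most `shard_size` lines each.

    Each example is serialized and written as soon as it is produced, so
    memory use does not grow with the size of the dataset.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    try:
        for index, example in enumerate(examples):
            # Start a new shard every `shard_size` examples.
            if index % shard_size == 0:
                if handle is not None:
                    handle.close()
                path = out_dir / f"shard-{len(paths):05d}.jsonl"
                paths.append(path)
                handle = path.open("w", encoding="utf-8")
            handle.write(json.dumps(example) + "\n")
    finally:
        # Close the last open shard even if the source raises midway.
        if handle is not None:
            handle.close()
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shards = write_shards(generate_examples(10), Path(tmp), shard_size=4)
        print([p.name for p in shards])
```

Because the source is consumed as an iterator, the same function would work for a much larger dataset as long as the generator itself does not materialize everything up front.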