A new study from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) demonstrates a novel method for training AI models using synthetic data generated by other AI models. The research shows that this approach, termed “model-generated data training,” can be surprisingly effective for certain tasks, particularly in natural language processing.

The team trained a large language model on a dataset entirely produced by a previous, smaller model. They found that the new model could match or even exceed the performance of models trained on human-generated data on specific benchmarks, while significantly reducing the reliance on vast, manually curated datasets.

This method could lower the cost and computational resources needed for AI development and help address data privacy concerns. However, the researchers caution that the technique works best for well-defined tasks and that performance can degrade if the synthetic data becomes too repetitive or loses fidelity over multiple generations.

The full study is available in the latest issue of Science Robotics. Read the full article at https://technologyreview.com/2024/05/15/1099876/ai-training-synthetic-data-mit.
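To make the idea concrete, here is a deliberately tiny sketch of the two-stage pipeline the researchers describe: one model (the "teacher") generates labeled synthetic examples, and a new model (the "student") is trained only on that synthetic set. This is a hypothetical illustration, not the authors' code; the rule-based generator and keyword-vote classifier below are toy stand-ins for real language models.

```python
# Hypothetical sketch of "model-generated data training".
# The generator stands in for the smaller "teacher" model;
# the keyword-vote model stands in for the new "student" model.
import random

def generate_synthetic_data(n, seed=0):
    """Teacher stand-in: emit n synthetic (text, label) pairs."""
    rng = random.Random(seed)
    positive = ["great product", "really love it", "works perfectly"]
    negative = ["terrible quality", "really hate it", "broke immediately"]
    data = []
    for _ in range(n):
        if rng.random() < 0.5:
            data.append((rng.choice(positive), 1))
        else:
            data.append((rng.choice(negative), 0))
    return data

def train_keyword_model(data):
    """Student stand-in: learn per-word votes from the synthetic set only."""
    scores = {}
    for text, label in data:
        for word in text.split():
            scores[word] = scores.get(word, 0) + (1 if label == 1 else -1)
    return scores

def predict(scores, text):
    """Classify by summing the learned votes of the words present."""
    total = sum(scores.get(word, 0) for word in text.split())
    return 1 if total >= 0 else 0

synthetic = generate_synthetic_data(200)   # no human-labeled data used
model = train_keyword_model(synthetic)
print(predict(model, "love this great product"))   # positive -> 1
print(predict(model, "terrible quality broke"))    # negative -> 0
```

The fidelity caveat in the study maps directly onto this sketch: if the generator's phrase list is too narrow or drifts over repeated generations, the student only ever sees those few patterns and its performance on real inputs degrades.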



