Summary

Data Sets

The data sets used for training generative AI models, especially those that produce images, usually contain content scraped from the internet. Much of the harvested data is copyrighted or is otherwise intellectual property that should be protected from copying.

When this data is used:

  • Explicit consent is usually not provided by artists.

  • Artists are not informed that their art is included.

  • Artists may not know how to prove that their intellectual property rights have been infringed by generative AI models.

  • These models may learn to directly replicate elements from their input data.

Bias

These data sets are also likely to contain various biases that exist within society. Training our models on this data may reinforce and perpetuate these biases unless they are carefully addressed during the development of the model, and possibly even if they are.

This is closely related to the week 9 content on algorithmic bias.