There are a lot of face datasets, but most come with restrictions. A great place to find images is Pexels, as they are free to use (see license here).
Also, the Python library pexels-api makes it easy to download a lot of images. It can be installed with the following command.
pip install pexels-api
To use the Pexels API you need to register.
Then you can download images for a search query with this Python program.
from pexels_api import API
import requests
import os.path
from pathlib import Path

path = 'pics'
Path(path).mkdir(parents=True, exist_ok=True)

# To get key: sign up for pexels https://www.pexels.com/join/
# Request key: https://www.pexels.com/api/
# - No need to set URL
# - Accept email sent to you
# - Refresh API or see key here: https://www.pexels.com/api/new/
PEXELS_API_KEY = '--- INSERT YOUR API KEY HERE ---'

api = API(PEXELS_API_KEY)
query = 'person'
api.search(query)

print("Search: ", query)
print("Total results: ", api.total_results)

MAX_PICS = 1000
print("Fetching max: ", MAX_PICS)

count = 0
while True:
    # Get photo entries for the current page
    photos = api.get_entries()
    print(len(photos))
    if len(photos) == 0:
        break
    for photo in photos:
        # Print photographer
        print('Photographer: ', photo.photographer)
        # Print original size url
        print('Photo original size: ', photo.original)
        file = os.path.join(path, query + '-' + str(count).zfill(5) + '.' + photo.original.split('.')[-1])
        count += 1
        print(file)
        picture_request = requests.get(photo.original)
        if picture_request.status_code == 200:
            with open(file, 'wb') as f:
                f.write(picture_request.content)
        # This should be a function call to make a return
        if count >= MAX_PICS:
            break
    if count >= MAX_PICS:
        break
    if not api.has_next_page:
        print("Last page: ", api.page)
        break
    # Search next page
    api.search_next_page()
The above Python program has an upper limit of 1,000 photos, which you can change if you like. It is set to download the photos returned for the query person. Feel free to change that too.
It takes some time to download all the images and will take up some space.
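The download program takes the file extension from whatever follows the last dot in the photo URL, which gives an odd extension if the URL carries a query string. A minimal sketch of a more defensive file name builder follows. The helper name target_file, the example URL in the usage note, and the fallback to .jpg are assumptions made for illustration, not part of the original program.

```python
import os.path
from urllib.parse import urlparse


def target_file(folder, query, index, url):
    """Build a name like <folder>/<query>-00007.jpeg for a photo URL."""
    # Look only at the path part of the URL, so "?auto=compress" is ignored
    ext = os.path.splitext(urlparse(url).path)[1]
    if not ext:
        # Assumption: fall back to .jpg when the URL has no extension
        ext = ".jpg"
    return os.path.join(folder, f"{query}-{index:05d}{ext}")
```

For example, target_file("pics", "person", 7, url) could replace the string concatenation inside the download loop, with url set to photo.original.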
This is where OpenCV comes in. It ships with face detection models trained with the Haar Cascade Classifier. You need to install the OpenCV library with the following command.
pip install opencv-python
The trained model we use is part of the library, but it is not easy to load from its install location. Therefore we suggest you download it from here (it should be named haarcascade_frontalface_default.xml) and add it to the folder you work from.
We want to use it to identify faces, extract them, and save them in a folder for later use.
import cv2
import numpy as np
import glob
import os
from pathlib import Path


def preprocess(box_width=12, box_height=16):
    path = "pics"
    output = "small-faces"
    Path(output).mkdir(parents=True, exist_ok=True)
    files = glob.glob(os.path.join(path, "*"))
    files.sort()

    face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

    images = []
    cnt = 0
    for filename in files:
        print("Processing...", filename)
        frame = cv2.imread(filename)
        if frame is None:
            # Skip files OpenCV cannot read as an image
            continue
        frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frame_gray = cv2.equalizeHist(frame_gray)
        faces = face_cascade.detectMultiScale(frame_gray, scaleFactor=1.3, minNeighbors=10, minSize=(350, 350), flags=cv2.CASCADE_SCALE_IMAGE)
        for (x, y, w, h) in faces:
            # Cut out the detected face and shrink it to the box size
            roi = frame[y:y+h, x:x+w]
            img = cv2.resize(roi, (box_width, box_height))
            images.append(img)
            output_file_name = "face-" + str(cnt).zfill(5) + ".jpg"
            output_file_name = os.path.join(output, output_file_name)
            cv2.imwrite(output_file_name, img)
            # Advance the counter so each face gets its own file
            cnt += 1
    return np.stack(images)


preprocess(box_width=12, box_height=16)
It will create a folder called small-faces with small images of the identified faces.
Note that the Haar Cascade Classifier is not perfect. It will miss a lot of faces and produce false positives. It is a good idea to look through all the images manually and delete the false positives (images that do not contain a face).
The approach is to divide the photo into equal-sized boxes and, for each box, to find the image (one of our faces) that fits best as a replacement.
To improve the performance of the process function we use Numba, a just-in-time compiler designed to optimize NumPy code in for-loops.
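The matching idea can be sketched in plain NumPy, without Numba, as a sum of absolute differences between a box and every face tile. This is only an illustration under stated assumptions: the names best_match and mosaic are made up here, images is assumed to have shape (n, box_height, box_width, 3), and the pixel data is assumed to be uint8 as returned by OpenCV.

```python
import numpy as np


def best_match(roi, images):
    """Return the index of the tile with the smallest sum of absolute differences."""
    # Cast to a signed type first: subtracting uint8 arrays would wrap around
    diff = np.abs(images.astype(np.int32) - roi.astype(np.int32))
    return int(np.argmin(diff.sum(axis=(1, 2, 3))))


def mosaic(photo, images):
    """Replace every full box in the photo with its best matching tile."""
    box_height, box_width = images.shape[1:3]
    height, width = photo.shape[:2]
    out = photo.copy()
    # Partial boxes at the right and bottom edge are left unchanged
    for i in range(0, height - box_height + 1, box_height):
        for j in range(0, width - box_width + 1, box_width):
            k = best_match(photo[i:i + box_height, j:j + box_width], images)
            out[i:i + box_height, j:j + box_width] = images[k]
    return out
```

The cast to int32 is a design choice that keeps the subtraction simple. Comparing the two uint8 arrays and subtracting the smaller from the larger is an alternative that avoids the wrap-around without a cast.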
import cv2
import numpy as np
import glob
import os
from numba import jit


def load_images(box_width, box_height):
    # Not shown in the original listing. This version assumes the faces
    # were saved to the small-faces folder by the preprocess step.
    files = glob.glob(os.path.join("small-faces", "*"))
    files.sort()
    images = []
    for filename in files:
        img = cv2.imread(filename)
        if img is None:
            continue
        images.append(cv2.resize(img, (box_width, box_height)))
    return np.stack(images)


@jit(nopython=True)
def process(photo, images, box_width=24, box_height=32):
    height, width, _ = photo.shape
    for i in range(0, height, box_height):
        for j in range(0, width, box_width):
            roi = photo[i:i + box_height, j:j + box_width]
            best_match = np.inf
            best_match_index = 0
            # Compare the box with every face image, including index 0
            for k in range(images.shape[0]):
                # Absolute difference without unsigned wrap-around
                total_sum = np.sum(np.where(roi > images[k], roi - images[k], images[k] - roi))
                if total_sum < best_match:
                    best_match = total_sum
                    best_match_index = k
            photo[i:i + box_height, j:j + box_width] = images[best_match_index]
    return photo


def main():
    photo = cv2.imread("rune.jpg")
    box_width = 12
    box_height = 16
    height, width, _ = photo.shape
    # Make sure that we can slice the photo into whole boxes
    width = (width//box_width) * box_width
    height = (height//box_height) * box_height
    photo = cv2.resize(photo, (width, height))
    # Load all the images of the faces
    images = load_images(box_width, box_height)
    # Create the mosaic
    mosaic = process(photo.copy(), images, box_width, box_height)
    cv2.imshow("Original", photo)
    cv2.imshow("Result", mosaic)
    cv2.waitKey(0)


main()
To test it we have used the photo of Rune.
The code above is free to reuse the same images. This gives a decent result, but if you want to avoid the extreme patterns of repeated images, you can change the code to prevent reuse.
The above example has 606 small images. If you avoid reuse, it quickly runs out of images to choose from. This would require a bigger base of images, or the result becomes questionable.
The above photo mosaic is created at a downscaled size, but it still does not give a good result if you do not reuse images. That would require a considerably larger set of images to work from.
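One possible way to avoid reuse is to remember which face images have already been placed and skip them when picking the best match. The sketch below is an illustration only. The helper name pick_unused is hypothetical, scores is assumed to be the list of difference sums for one box (lower is better), and the fallback to reuse once every image is taken is an assumption about what you would want when the collection is too small.

```python
def pick_unused(scores, used):
    """Pick the index with the lowest score that is not in `used`."""
    candidates = [k for k in range(len(scores)) if k not in used]
    if not candidates:
        # Every image has been placed already, so allow reuse again
        candidates = list(range(len(scores)))
    best = min(candidates, key=lambda k: scores[k])
    # Remember the choice so later boxes skip this image
    used.add(best)
    return best
```

The set used has to be created once per mosaic and passed to every call, so that the choice made for one box affects all later boxes.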
How would you recommend avoiding reuse? Remove the image from the collection after it's used?
That is a good question.
The usual way to avoid reusing images is to keep a list of hash values for the images.
It could be the md5 of each image, so that every image has its own md5.
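One reading of this suggestion is to keep a set of md5 digests for the images that have already been used and to check each candidate against it. The sketch below follows that reading. The function names md5_of and mark_if_new are made up for illustration, and the image is assumed to be available as raw bytes (for example the file contents, or the result of calling tobytes() on a NumPy array).

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the md5 digest of the image bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def mark_if_new(data: bytes, seen: set) -> bool:
    """Return True and record the digest if this image was not used before."""
    digest = md5_of(data)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

A side effect of hashing the content is that two files with identical pixels count as the same image, so exact duplicates in the collection are also filtered out.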