I have this side project which, on Reddit, would be considered DataHoarding and, more altruistically, would be called digital archival preservation. In brief, it is downloading images from online catalogs to preserve them in the event the online store goes out of business, which would result in those images being lost forever. Pre-Internet, it would be the equivalent of saving and preserving print catalogs from stores, but no one prints catalogs anymore.
With that explanation out of the way: this is a project I have invested years into, and over those years I have developed software that assists me in my tasks. Initially, the software was a way to stitch large images together from partial tiles when a website wouldn’t let you download the full-resolution image. Then it expanded to automated downloading, with performance improvements along the way, and then the selection tools got better. But through all this, the process was still manual. I viewed every single image that was downloaded and decided whether to keep it or not. To give you an idea of how many images I have viewed: I download images in batches, usually a sequence 100,000 long. That yields maybe 80,000 images. After filtering and selection, I keep maybe 25,000. My collection now holds over 1.6 million images. That’s 1.6M images after a roughly 30% keep ratio. Math that out to figure out how many images I have seen.
To be fair, I’ve gotten extremely fast at selecting images. More accurately, I have gotten very fast at spotting images that need to be filtered out. I can view a page of 60 images and within a couple of seconds determine whether the whole page can be downloaded or whether it needs filtering. That comes from years of seeing the same images over and over; when something looks out of place, it fires a stop response.
But how fast could this process go if I could take the identification skill I have built up over years and turn it into code? Well, that day has arrived with ML.NET. I am attempting to add AI to my utility to do some of the easy selection for me and also some of the easy weeding. The middle ground I’ll still have to process myself, but having any help is welcome. After all, this is a one-man operation.
To begin, I modified my download utility to save the images I was filtering into two folders: Keep and Discard. After a short run, I ended up with about 12k images totaling 386MB, containing about twice as many discards as keeps. (I did say I was fast.) I used VisiPics to remove any duplicates and near-duplicates, so hopefully the AI will have more unique imagery to work with. Then I used the Model Builder tool to train the AI on what was a good image and what was not. I trained on my local machine, which reports itself as an “AMD Ryzen 5 1600 Six-Core Processor” with 16GB of RAM. The tool ran through all the images, did some processing, then ran through them again and did another computation, then a third time, and a fourth, a fifth, sixth, seventh, eighth. Each pass takes a little over half an hour, and the process consistently uses about 75% of my CPU. After the 3rd pass, I saw mentions of a 99% accuracy rate. If so, that’s amazing. I’m trying to make sense of the results each time, and there’s one number that seems to decrement each pass. I hope it isn’t a countdown to zero, because it’s at 116 after the 7th pass, and that would suggest I have two more full days of training ahead.
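For the curious: Model Builder drives all of this through a GUI, but under the hood it assembles an ML.NET image-classification pipeline along these lines. The sketch below is my own rough approximation rather than the generated code; the folder paths and class names are made up, and it assumes the ML.NET image/vision NuGet packages are referenced.

```csharp
// Rough sketch of an ML.NET image-classification training pipeline, in the
// spirit of what Model Builder generates. Paths and class names are assumed.
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;

public class ImageData
{
    public string ImagePath { get; set; }
    public string Label { get; set; }   // "Keep" or "Discard", taken from the folder name
}

class Program
{
    static void Main()
    {
        var mlContext = new MLContext();

        // Build the training set from the two hand-sorted folders.
        IEnumerable<ImageData> images = new[] { "Keep", "Discard" }
            .SelectMany(label => Directory
                .GetFiles(Path.Combine(@"C:\TrainingData", label))
                .Select(file => new ImageData { ImagePath = file, Label = label }));

        IDataView data = mlContext.Data.LoadFromEnumerable(images);

        // Key-encode the label, load the raw image bytes (absolute paths, so no
        // image folder is supplied), train, then map the predicted key back to
        // its "Keep"/"Discard" string.
        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
            .Append(mlContext.Transforms.LoadRawImageBytes("Image", null, "ImagePath"))
            .Append(mlContext.MulticlassClassification.Trainers.ImageClassification(
                labelColumnName: "Label", featureColumnName: "Image"))
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        ITransformer model = pipeline.Fit(data);
        mlContext.Model.Save(model, data.Schema, "KeepDiscardModel.zip");
    }
}
```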
As this process runs into its 4th hour, I’ve been reading all the stuff I should have read before starting. First, you should have an equal number of images in each category, whereas I have roughly twice as many Discards as Keeps. Second, it seems that 1,000 images per category should be sufficient, so having 12k total images is definitely working against me here. On the other hand, the extra variety should make my model broad enough to capture all the different images I could come across. Next time, I will use VisiPics to narrow down the similar items even further.
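Another idea for next time: even out the two categories in code instead of relying on my sorting ratio. A crude sketch, assuming the same Keep/Discard folder layout as the sketch above (paths again made up):

```csharp
// Crude class balancing: take only as many Discards as there are Keeps.
// Assumes the same C:\TrainingData\Keep and C:\TrainingData\Discard layout.
using System;
using System.IO;
using System.Linq;

string root = @"C:\TrainingData";
string[] keeps    = Directory.GetFiles(Path.Combine(root, "Keep"));
string[] discards = Directory.GetFiles(Path.Combine(root, "Discard"));

int perClass = Math.Min(keeps.Length, discards.Length);   // equal counts per category

// Balanced (path, label) pairs to feed into the training pipeline above.
var balanced = keeps.Take(perClass).Select(f => (ImagePath: f, Label: "Keep"))
    .Concat(discards.Take(perClass).Select(f => (ImagePath: f, Label: "Discard")))
    .ToList();

Console.WriteLine($"Training on {balanced.Count} images ({perClass} per category).");
```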
It finished in 5 hours with an accuracy of 99.63%. The model ended up being a 93MB ZIP file. Now to test it out. After a small struggle over which references to add to my test application, I set it up to run through my previously downloaded queue of images, copying the “Keeps” to one folder and the “Discards” to another. Then I opened both folders to watch the results and let it rip. How are the results? So far, perfection. Not a single miscategorized image. I’m not taking the confidence level into account, only the Keep/Discard determination.
Now, while its accuracy is astounding, its speed is quite underwhelming. This is cutting-edge tech, so maybe I’m asking too much, but it’s taking 4 seconds to process one image. The dependencies are also a little heavy: my little test app carries over 100MB in libraries, plus the 93MB model, for 222MB in total. CPU usage is 10-15% while processing an image. RAM usage is a bit more obscene, climbing to 2GB while an image is being processed. Granted, the memory is released right away, but that’s quite a bite for each image. Either way, 4 seconds per image is simply not doable.
I examined the generated code and found that I was calling a method that loaded the model and created the prediction engine on every call. So I moved that code into my main program and organized it so the model is loaded only once. Holy crap, what a speed difference. It’s about 160ms to process an image once the model is loaded and ready to go. This is now absolutely doable.
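The fix looks roughly like the sketch below: load the model and build the prediction engine once, then reuse that engine for every image. The ModelInput/ModelOutput shapes here mirror the training sketch above; Model Builder’s actual generated classes may name things differently, so treat them as placeholders.

```csharp
// Load-once pattern: the expensive model load and engine creation happen a
// single time, and every image after that reuses the same PredictionEngine.
using Microsoft.ML;

public class ModelInput
{
    public string ImagePath { get; set; }
    public string Label { get; set; }          // required by the pipeline, unused at prediction time
}

public class ModelOutput
{
    public string PredictedLabel { get; set; } // "Keep" or "Discard"
    public float[] Score { get; set; }         // per-class confidence
}

public class KeepDiscardClassifier
{
    private readonly PredictionEngine<ModelInput, ModelOutput> _engine;

    public KeepDiscardClassifier(string modelPath)
    {
        // Done once: load the ~93MB model and build the prediction engine.
        var mlContext = new MLContext();
        ITransformer model = mlContext.Model.Load(modelPath, out _);
        _engine = mlContext.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model);
    }

    // Done per image: a single, comparatively cheap Predict call.
    public ModelOutput Classify(string imagePath) =>
        _engine.Predict(new ModelInput { ImagePath = imagePath });
}
```

Worth noting: a PredictionEngine is not thread-safe, so a single-threaded loop can share one instance, but parallel processing would need an engine per thread (or ML.NET’s PredictionEnginePool).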
With the higher volume, I began to see more inaccuracy, so it was time to take the confidence level into account. If a result had less than 90% confidence, I moved it to a third folder called “Maybe”. In the final implementation, these are the images that are neither selected nor deleted; I review them manually.
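In code, that gate is just a threshold on the winning score. A sketch, reusing the hypothetical KeepDiscardClassifier from the previous sketch; the 90% cutoff and the three-way Keep/Discard/Maybe split come from my setup, while the folder paths are illustrative:

```csharp
// Route each image to Keep, Discard, or Maybe based on the winning confidence.
using System.IO;
using System.Linq;

var classifier = new KeepDiscardClassifier("KeepDiscardModel.zip");

foreach (string path in Directory.GetFiles(@"C:\DownloadQueue"))
{
    string folder = Route(path);
    string target = Path.Combine(@"C:\Sorted", folder);
    Directory.CreateDirectory(target);
    File.Copy(path, Path.Combine(target, Path.GetFileName(path)), overwrite: true);
}

// Below-threshold results go to "Maybe" for manual review.
string Route(string imagePath)
{
    var result = classifier.Classify(imagePath);
    float confidence = result.Score.Max();   // confidence of the predicted class

    return confidence < 0.90f
        ? "Maybe"
        : result.PredictedLabel;             // "Keep" or "Discard"
}
```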
After seeing the >90% keeps and discards, I had the confidence myself to run it on my download queue for real. And I sat back and watched as it selected and discarded all by itself. Absolutely amazing. Am I faster? Sure, but I do need to stop every once in a while for rest or food or just from burnout. The AI doesn’t have to.
The last major enhancement I had added to this utility was a “lasso” feature for selecting or discarding images, and my throughput increased by an insane amount. Afterwards, I wondered why I had waited so long to implement that feature; it was a game-changer. This new AI feature is going to be another game-changer on a whole new level.