First Step Into A.I. using ML.NET

I have this side project which, on Reddit, would be considered DataHoarding and altruistically would be called digital archival preservation.  In brief, it is downloading images from online catalogs to preserve them in the event the online store goes out of business, which would result in those images being lost forever.  Pre-Internet, it would be the equivalent of saving and preserving print catalogs from stores, but no one prints catalogs anymore.

With that explanation made, this is a project I have invested years of time into, and over those years, I have developed software that assists me in my tasks.  Initially, the software was a way to stitch large images together from partial tiles when a website wouldn't allow you to download the full resolution image.  Then it expanded to automated downloading, with performance improvements along the way, and then the selection tools improved.  But through all this, the process was still manual.  I actually viewed every single image that was downloaded and made the determination as to whether to keep it or not.  To give you an idea of how many images I have viewed, I download images in batches, usually a sequence of 100,000.  That yields maybe 80,000 images.  After filtering and selection, I have maybe 25,000 images.  In my collection, I have over 1.6 million images.  That's 1.6M images after a roughly 30% keep ratio.  Math that out to figure out how many images I have seen.

To be fair, I've gotten extremely fast at selecting images.  More accurately, I have gotten very fast at spotting images that need to be filtered out.  I can view a page of 60 images and within a couple of seconds determine whether the whole page can be downloaded or whether it needs filtering.  This comes from years of seeing the same images over and over; when something looks out of place, it fires a stop response.

But how fast could this process go if I could just take the identification skill I have built up over years and turn it into code?  Well, that day has arrived with ML.NET.  I am attempting to implement AI into my utility to do some of the easy selection for me and also do some of the easy weeding for me.  The middle ground I’ll still have to process myself, but having any help is welcome.  After all, this is only a one man operation.

To begin, I modified my download utility to save the images I was filtering into two folders: Keep and Discard.  After a short run, I ended up with about 12k images totaling 386MB, containing about twice as many discards as keeps. (I did say I was fast.)  I used VisiPics to remove any duplicates and near-duplicates so the AI would hopefully have more unique imagery to work with.  Then I used the Model Builder tool to train the AI on what was a good image and what was not.  I trained on my local machine, which is reported as an "AMD Ryzen 5 1600 Six-core processor" with 16GB of RAM.  The tool ran through all the images, did some processing, then ran through them again and did another computation, then a third time, and a fourth, a fifth, sixth, seventh, eighth.  Each pass takes a little over half an hour, and the process consistently uses about 75% of my CPU.  After the 3rd pass, I saw mentions that it had a 99% accuracy rate.  If so, that's amazing.  I'm trying to make sense of the results each time, and there's one number that seems to be decrementing each pass.  I hope it isn't a countdown to zero, because it's at 116 after the 7th pass, and that would suggest two more full days of training.

As this process is in its 4th hour, I've been reading all the stuff I should have read before starting.  First, you should have an equal number of images in each category, whereas I have roughly twice as many Discards as Keeps.  Second, it seems that 1000 images per category should be sufficient.  Having 12k total images is definitely working against me here.  On the other hand, my model should end up broad enough to capture all the different images I could come across.  Next time, I will use VisiPics to narrow down the similar items even further.

It finished in 5 hours with an accuracy of 99.63%.  The model ended up being a 93MB ZIP file.  Now to test it out.  After a small struggle over which references to add to my test application, I set it up to run through my previously downloaded queue of images, copying the "Keeps" to one folder and the "Discards" to another.  Then I opened both folders to watch the results and let it rip.  How are the results?  So far, perfection.  Not a single miscategorized image.  I'm not taking the confidence level into account, only the Keep/Discard determination.

Now, while its accuracy is astounding, its speed is quite underwhelming.  This is cutting-edge tech so maybe I’m asking too much, but it’s taking 4 seconds to process one image.  Also, the dependencies are a little heavy.  My little test app is over 100MB in libraries, plus the 90MB model.  222MB in total.  CPU usage is 10-15% while processing an image.  RAM usage is a bit more obscene, using up to 2GB of RAM when processing an image.  Granted, the memory is released right away, but that’s quite a bite for each image.  4 seconds per image is simply not doable.

I examined the generated code and found that I was calling a method that would load the model every call and create the prediction algorithm each time.  So I moved that code into my main program and organized it so it would only load the model once.  Holy crap, what a speed difference.  It’s about 160ms to process an image once the model is loaded and ready to go.  This is now absolutely doable.
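For anyone wanting to do the same, the reorganized code boils down to something like the VB.NET sketch below.  ModelInput, ModelOutput, and their properties stand in for whatever classes Model Builder generated in your project, and the model path is a placeholder; the important part is that the model load and prediction engine creation happen once, not once per image.

Imports Microsoft.ML

' Illustrative stand-ins for the Model Builder generated input/output classes.
Public Class ModelInput
    Public Property ImageSource As String
End Class

Public Class ModelOutput
    Public Property PredictedLabel As String
    Public Property Score As Single()
End Class

Module Classifier
    Private ReadOnly mlContext As New MLContext()
    Private engine As PredictionEngine(Of ModelInput, ModelOutput)

    ' Load the model ZIP and build the prediction engine exactly once.
    Public Sub LoadModel(modelPath As String)
        Dim inputSchema As DataViewSchema = Nothing
        Dim model As ITransformer = mlContext.Model.Load(modelPath, inputSchema)
        engine = mlContext.Model.CreatePredictionEngine(Of ModelInput, ModelOutput)(model)
    End Sub

    ' Each call now only pays for the prediction itself, not the model load.
    Public Function Classify(imagePath As String) As ModelOutput
        Return engine.Predict(New ModelInput With {.ImageSource = imagePath})
    End Function
End Module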

With the higher volume, I began to see more inaccuracy, so it was time to take the confidence level into account.  If the result had less than a 90% confidence, I moved it to a third folder called “Maybe”.  In the final implementation, these would be ones that were neither selected nor deleted.  I would manually review them.
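Continuing the sketch above, the confidence gate is just a check on the winning score before deciding which folder the image lands in.  The Score and PredictedLabel properties are my assumptions about the generated output class, and the 0.9 cutoff mirrors what I described.

' Returns the folder an image should be routed to: Keep, Discard, or Maybe.
' Assumes Imports System.Linq for Max().
Public Function RouteImage(result As ModelOutput) As String
    Dim confidence As Single = result.Score.Max()
    If confidence < 0.9F Then
        Return "Maybe"   ' neither selected nor deleted; I review these manually
    End If
    Return result.PredictedLabel   ' "Keep" or "Discard"
End Function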

After seeing the >90% keeps and discards, I had the confidence myself to run it on my download queue for real.  And I sat back and watched as it selected and discarded all by itself.  Absolutely amazing.  Am I faster?  Sure, but I do need to stop every once in a while for rest or food or just from burnout.  The AI doesn’t have to.

The last major enhancement I had added to this utility was a "lasso" feature for selecting or discarding images, and my throughput increased by an insane amount.  Afterwards, I wondered why I ever waited so long to implement that feature; it was a game-changer.  This new AI feature is going to be another game-changer on a whole new level.

VS Extensions – Writing Code To Write Code

It's a phrase I've used many times: writing code to write code.  This time it was done to save my sanity.  As is pretty evident in my blog, I am a VB.NET developer.  Despite that, I can understand the shrinking usage of the language.  Many of the newer technologies simply do not support VB.NET.  And I guess to a certain degree, that's fine.  VB.NET will continue in the arenas it is well suited for, and it won't be part of newer application designs and models.  So be it.

At my workplace, they've been a VB shop forever.  And I was the longest holdout on that technology, but I've finally caved in to the reality that we're going to have to convert to C#.  It's an ideal time, as everything is kind of in a paused state.  So I've been evaluating code converters, and one that I tried ran for over 30 hours on our primary desktop application, then crashed with a useless error.  In discussing this result with another developer, it was mentioned that C# is effectively Option Strict always, which gave me the idea that the conversion was taking so long, and perhaps crashing, because the converter was attempting to resolve all the implicit conversions in the code.  That's something we should clean up first on our own.

I had a grand idea many years ago to make the project Option Strict compliant, and I remember when I turned the option on, I was floored by the number of errors that needed to be resolved.  You have to understand, this code is 15 years old, written by a variety of developers at varying skill levels, and type safety has never been enforced.  Further, all the state was held in DataTables (in which everything is an object), and there is a plethora of ComboBoxes and DataGridViews, where everything is an object, too.  So while bad coding practices were allowed, the architecture of the application worked against us as well.

When I committed to the fix and turned Option Strict on, I wasn't prepared for the number of errors I had to fix.  It was over 10,000.  That's 10k cases of implicit type conversion that need to be explicitly corrected.  And that's the easy route.  Refactoring the code to actually use strong typing would essentially be a rewrite.  That's not going to happen; I'm not even sure such a refactoring is possible.  But I can turn Option Strict back off and return at a later time if absolutely needed.

After about 1500 instances of typing CInt and CStr and CBool, I finally had enough.  The difficult part was Ctrl-Shift-arrowing to get to the end of the element to be wrapped in the cast.  Sometimes it was easy, and I got quite quick at hitting End and a closing paren, but it wasn't quick enough.  I needed to write an extension that would take whatever was highlighted and wrap it in a cast via a hotkey.  And so that's what I wrote.

Now, with my mouse, I can double-click and drag to select full words, then with Shift-Alt and a key, I can wrap that code.  S for CStr, D for CDbl, B for CBool, etc.  My productivity grew exponentially.  Still though, I have about 8700 more fixes to get through.  A few full days of that and I should be good to go.
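For the curious, the heart of the extension is only a few lines.  Here is a rough sketch of the command handler, assuming a standard VSIX command wired to the Shift-Alt hotkeys and access to the DTE automation object; the names are illustrative, not the actual extension code.

Imports EnvDTE

Module CastWrapper
    ' Wraps the highlighted code in the given cast function, e.g. passing "CStr"
    ' turns  row("Name")  into  CStr(row("Name")).
    Public Sub WrapSelectionInCast(dte As DTE, castFunction As String)
        If dte.ActiveDocument Is Nothing Then Return
        Dim selection = TryCast(dte.ActiveDocument.Selection, TextSelection)
        If selection Is Nothing OrElse String.IsNullOrEmpty(selection.Text) Then Return

        ' Assigning to Text replaces the current selection with the new code.
        selection.Text = castFunction & "(" & selection.Text & ")"
    End Sub
End Module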

403 XMLRPC Error In WordPress Hosted On Windows IIS (Remember Your Security Changes)

This is more of a note for myself.  Hopefully it will end up in a search engine and I’ll be able to find my own solution when this happens again.

Yesterday, I went to post something to one of my blogs that I self-host on a Windows server.  You’re on one of those blogs right now, go figure.  And when I did, I got a very common error: 403 Forbidden when accessing the xmlrpc.php file.  But, I had just successfully posted two days ago!  What could have changed?

I connected to the server and updated the permissions on the folder, since that is usually what 403 means.  That didn’t fix it.  I did an IISRESET.  That didn’t fix it.  I did a full reboot.  That didn’t fix it.  I kept searching for information.

When you search for WordPress and xmlrpc, you get a crap-ton of results that warn of the dangers of leaving this file exposed.  It's a gateway for hackers!!!!!!  It's so hackable!!!!!  Turn it off!!!!  Disable any remote capabilities to WordPress!!!!  These warnings are geared toward users who use and reuse weak passwords.  Of course there is authentication on anything xmlrpc does, so if you're getting hacked, it's your fault.  Granted, there's no protection from a hacker brute-forcing your xmlrpc URL until it finally hits something, but again: strong, unique passwords, people!

With that little rant out of the way, I do recognize the high sensitivity of this file.  It would not be unlike me to take some extra precautions on it.  And that’s exactly what I had done, and what I had forgotten I’d done.

I set up an IP address restriction on just that file, so only my home PC could access it.  That makes sense since I only post using Live Writer from home.  However, my IP address from my ISP changed in the last couple days and I didn’t notice.  This is how I fixed it.  If you instead want to use this info to secure your own WP site (and forget about it until your IP changes), go ahead.

In IIS, I navigated to the xmlrpc.php file in Content View.

[Screenshot: IIS Content View showing xmlrpc.php]

After highlighting the file, I right-clicked and chose Switch to Features View.  You can see that the specific file name is in the title.  I chose IP Address Restrictions.

[Screenshot: Features View for xmlrpc.php with the IP Address Restrictions feature]

And in here is where I added/changed my home IP address.  You can find yours at https://www.whatismyip.com.

[Screenshot: the allowed IP address entry]

And as you would expect, I am able to post to blogs again, as evidenced by this post right here.
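As a side note, if you would rather keep this restriction in source control than click through IIS Manager, the same thing can be expressed in the site's web.config.  This is a sketch, assuming the IP and Domain Restrictions feature is installed and the section is unlocked; the address is a placeholder for your real home IP.

<location path="xmlrpc.php">
  <system.webServer>
    <security>
      <!-- Deny everyone except the single allowed address -->
      <ipSecurity allowUnlisted="false">
        <add ipAddress="203.0.113.10" allowed="true" />
      </ipSecurity>
    </security>
  </system.webServer>
</location>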

Web Request Always Redirects, While Web Browser Does Not

Today I had a breakthrough on an issue that has been stewing for many months.  Every once in a while, I would pick the project back up again and see if I could approach it differently and maybe come up with a solution.  Today is the day.

The problem was in one of my homebrew personal utility programs, used to download images from the Internet.  One day, when I went to run the utility, it suddenly started failing and I could not get it to resume.  When I would request the same URL in a web browser, the download worked without any problem, but trying to download using code was a no-go.

Clearly, there was a difference between handmade code and a web browser; the task was to determine what that difference was in order to make the code better emulate the browser.  I added many HTTP headers, trying to find the missing piece.  I tried capturing cookies, but that had no effect either.  I even used a network monitor to capture the network traffic and couldn't see any significant difference between my code and a browser.

Today, starting from scratch in a new project, I tried again.  To get the basic browser header that I needed, I visited a web site, like I had done in previous attempts.  This particular web site returned a number of headers I had never seen before, which I found very curious.  When I duplicated all those headers in my code, my requests worked.  Eliminating the headers one by one identified the one that I needed: X-Forwarded-Proto.

This header has been around for nearly a decade, although I had never heard of it, nor seen it implemented anywhere.  It seems to be specific to load balancers, which is actually relevant, since the image host I am calling runs through a load balancer.  I was able to add the header to a standard WebClient, without needing to drop down to HttpWebRequest-level code, and my requests started working again.
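For anyone who hits something similar, the fix amounted to one extra header on the request.  A minimal sketch, with placeholder URLs and a User-Agent carried over from my earlier experiments:

Imports System.Net

Module ImageFetcher
    Public Sub DownloadImage(url As String, destination As String)
        Using client As New WebClient()
            ' The header the load balancer apparently wanted to see.
            client.Headers.Add("X-Forwarded-Proto", "https")
            ' A browser-like User-Agent, kept from earlier troubleshooting.
            client.Headers.Add(HttpRequestHeader.UserAgent, "Mozilla/5.0")
            client.DownloadFile(url, destination)
        End Using
    End Sub
End Module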

I am not sure if the obscurity of this header was a browser-sniffing check by the load balancer to weed out bots and automated download tools (of which my utility would be classified), or whether it was just part of a new, stricter set of w3c standards that got included in a software update.  Realistically, it didn’t break anything for their users since their primary traffic is web browsers.

Utility: HardLinker

Many years ago, I used to use ACDSee for image viewing and have since switched to Faststone Image Viewer.  Not all that long ago, I had sent an email to their “comment box” with praise and a suggestion (the praise was to hopefully get my suggestion implemented).  Despite not hearing back from them, it didn’t stop me from brainstorming solutions to my problem.  I have found a way to address my needs, by writing a utility.

To summarize my request, I was looking for some sort of photo management like “albums” or “collections”.  For a pure image viewing application like Faststone, I can understand the trepidation the developers would have in implementing that feature.  It would be sure to divide their userbase as to whether it was done properly and whether it should have been done at all.

You can organize photos into folders on your hard drive, sure, but what if you want a photo in two or more albums?  You end up with duplicate files wasting space and possibly getting out of sync.  So my suggestion was that the file browser in FastStone gain the capability to parse Windows shortcuts, which are just small LNK files with pointers to the original file.  That way, you could put shortcuts into album folders and keep the originals where they are.

Recently, I stumbled on the concept of symbolic links, which are like shortcuts, but operate at a lower level.  A symbolic link looks just like a file to an application.  It doesn’t know any different.  By using symbolic links, I could get the functionality I needed with no application changes to Faststone.  I tested this concept and it was completely successful.  So I set about trying to figure out how to build something that I could use with Faststone.

Faststone has a way to call external applications for editing photos.  I could use this same feature to call a utility to generate symbolic links in a selected folder.  My first version of the utility was very simple.  It simply opened a FolderBrowserDialog and wrote links for the files specified on the command line.  However, FastStone had a limitation.  It would not send multiple filenames on a command line.  Instead it would call the external application over and over, once per selected file.

The second version of the utility would capture the multiple calls and write the links in one pass, as long as you left the dialog open until FastStone was done calling the external application.  This was OK, but I couldn't get any visual feedback on how many files were queued up while the dialog was open.  So I had to create my own folder browser and use my own form, which would not be opened modally.

As with any attempt to recreate a standard Windows control, the results are less than perfect, although serviceable.  The result is a utility called HardLinker.  HardLinker will remember the last folder you used as well as the size of the window.  The number of queued files will climb as the utility is called multiple times.

[Screenshot: the HardLinker window]

Under normal usage, you need to be an administrator to create a symbolic link, so HardLinker needs to be run as an administrator.  However, in Windows 10, you can now enable developer mode and create links without being in admin mode.  See https://blogs.windows.com/windowsdeveloper/2016/12/02/symlinks-windows-10/ for more information.
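At its core, the link creation is a single Win32 call.  Here is a sketch of roughly what HardLinker does for each file; the unprivileged-create flag is what makes the developer-mode scenario work, and the helper name is just illustrative.

Imports System.IO
Imports System.Runtime.InteropServices

Module SymbolicLinks
    Private Const SYMBOLIC_LINK_FLAG_FILE As UInteger = 0UI
    Private Const SYMBOLIC_LINK_FLAG_ALLOW_UNPRIVILEGED_CREATE As UInteger = 2UI

    <DllImport("kernel32.dll", CharSet:=CharSet.Unicode, SetLastError:=True)>
    Private Function CreateSymbolicLink(lpSymlinkFileName As String,
                                        lpTargetFileName As String,
                                        dwFlags As UInteger) As <MarshalAs(UnmanagedType.I1)> Boolean
    End Function

    ' Creates albumFolder\<file name> pointing back at the original image.
    Public Function LinkInto(albumFolder As String, originalFile As String) As Boolean
        Dim linkPath = Path.Combine(albumFolder, Path.GetFileName(originalFile))
        Return CreateSymbolicLink(linkPath, originalFile,
                                  SYMBOLIC_LINK_FLAG_FILE Or SYMBOLIC_LINK_FLAG_ALLOW_UNPRIVILEGED_CREATE)
    End Function
End Module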

So, in addition to calling HardLinker from Faststone, I also added a shortcut to the utility in my SendTo folder, so I can use it with Windows Explorer.  Calling it from Explorer is very fast since all the files are on the command line and you don’t have to wait for the multiple instances to keep spinning up.

If you are interested in this utility, you can download it from here.

CUERipper CUE Only Modification

In another hobby of mine, CD collecting, I utilize a few different tools in my collecting and cataloging process.  One of these is Exact Audio Copy (EAC), which I use to rip the CD audio to my computer.  When you collect early-edition CDs as I do, you might run into some CDs that have a special equalization called pre-emphasis, which requires you to apply an inverse equalization to the files after ripping.  Additionally, some early CDs had track subindexes, where you could jump to defined sections within a long track.

EAC, while being a good ripping program, is not able to identify either of these special features, which is why I also use CUERipper.  CUERipper is a part of the CUETools suite, which is an open-source program for ripping, authoring, and modifying CUE files.  CUE files are the true index of a CD’s tracks.  They include information such as indexes, track gaps, and pre-emphasis flags.

However, one of the drawbacks to CUERipper is that you have to rip the entire CD to get the CUE file.  Well, you can cancel the rip early, or you can specify an invalid compressor causing it to error out after the CUE generation.  Neither is an efficient workflow.  So, I downloaded the source code and made some changes specific to what I wanted.

Since I was going to run through a large number of CDs to get the pre-emphasis and index info, I wanted a way to only generate a CUE file.  To make the process even quicker, I wanted CUERipper to start the rip as soon as it had the album metadata to use, and I wanted it to eject the CD as soon as it was done with the CUE file.

[Screenshot: the new CUERipper options]

I added three new options to CUERipper, with obvious meanings:

  • Eject after extraction
  • Start immediately
  • Create CUE file only

The options are useful even if you are doing a full CD rip, but for my needs, the CUE file generation was reduced to inserting a CD, waiting about 30 seconds while the disc indexes are read, then switching CDs after the tray ejects.

If you are interested in using this modified version of CUERipper, you can download it from here.

Youtube-dl Bookmarklet

One of my favorite things is custom protocol handlers.  The idea that you can launch an application with a URL is awesome to me.  I implemented one at my current company for their internal application, and it is far and away the most progressive feature; the business users love it.

One of my side hobbies is digital archiving, which is simply downloading stuff and saving it.  Aside from the archivist mentality of it, I do like to have local copies of items just because you can’t trust the Internet for anything.  Content comes and goes at will, and even with that, connectivity comes and goes, and one thing that keeps coming and never goes is advertising.  So, local copies win out every time in every case.

I've written utilities to assist in my image archiving; for video archiving, I've usually used FreeDownloadManager, but sometimes it doesn't do the job.  I recently learned of an open-source tool called youtube-dl, a command-line tool to download videos from a wide variety of websites.

I downloaded it and tried it out on a few concert videos and the utility worked quite well, so it’s been added into my toolkit.  But, being a command-line tool, it’s a little cumbersome.  What I wanted to do was launch the downloader passing the URL of the page I was currently viewing, which was hosting the video I wanted to download.  This is something I do with my image downloading utilities, too, so it’s a concept not too foreign to me.  It involves custom protocol handlers and bookmarklets.

First thing we need to do is create a custom protocol handler.  For this, ytdl:// is the protocol I chose to use.  I created a registry file (protocol.reg) with the following contents.

Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\ytdl]
"URL Protocol"=""
@="URL:Youtube-DL Application"

[HKEY_CLASSES_ROOT\ytdl\shell]
[HKEY_CLASSES_ROOT\ytdl\shell\open]
[HKEY_CLASSES_ROOT\ytdl\shell\open\command]
@="wscript \"y:\\videos\\youtube-dlWrapper.vbs\" \"%1\""

Notice the execution path in the command.  The paths will obviously be different for different people, but there is a VBScript file that needs to be created at that location to do the actual work.  The contents of that script are as follows.

dim shell
set shell=createobject("wscript.shell")

cmdline="""y:\videos\youtube-dl.exe"" """ & replace(wscript.arguments(0),"ytdl://","") & """"
ret=shell.run(cmdline,8,true)

set shell=nothing

The cmdline is generated to execute the actual youtube-dl utility, passing the URL of the video.  However, the URL in arguments(0) has the custom protocol handler ytdl:// prefixing it, so that is stripped out.

The last part is the bookmarklet itself.  In your web browser, create a new bookmark for whatever page you’re on, then edit the bookmark to give it a better title and to have the following URL:

javascript:location.href='ytdl://'+location.href;

This is how the VBScript ends up with ytdl:// in the address.  To summarize the entire process:

  1. You click the bookmarklet on a page with a video
  2. The URL of the page is prepended with ytdl:// and is navigated to
  3. Your browser recognizes it as a custom protocol and executes the URL on the local system (after appropriate security warnings)
  4. Windows looks up the protocol in the registry and determines it is to launch youtube-dlWrapper.vbs passing the URL as an argument.  The URL still has ytdl:// in front of it at this time.
  5. The wrapper VBScript strips out the leading protocol and passes the remaining URL to the youtube-dl.exe for download.
  6. A command window opens with youtube-dl running and closes when the download is complete.

System.Net.Http Error After Visual Studio 15.8.1 (2017) Update

Today we started getting an error from three web applications.  Two were web services and one was an ASP.NET MVC website.  The error was:

Could not load file or assembly ‘System.Net.Http, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a’ or one of its dependencies. The located assembly’s manifest definition does not match the assembly reference. (Exception from HRESULT: 0x80131040)

The error was occurring on this standard line in the global.asax file:

WebApiConfig.Register(GlobalConfiguration.Configuration)

Looking at the trace statements on the error page, there was a mention of a binding redirect, which I recall seeing in the web.config of the site.  That section in the web config looked like:

<runtime>
    <assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
      <dependentAssembly>
        <assemblyIdentity name="System.Net.Http" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
        <bindingRedirect oldVersion="0.0.0.0-2.2.29.0" newVersion="2.2.29.0" />
      </dependentAssembly>
    </assemblyBinding>
</runtime>

This config block had been in place since December 2015.  It seems to have been added by a NuGet package at that time, possibly with a Framework version upgrade.

However it originally ended up there, removing the config block allowed the sites to launch again.  It’s unknown why this started happening after a Visual Studio upgrade.

Batch Applying MetaFLAC ReplayGain Tags To Recursive Album Folders

I've been ignoring PowerShell for a very long time, even though I know it is the future of scripting.  The structure of the language is very foreign to me, with all of its piping.  But I muddled through the examples I found online and came up with my first usable PowerShell script.

The purpose of the script is to iterate recursively through a folder structure and generate a command statement using all the files in each folder.  In this case, MetaFLAC is being called with arguments to apply ReplayGain tags to all the files in a folder.  To do this effectively, you have to pass all the file names in the folder on one command line.  This is so the overall album sound level can be calculated.
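For context, the command the script builds for each album folder looks roughly like this (the file names are placeholders):

metaflac --add-replay-gain "01 - First Track.flac" "02 - Second Track.flac" "03 - Third Track.flac"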

Without further introduction, here is the script:

<# 
     This script applies album-level and file-level 
     ReplayGain tags to FLAC album folders recursively

    Example usage from PS command line: 
     .\ReplayGain.ps1 -root:"c:\music\FLAC"

    MetaFLAC.exe must be in same folder or in environment PATH 
#>

Param(
    # Fully-qualified path to the top-level folder.
    [string]$root
)

Function CallMetaFLAC($d){
    # Collect the full paths of all FLAC files in this folder.
    $flacFiles = @($d | Get-ChildItem -Filter "*.flac" | ForEach-Object { $_.FullName })
    Write-Host "Processing" $flacFiles.Count "files in" $d

    if ($flacFiles.Count -gt 0){
        # Pass every file in the folder on one command line so metaflac
        # can compute the album-level gain across all of them.
        $metaflacArgs = "--add-replay-gain "

        foreach ($f in $flacFiles){
            $metaflacArgs = $metaflacArgs + """$f"" "
        }

        Start-Process "metaflac.exe" $metaflacArgs -Wait -NoNewWindow
    }

    # Process subfolders recursively.
    foreach($dd in $d | Get-ChildItem -Directory){
        CallMetaFLAC $dd
    }
}

Write-Host "Starting in $root" 
Write-Host

CallMetaFLAC $root

Write-Host 
Write-Host "Ending"

SSRS Projects Fail To Load/Incompatible With Visual Studio 2017

Our team recently began having trouble working with SSRS reports in a VS2017 solution.  The problem began when I opened the report solution and was prompted to upgrade the report project files.  If I didn't upgrade them, I couldn't use them.  Once the project files were upgraded, no one else on the team could use them.

The problems varied from system to system.  Some of the errors said that the project was incompatible.  Some said there was an XML error.

There was a lot of comparing systems and troubleshooting.  Everyone made sure they had SQL Server Data Tools (15.3 Preview) installed.  Some uninstalled and reinstalled this component, but that probably was not a factor.

The two things that really mattered were that the VS Extension for Reporting Services was installed and that it was V1.2 (some had v1.15 and could not open projects until it was upgraded).  Those that uninstalled SQL Data Tools had this extension uninstalled automatically and also had to do a repair install on SQL Server Management Studio.

[Screenshot: the Reporting Services extension in Visual Studio]

The other issue was that the *.rptproj.user files were most likely incompatible with the 1.20 extension.  Deleting the *.rptproj.user files allowed the projects to load.