Development blog

Extra logging and more ARM retries

So this is an odd one: there is one customer whose database sometimes isn’t found via the ARM api even though it was created directly using SQL. This doesn’t seem to happen very often even for this one customer, so I’ve put a bit of extra logging in place and some specific retry code for this one case.
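For the curious, the retry boils down to asking ARM again a few times before giving up when it claims the database isn’t there. A minimal sketch of the idea using Polly (not the production code - the attempt count and delays are made up):

    using System;
    using System.Threading.Tasks;
    using Polly;

    static class NotFoundRetry
    {
        // Keep re-asking ARM for a resource it claims doesn't exist yet before
        // giving up. The attempt count and delays here are made-up values.
        public static Task<T> UntilFoundAsync<T>(Func<Task<T>> lookup) where T : class
        {
            return Policy
                .HandleResult<T>(result => result == null)
                .WaitAndRetryAsync(5, attempt => TimeSpan.FromSeconds(10 * attempt))
                .ExecuteAsync(lookup);
        }
    }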

Hopefully it will help but nobody else has mentioned the issue yet, so here’s hoping.

#azure, #armapi, #why do i need to do this
Thu, 14 Sep 2017 10:23:44 +0100

ARM api retries

This morning’s release improves the retry logic around the Azure Resource Management API. There have been issues with the API at Microsoft being unavailable or returning 500 errors, so I’ve put a complete retry routine in at the lowest level possible.

Hopefully this should make jobs that are linked to Azure Subscriptions more reliable.

If you aren’t already, you should really be using the rather wonderful Polly fault handling library.
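To give a flavour of what a retry at the lowest level looks like, here is roughly the shape of a Polly policy for transient ARM failures - the retry count and back-off are illustrative rather than the exact values in the code:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;
    using Polly;

    static class ArmRetryExample
    {
        // Retry on network failures and on 5xx responses from the ARM api,
        // with an exponentially increasing delay between attempts.
        public static Task<HttpResponseMessage> GetWithRetriesAsync(HttpClient client, string uri)
        {
            var retry = Policy
                .Handle<HttpRequestException>()
                .OrResult<HttpResponseMessage>(r => (int)r.StatusCode >= 500)
                .WaitAndRetryAsync(5, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

            return retry.ExecuteAsync(() => client.GetAsync(uri));
        }
    }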

#earlymorning, #azure, #polly
Tue, 01 Aug 2017 08:16:09 +0100

Secondary storage

So what should have been a simple change seems to have permeated through large swathes of the system, all to allow the -secondary endpoints of geo-redundant Azure storage accounts to be used as Azure Blob sync and Azure Table sources.

Sadly this doesn’t work for Azure Files or database restoration, both of which seem to be Microsoft limitations.
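For reference, reading from the secondary with the classic storage SDK means asking for the secondary location explicitly - something like this sketch (the container name is a placeholder and the real sync code does rather more):

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;
    using Microsoft.WindowsAzure.Storage.RetryPolicies;

    class SecondaryReadExample
    {
        // List blobs from the geo-replicated secondary endpoint of an RA-GRS
        // account. "mycontainer" and the connection string are placeholders.
        static void ListFromSecondary(string connectionString)
        {
            var account = CloudStorageAccount.Parse(connectionString);
            var container = account.CreateCloudBlobClient().GetContainerReference("mycontainer");

            var options = new BlobRequestOptions { LocationMode = LocationMode.SecondaryOnly };

            foreach (var item in container.ListBlobs(null, true, BlobListingDetails.None, options, null))
            {
                System.Console.WriteLine(item.Uri);
            }
        }
    }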

I’ve also put some retry code in this release as occasionally fetching the storage account from an Azure subscription would fail on the Microsoft servers with a 500 error.

Lastly there is a little fix for removing glob matches when editing or cloning a sync job. Previously the only way to remove one was to empty the text box, even if the check box was unticked.

Enjoy, and let me know if you hit any issues.

#geo redundancy, #azure, #release all the things
Tue, 25 Jul 2017 16:22:05 +0100

The tiniest of releases

Just deployed a version with an added try/catch block. Because of the way Azure Files are listed - the checksum and last modified date have to be retrieved separately - a file could be returned in the listing but removed before its checksum was retrieved, causing the job to fail.

This is a limitation of the Microsoft API for Azure Files. The only thing I can think of to improve listing speed is to offer a “file length only” comparison, which would compare files based purely on their length and copy them only when the lengths differ. It seems like it may not be worth the effort - thoughts?
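For anyone curious, the guard is conceptually along these lines - a simplified sketch against the classic storage SDK rather than the actual job code:

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.File;

    class FileListingExample
    {
        // List the files in a share and fetch their properties separately,
        // skipping any file that disappears between the listing and the fetch.
        static void ListWithProperties(CloudFileShare share)
        {
            foreach (var item in share.GetRootDirectoryReference().ListFilesAndDirectories())
            {
                var file = item as CloudFile;
                if (file == null)
                    continue; // directories have no checksum to fetch

                try
                {
                    file.FetchAttributes(); // populates ContentMD5 and LastModified
                    System.Console.WriteLine("{0} {1}", file.Uri, file.Properties.LastModified);
                }
                catch (StorageException ex) when (ex.RequestInformation?.HttpStatusCode == 404)
                {
                    // Deleted after it appeared in the listing - just skip it.
                }
            }
        }
    }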

Also, the archiving is still progressing nicely so in another 20 days or so we should be caught up and I can look at removing old jobs completely as I mentioned before.

#exceptions, #azure files, #azure, #teenytinyrelease
Fri, 07 Jul 2017 17:49:26 +0100

More database changes

So, another change to the behaviour of the database that I’ve just put live. This moves the updates for the job progress into a separate table, rather than the massive 4M-row table.

This should mean that Azure SQL Server doesn’t want to regenerate the stats for the massive table anywhere near as frequently as it has been (two or three times a day). It’s an interesting issue: although the table is very large, the number of rows affected by updates is only in the hundreds. However SQL Server doesn’t seem clever enough to realise that this can’t really affect the stats - it just keeps a count of the number of UPDATE operations performed. I hope that moving all but one of those updates into another table will make things much more stable.

If this works I’ll re-instate the history archiver and start writing a new archiver to remove job progress information older than 2 years, which should decrease the size of my 4M row table as well.

Apart from that I also had to put a quick update live as FastSpring callbacks started failing a couple of days ago. All the callbacks have been retried and succeeded, so if you were missing your latest invoices or modifications to your subscription, they should all be there now.

#dtuhell, #sqlserver, #azure, #fastspring
Mon, 26 Jun 2017 13:23:10 +0100

The pain of DTU

This has not been a good few days under the hood. I still don’t understand how Azure decides its execution plans, but a few days ago two scheduling queries suddenly started misbehaving: instead of applying the WHERE clause to the 100-row table in the join, they started applying it to the 4M-row table.

I’ve deployed at least three different stored procs, each of which worked for a few hours and then went off the rails. I’ve now rush deployed a denormalization of the tables with a perfect index for the two queries that have been misbehaving.

I did watch one attempted fix last night, and as things were all still working at 4am I decided that enough was enough. Of course Azure then changed the query execution plan at 5am. Luckily I don’t sleep much at the moment due to stress, and I had the denormalization ready to deploy just in case.

For now I’ve also shut down the archiver that is meant to be shrinking the database, until I see a few days’ stability.

The visible effects of this should have been minimal, as database queries are wrapped in retries - so as long as a query eventually succeeds, things are mostly ok.

In the past few days I’ve deployed database changes about 4 times, redeployed the archiver 7 times after performance tuning to limit its DTU impact (it doesn’t run in “busy” hours anyway), and finally deployed the scheduler, allocator, and job runner to perform the table denormalization.

It’s been a hard weekend, not helped by it being really quite warm in the UK - we’re not used to it.

One question I would like answered: what is a reasonable retention period for job history in the system? I’m thinking of removing any results older than two years. If you have opinions please let me know at richard.mitchell@cherrysafe.com

No holiday since 2014 is starting to get to me.

#azure, #dtu, #whatissleep, #stressed
Tue, 20 Jun 2017 10:04:41 +0100

History archiving is back

It’s taken a while but I’ve finally re-instated the history archiving. This moves finished job history out of the database and into blob storage. I created this when I took over the system from Redgate to reduce the database size, but it was never left running as an automated process - now it is.
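The shape of the archiver is straightforward - serialise the finished history and push it into blob storage so the rows can be dropped from the database. A rough sketch only (the container name is made up, and the real thing handles the database side too):

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Blob;

    class HistoryArchiverExample
    {
        // Write an already-serialised job history record to blob storage so the
        // corresponding rows can then be removed from the database.
        static void ArchiveHistory(string connectionString, string jobId, string historyJson)
        {
            var account = CloudStorageAccount.Parse(connectionString);
            var container = account.CreateCloudBlobClient().GetContainerReference("job-history-archive");
            container.CreateIfNotExists();

            var blob = container.GetBlockBlobReference(jobId + ".json");
            blob.UploadText(historyJson);
        }
    }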

There are a lot of jobs to process so it’s going to take a while, but it should ease pressure on the database gradually.

I’ve also put a little release of the microrun allocator live which should give me more useful logs - I’ve just discovered trends in Loggly, pretty cool.

#istybitsyrelease, #archiving, #lunchtime
Wed, 14 Jun 2017 13:00:23 +0100

Gradual loading history

This morning’s release should be almost invisible. I’ve modified the way job history details are loaded. Previously the page would load all the log lines and then every refresh would reload them all. Now it loads, via ajax, only the lines added since the previous load. This should ease some of the pressure on the database DTU.

This works for both running and finished jobs; for finished jobs the lines are only loaded when you scroll down. It does make it more awkward to get a complete log listing for a job, but if that proves to be an issue I can introduce something else.
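Server side the idea is simply “give me everything after line N”. A hypothetical Web API action to show the shape of it - the route, names and types here are illustrative, not the real ones:

    using System.Collections.Generic;
    using System.Linq;
    using System.Web.Http;

    // Illustrative only: return log lines added since the line the page already
    // has, so a refresh doesn't re-read the whole history.
    public class JobLogController : ApiController
    {
        [HttpGet]
        public IEnumerable<LogLine> GetLines(int jobId, int afterLineNumber)
        {
            return LogStore.LinesFor(jobId)
                           .Where(line => line.LineNumber > afterLineNumber)
                           .OrderBy(line => line.LineNumber);
        }
    }

    public class LogLine
    {
        public int LineNumber { get; set; }
        public string Text { get; set; }
    }

    // Stand-in for the real data access - here just an in-memory list.
    static class LogStore
    {
        private static readonly List<LogLine> Lines = new List<LogLine>();
        public static IEnumerable<LogLine> LinesFor(int jobId) => Lines;
    }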

There is also an attempt to make “scheduled” jobs load more quickly, although I’m unsure whether what I’ve done will make much difference. It should use a smaller index, so I’m hopeful. The job history also no longer touches a table that previously could have caused contention.

So all in all this is the first of a few releases I hope to do to improve underlying database performance.

#dtu, #azure, #ajax, #ihadacold
Wed, 14 Jun 2017 09:53:35 +0100

More error checking

So this is a wrap-up of a few releases that I’ve done recently, some of which were too small individually to comment upon.

Firstly, today’s release adds better checking in the UI for when a job is switched to use an Azure Subscription. Previously, if you were creating jobs it all worked as you’d expect. However when editing or cloning an existing job a lot of the error checks wouldn’t fire, e.g. that your storage account isn’t in the subscription you selected.

Now if you select a subscription that doesn’t contain the storage account, the storage account name is cleared; the same goes for Azure SQL Servers.

Similarly if you switched a job from a subscription to using access keys then it didn’t require you to re-enter the access key - it does now.

I’ve also added a timeout to table backup jobs so that if no new rows are received from Azure storage for a couple of hours, the table backup will fail. This was introduced after a customer entered a table filter that never returned, leaving the job hanging. I can’t reproduce this myself, so although I’ve tested it as much as I can, should it trigger it may not fail with the nice message I expect.
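The timeout is essentially a watchdog: note when the last batch of rows arrived and give up if nothing turns up within the limit. A rough sketch of the idea (not the production code; only the two-hour figure comes from above):

    using System;

    // Tracks when data last arrived and throws once the stall limit is exceeded.
    class StallWatchdog
    {
        private readonly TimeSpan _limit;
        private DateTime _lastProgressUtc;

        public StallWatchdog(TimeSpan limit)
        {
            _limit = limit;
            _lastProgressUtc = DateTime.UtcNow;
        }

        public void NotifyProgress() => _lastProgressUtc = DateTime.UtcNow;

        public void ThrowIfStalled()
        {
            if (DateTime.UtcNow - _lastProgressUtc > _limit)
                throw new TimeoutException("No new rows received from Azure storage for " + _limit + ".");
        }
    }

    // Illustrative usage in a backup loop: create with TimeSpan.FromHours(2),
    // call ThrowIfStalled() before each fetch and NotifyProgress() whenever a
    // batch of rows arrives.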

A customer also hit an issue when trying to back up a SQL Server (not Azure) using an IP address that didn’t have a reverse DNS entry: the UI would incorrectly report an invalid server name.

There was also an issue that I introduced when cleaning up the website recently that meant you couldn’t remove an azure subscription link. I simply had the wrong url but it was affecting users trying to link new subscriptions.

Lastly, the system now allows storage accounts in core.usgovcloudapi.net. Again this was just a UI issue, as most of the work I did to support the German datacenters already allowed arbitrary account names.

#stabilityisking, #azure, #rainydays
Tue, 06 Jun 2017 11:45:54 +0100

Cancelling jobs better behaved

So in this release there are a couple of fixes for poor behaviour when jobs were being cancelled.

The main change is in Azure table backup jobs, where cancelling during a large table would result in an error: “Failure backing up table A task may only be disposed if it is in a completion state (RanToCompletion, Faulted or Canceled).” This was due to a rather naive use of a task when buffering rows from the table. The cancellation should now be much cleaner, and also quicker.
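For anyone who has hit that exception elsewhere: Task.Dispose throws unless the task has already reached a completion state, so the trick is to cancel, observe the task, and only then let go of it. A minimal illustration of the safe ordering (not the backup code itself):

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class CancellationExample
    {
        static void Run()
        {
            using (var cts = new CancellationTokenSource())
            {
                var bufferingTask = Task.Run(() =>
                {
                    while (true)
                    {
                        cts.Token.ThrowIfCancellationRequested();
                        Thread.Sleep(100); // stand-in for buffering the next page of rows
                    }
                }, cts.Token);

                cts.Cancel();

                try
                {
                    bufferingTask.Wait(); // let the task reach the Canceled state
                }
                catch (AggregateException)
                {
                    // expected - the wait surfaces the cancellation
                }

                bufferingTask.Dispose(); // safe now the task has completed
            }
        }
    }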

There was also an incorrect error being returned when cancelling an Azure SQL Database restore as the cancellation exception was being thrown inside a try/catch block.

These were both spotted in the system logs rather than customer feedback.

Back to working on the job running engine after a lot of time spent on the website recently.

#logs, #azure table backup, #azure
Wed, 03 May 2017 12:08:40 +0100

More customer api

So this is the second release of the preview customer api. This should now be easy to discover for those who want to give it a try.

This authenticates via api keys which can be configured from the website settings.

There is example code in github for the api - https://github.com/cherrysafe/customer-api

The api itself is documented via swagger - https://www.cherrysafe.com/swagger/ui/index
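Calling it boils down to sending your api key with each request. Purely illustrative - the header name and route below are placeholders; the swagger doc and the github sample have the real details:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class CustomerApiExample
    {
        // Hypothetical sketch: list scheduled jobs using an api key generated in
        // the website settings. "api-key" and "api/jobs" are placeholders.
        static async Task<string> ListJobsAsync(string apiKey)
        {
            using (var client = new HttpClient { BaseAddress = new Uri("https://www.cherrysafe.com/") })
            {
                client.DefaultRequestHeaders.Add("api-key", apiKey);
                return await client.GetStringAsync("api/jobs");
            }
        }
    }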

This has taken a lot longer than I hoped as the interaction between web api/swashbuckle/swagger/autorest still seems a little hard. The web api itself is considerably smaller than the config for swashbuckle. A few of the issues I faced were…

  • api-version parameters
  • Polymorphism
  • ReadOnly objects don’t support POST
  • api-key parameter
  • Examples for request conflict with response
  • Namespace and class name clashes cause horrible names

Still, that’s behind me now. Time to move it forward with your input, as I hope this forming object model will move down into the database to solve some of the scalability issues I’m facing.

#swashbuckle, #swagger, #webapi, #autorest, #thingsshouldntbethishard
Tue, 25 Apr 2017 14:51:59 +0100

More website updates

This update changes the way help is offered in the tools: rather than hover pop-ups, it now uses modal windows. This works much better on both smaller and larger screens, so I’ve been able to spruce up the appearance of the help.

Also in this update is a customer API for listing scheduled jobs, creating new jobs (sql azure to azure only currently), cancelling jobs, and retrieving jobs. This is a v0 of the api and if you want access to it let me know and I’ll send some documentation.

The plan is to encompass all the job types, use the api in the webpages directly, and maybe use these new objects in the database itself, as there are some scaling issues I need to address and now is a good point to change the serialisation.

So again, although it looks small, the new api is many thousands of lines of code that currently form a parallel system.

That was pretty intense.

#webapi, #inthezone, #sqlazure, #backups
Thu, 13 Apr 2017 13:31:36 +0100

Servers rebooted...twice

Well, that was nice - it looks like Microsoft Azure decided to reboot my main worker virtual machines in the night, twice.

I suppose I shouldn’t complain too much as this is the first time I remember this happening on such a scale (it does happen very occasionally).

Oh well, this was just to let you know what happened.

Tue, 11 Apr 2017 05:53:28 +0100

New website

Although you may not notice, I’ve just put a new website live. The only major difference you should see is that the website is now responsive on smaller screens. Should you spot any problems, let me know and I can fix them up quickly.

The web technologies that the site was built on were about 4 years out of date so I’ve updated to the latest stable components where I can. This has involved changing 700 files and deleting several hundred - no small feat I think.

There is also enhanced logging on the website, which is something that has been sorely missed on occasion. I should now be able to be just as preemptive fixing issues on the website as I have been on the main services.

I have completely rewritten the less/css for the site as well so you may notice a few minor appearance changes, nothing too drastic I hope.

I’ve not done a huge amount of browser testing but as long as you’re using the latest version of your browser you should be ok. For example I’m now using svg sprites for the logos, which are both responsive and look good on a high dpi monitor.

It has reminded me what a mess current web development has become with so many competing “standards” that only seem to last a few years. So although I’m up to date today I imagine I’ll have to do something similar each year to keep up.

A few issues you may encounter on smaller screens.

  • Interactive help on the tool forms may appear off page
  • Create a new job is vertically stacked which is a bit odd
  • Some text is very large

I still have some work to finish off (like the interactive help, and removing the last icon font in favour of svg sprites) but I decided I’ve been sitting on this release for too long already so it’s best to get it out there.

This has been a pretty intense month of work doing this, but at least the website has had a thorough cleanup and I’m used to the code base again.

Just to re-iterate, if you spot any issues please let me know.

#azure websites, #hipster, #grunt, #bower, #npm, #bootstrap, #jquery, #less
Sat, 08 Apr 2017 11:23:33 +0100

Partial Azure table backup

So this is a big one. I’ve added the ability to specify a table filter when backing up Azure table storage. This adds a lot of new potential use cases for the backup feature.

Firstly, this is an advanced feature, so if you’re not confident try playing around with queries using the rather excellent Storage Explorer from Microsoft.

You can, for example, back up only rows that have been modified within a certain timeframe based on UtcNow.

eg PartitionKey eq 'richard.mitchell@cherrysafe.com' and StartTime gt datetime'%%DateTime.UtcNow.AddDays(-30)%%'

You could also remove old rows from the table by specifying “remove rows” together with something like the following.

eg Timestamp lt datetime'%%DateTime.UtcNow.AddYears(-1)%%'

This is of course dangerous, so be careful. Azure table queries are notorious for timing out and for being case sensitive, so be sure your queries return in a timely manner and are formatted correctly. For example, an easy trap to fall into is not including the type specifier before a datetime or other typed value.

Have a very good read of the MSDN documentation on Querying Tables and Entities and make sure you understand that certain queries are much more intensive than others.
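If you want to check what a filter returns before trusting it to a backup job, the same query is easy to try from the classic storage SDK - roughly like this (the table name and cut-off are just examples):

    using System;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    class TableFilterExample
    {
        // Count rows matching the "older than a year" style filter, i.e. the
        // equivalent of: Timestamp lt datetime'<one year ago>'
        static void CountOldRows(string connectionString, string tableName)
        {
            var table = CloudStorageAccount.Parse(connectionString)
                                           .CreateCloudTableClient()
                                           .GetTableReference(tableName);

            string filter = TableQuery.GenerateFilterConditionForDate(
                "Timestamp", QueryComparisons.LessThan, DateTimeOffset.UtcNow.AddYears(-1));

            var query = new TableQuery<DynamicTableEntity>().Where(filter);

            int count = 0;
            foreach (var entity in table.ExecuteQuery(query))
            {
                count++;
            }

            Console.WriteLine("{0} rows match the filter", count);
        }
    }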

Also included in this release is hopefully a fix for a deadlock that was happening very occasionally when backing up a table and a retry was requested. It’s something I’ve been keeping an eye on. There was also an issue where cancelling a table backup job would continue to fail each table separately which I also believe I’ve fixed.

Lastly, the glob match preview and the new Azure table date filter preview are now performed as you type rather than requiring a button press, thanks to the rather excellent jquery.bindWithDelay.

#azure, #azure table backup, #filter, #retention, #scary
Thu, 16 Mar 2017 11:52:31 +0000

The little things

Tiny update this morning, mainly to test my deployment mechanisms for updating the website. This just moves a few callback apis into proper web apis, including changing the url of some Azure AD and FastSpring callbacks.

This starts my move to web api which is a precursor to adding a customer api.

My website deployments used to take 20 minutes and now they take under a minute (including swap time). Got to be happy with that. The difference between classic web roles and azure websites.

#azure, #azure websites, #thelittlethings, #pre breakfast
Tue, 14 Mar 2017 08:56:10 +0000

DNS Switchover

Just kicked off the DNS switchover to the new Azure Website; this should happen over the next few hours. If you experience any issues, flush your DNS cache or use cs-website.azurewebsites.net in the interim.

I’ll be shutting down the old web role later today. Still, what are weekends for?

Sat, 11 Mar 2017 06:23:56 +0000

Website to Azure websites

In case you didn’t realise it, Cherry Safe in its current and previous incarnations has been around a while. As a consequence, sometimes I need to move sideways. Today is one of those days, as I’m moving the website from an Azure Web Role to an Azure Website.

Luckily I can feed some of the learning from recent work on the monitoring website into this move.

So tomorrow morning (UK time) I’ll actually move the DNS for the website over to the new location. It does seem to be working now that I’ve found the appropriate documentation - https://docs.microsoft.com/en-us/azure/app-service-web/app-service-custom-domain-name-migrate

One of my pain points has been updating the web role, as it takes about 20 minutes with a small amount of downtime every time (even though this shouldn’t happen, it does). This move should make things much simpler, quicker and more reliable.

You can try out the website now if you want by visiting https://cs-website.azurewebsites.net/

It’s a bit scary, hence me doing the actual move over the weekend to make sure everything works out. Once I’m sure it’s all ok I’ll turn off the old website.

#azure, #nail biting
Fri, 10 Mar 2017 10:53:05 +0000

Scaling for restore

I’ve decided to take a break from sorting out some of the code issues for now to develop a new feature.

This one has been waiting for a while and allows an Azure SQL Database to be restored at one tier and then have the tier reduced once the restore has completed.

Just configure your restore tier and your final tier. Once the restore has completed, the system sends the scale request to Azure and the job completes immediately, without waiting for the scaling operation to finish.
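For reference, you can do the same scale-down yourself with T-SQL against the logical server; like the request the system sends, ALTER DATABASE returns immediately while the scaling carries on in the background. A rough sketch (the database name and tier are examples):

    using System.Data.SqlClient;

    class ScaleDownExample
    {
        // Request a lower service tier for a database. ALTER DATABASE returns
        // straight away; the scaling operation continues in the background.
        // Run this while connected to the server's master database.
        static void ScaleDown(string masterConnectionString)
        {
            using (var connection = new SqlConnection(masterConnectionString))
            using (var command = new SqlCommand(
                "ALTER DATABASE [RestoredDb] MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S1');",
                connection))
            {
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }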

Hope this is of some use.

#quickwins, #azure, #sqlazure, #restore
Mon, 06 Mar 2017 14:17:50 +0000

Cleanup #3

The “last for now” cleanup has just gone live. This has removed the old non-microrun mechanism for running jobs, which was just making the code harder to understand. I’ve also taken the time to rewrite the monitoring website for the system, which also had several thousand lines of unused code.

There shouldn’t really be any visible changes as a result of this release as it was mainly internal housekeeping, more tests, cleaner code, and simpler deployment. There was however a very intermittent error (a couple of times a month) where a job would fail to start due to a poor query - this has also been changed for a hopefully more efficient query.

I’ve also taken this opportunity to update the Windows services on the microrun machines; although there wasn’t an issue with them, it’s nice to run newer code - the previous release of that component was June last year.

I did get a little distracted trying to sort out the history table - a mere 60Gb and rising fast - but I’ll save that for another time as I really need to change the clustered index of the table to be able to get it under control.

As an aside, I’ve also started playing with my 3D printers again. It’s nice to feel I have time for hobbies now things are a bit more stable. The microrun system is really the best maintenance and development idea the system has had since its inception.

#azure, #cleanup, #3d printing, #morecoffee
Tue, 28 Feb 2017 10:36:22 +0000