Try try again.
Sometimes I feel my entire life is spent writing various exception handling and retry logic. This seems to be the way when communicating with services in “the cloud”.
So today’s release reworks some retry logic around backing up and restoring Azure SQL Databases. I think Microsoft may have changed the JSON return format for errors, so I’ve changed things to support the new format. The mismatch was causing some issues with errors being hidden and with retries in the event of failure.
Before I had a retry in a loop in a retry, now I merely have a retry in a retry. The rather excellent Polly library helps simplify things.
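For anyone wondering what a "retry in a retry" looks like, here is a minimal sketch. Polly is a .NET library and the real code uses it; this is hand-rolled Python purely for illustration, and the names `retry`, `run_export`, `TransientError` and all the attempt counts and delays are my own assumptions rather than anything from the live system.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a short-lived service failure."""


def retry(
    action: Callable[[], T],
    attempts: int,
    delay_seconds: float,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action up to `attempts` times, sleeping between failed tries."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            sleep(delay_seconds)
    raise ValueError("attempts must be at least 1")


def run_export(
    start_export: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    def inner() -> str:
        # Inner policy: a few quick retries for brief blips.
        return retry(start_export, 3, 1, (TransientError,), sleep)

    # Outer policy: fewer, much longer waits if the inner policy gives up.
    return retry(inner, 2, 60, (TransientError,), sleep)
```

The point of the two layers is that short blips are absorbed cheaply by the inner policy, while the outer one only kicks in when the service is down for longer.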
I’ve also increased the duration of retries in the event of failure to start export or import operations, trying to cope with unexpected seemingly transient errors from the Microsoft Import/Export service.#retry, #azure, #sql database, #polly
Mon, 13 Nov 2017 12:14:03 +0000
Firstly I’ve added a secondary setting for email notifications which allows failure emails to go to a different address. This was suggested by a customer and seemed like a good idea, so there it is.
Also, and I know this is way overdue, you can now specify multiple addresses using commas. It’s embarrassing that it’s taken this long to implement but all that was needed was a few changes to the UI. The system now uses a simplified regex to check for validity.
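As a rough idea of the approach, here is a small Python sketch of splitting on commas and checking each address with a deliberately simple regex. The pattern and the function name `parse_addresses` are assumptions for illustration, not the expression the system actually uses.

```python
import re
from typing import List

# Deliberately simple: something@something.tld, with no spaces or commas.
SIMPLE_EMAIL = re.compile(r"^[^@\s,]+@[^@\s,]+\.[^@\s,]+$")


def parse_addresses(raw: str) -> List[str]:
    """Split a comma-separated string and validate every address in it."""
    addresses = [part.strip() for part in raw.split(",") if part.strip()]
    if not addresses:
        raise ValueError("at least one address is required")
    invalid = [a for a in addresses if not SIMPLE_EMAIL.match(a)]
    if invalid:
        raise ValueError("invalid address(es): " + ", ".join(invalid))
    return addresses
```

A simple pattern like this will accept some addresses that are not strictly valid, which is usually a better trade than rejecting real ones.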
I’ve made some changes to error log handling as there were a couple of unprotected cast operations in exception blocks that were blocking true error messages. These have been intermittent so I don’t think too many people were impacted by these issues.
Coming up next is probably going to be a major change to the way sync jobs work behind the scenes. Instead of storing two file lists of up to 50GB it will use one database that is added to up to 50GB and removed from for every file that doesn’t need syncing. This means my peak temporary storage will be much reduced and the dynamic use will also be much lower. Hopefully this won’t make too much difference in speed of jobs but I have to do this to support more job runners around the world and to allow me to use smaller job runners for cost reasons.#email notifications, #long overdue
Tue, 07 Nov 2017 10:47:04 +0000
Last night I put a hotfix live for a customer as the system could no longer identify the datacenter for Azure SQL Databases in Australia East.
This was due to Microsoft changing the dns format from australiaeast1-a.control.database.windows.net to cr1.australiaeast1-a.control.database.windows.net - notice the extra cr1. at the beginning.
As data centers are located by checking if the DNS .StartsWith() this failed, so last night I changed it to .Contains().
However, this morning the error logs (and a customer bug report) indicated an issue. The system was now mis-identifying southcentralus1 and northcentralus1 as centralus1 - of course you can see why.
So this morning I’ve just put another fix live that should fix the fix.
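To make the two failures concrete, here is a Python sketch of all three matching strategies. The region list, the southcentralus1 hostname (written by analogy with the Australia East one) and the `by_label` approach are assumptions for illustration; the post doesn't say what the final fix actually was.

```python
from typing import Optional, Sequence

# Hypothetical region list; the real system will know many more.
REGIONS = ("australiaeast1", "centralus1", "northcentralus1", "southcentralus1")


def by_starts_with(host: str, regions: Sequence[str] = REGIONS) -> Optional[str]:
    """Original check: breaks when a prefix such as cr1. is added."""
    return next((r for r in regions if host.startswith(r)), None)


def by_contains(host: str, regions: Sequence[str] = REGIONS) -> Optional[str]:
    """Hotfix check: centralus1 is a substring of southcentralus1."""
    return next((r for r in regions if r in host), None)


def by_label(host: str, regions: Sequence[str] = REGIONS) -> Optional[str]:
    """Match a whole DNS label, allowing a suffix such as -a."""
    for label in host.lower().split("."):
        for region in regions:
            if label == region or label.startswith(region + "-"):
                return region
    return None


new_format = "cr1.australiaeast1-a.control.database.windows.net"
south = "southcentralus1-a.control.database.windows.net"

print(by_starts_with(new_format))  # None: the extra cr1. defeats it
print(by_contains(south))          # centralus1: the wrong datacenter
print(by_label(new_format), by_label(south))
```

Matching on whole labels sidesteps both problems because it no longer matters where in the hostname the region appears, nor whether one region name is contained in another.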
Moral of the story - even if it’s a one line change with tests (yes, I wrote tests), don’t put something live without thinking it through properly.
If you had backups start between about 10pm and 7am UTC you can rerun them, or next time they should be ok.
If your job history starts with the line ‘Job running on micro runner 2.0.20171027.2312‘ that is the version that was broken.#whoops, #rookiemistake
Sat, 28 Oct 2017 06:42:52 +0100
So this update is a more or less complete re-work of the way that Azure Subscriptions are linked.
The main issue was due to the assumption that once an application was added it would become available instantly via the Graph API in order to ‘Add roles’. This is not true and users were presented with a rather ugly error message.
Now linking goes in three stages:
1) Application is added to AD
2) Application can then be waited for in Graph AD before ‘Add roles’ is used.
3) Or user can choose to add roles directly via the Azure Portal.
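The "wait for it" step is essentially polling with a deadline. A minimal Python sketch of that idea follows; `wait_until`, its parameters and the idea of passing in an `is_ready` check are assumptions for illustration and say nothing about how the Graph API is actually called.

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout_seconds: float,
    poll_seconds: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready until it returns True or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False  # caller can offer the manual route instead
        sleep(poll_seconds)
```

Returning False rather than raising leaves room for the fallback in step 3, where the user adds the roles by hand.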
If you’ve been having issues linking please try again and hopefully it will behave itself better now.#azure, #azure subscription, #nasty ui
Thu, 26 Oct 2017 14:07:59 +0100
This release contains a couple of user requested features.
The first is an option added to S3 targets to enable SSE-S3 at-rest encryption when files are copied to S3. Unlike Azure you can’t do this for a whole bucket, it has to be done for each individual request. This defaults to false as I never like changing default behaviour unless absolutely necessary.
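For the curious, SSE-S3 is requested per upload with the `x-amz-server-side-encryption: AES256` request header (in boto3 that is the `ServerSideEncryption="AES256"` argument to `put_object`). The Python sketch below only builds the headers; the function name and the `use_sse_s3` flag are hypothetical and simply mirror the opt-in default described above.

```python
from typing import Dict


def upload_headers(content_type: str, use_sse_s3: bool = False) -> Dict[str, str]:
    """Build the headers for a single S3 PUT request."""
    headers = {"Content-Type": content_type}
    if use_sse_s3:
        # Ask S3 to encrypt this object at rest with S3-managed keys.
        headers["x-amz-server-side-encryption"] = "AES256"
    return headers
```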
The second is the ability for the temporary copy in SQL Azure to use a different SERVICE_OBJECTIVE to the source database. There is a large caveat here in that it must use the same EDITION so S3->S1 would work but not P1->S0. It was requested in order to create a copy outside of an elastic pool. If you hit any issues please let me know as I don’t have any elastic pools on which to test it myself.
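A rough Python sketch of the same-edition rule, assuming the edition can be read off the objective name (Basic, S-something for Standard, P-something for Premium). The helper names and the database names are hypothetical; the generated statement uses the documented `CREATE DATABASE ... AS COPY OF ... (SERVICE_OBJECTIVE = ...)` form, but I'm not claiming this is exactly what the system runs.

```python
import re


def edition_of(objective: str) -> str:
    """Infer the edition from a service objective name such as S1 or P2."""
    name = objective.strip().upper()
    if name == "BASIC":
        return "Basic"
    if re.fullmatch(r"S\d+", name):
        return "Standard"
    if re.fullmatch(r"P\d+", name):
        return "Premium"
    raise ValueError(f"unrecognised service objective: {objective!r}")


def copy_statement(source_db: str, copy_db: str,
                   source_objective: str, copy_objective: str) -> str:
    """Build the copy statement, refusing a change of edition."""
    if edition_of(source_objective) != edition_of(copy_objective):
        raise ValueError("the copy must use the same edition as the source")
    # Names are not escaped here; this is an illustration only.
    return (f"CREATE DATABASE [{copy_db}] AS COPY OF [{source_db}] "
            f"(SERVICE_OBJECTIVE = '{copy_objective.strip()}')")
```

So `copy_statement("prod", "prod_copy", "S3", "S1")` produces a statement, while going from P1 to S0 is rejected before anything reaches the server.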
These new options are available via the Customer API too (I’m trying to keep this up-to-date).
I’ve also been doing more C# cleanups using pattern matching and out variables, it all just makes the code a little cleaner.#aws, #sse-s3, #encryption, #sqlazure, #customerlove, #s3
Fri, 20 Oct 2017 10:27:00 +0100
This has taken quite a while but now the customer api supports almost all of the available job types on the system. You can get your jobs, create new jobs, modify existing jobs, and cancel jobs. It’s taken a lot of work and I hope people will find it useful. I need to take a breather before I go back and finish the last features.
Also in this update is support for a few more SQL Azure service tiers. There are bugs in the Microsoft Import service that prevent me adding all of them (I only found this after adding them all and it failing in testing, it’s been one of those months).
Now I’m going to work on a couple of small customer features which I hope won’t take me anything like this long to release.#restapi, #timetorest
Fri, 13 Oct 2017 15:29:30 +0100
So this is an odd one, there is one customer who has issues with the database not being found via the ARM api even though it was created using SQL directly. This doesn’t seem to happen very often even for this one customer so I’ve put a bit of extra logging in place and some specific retry code for this one case.
Hopefully it will help but nobody else has mentioned the issue yet, so here’s hoping.#azure, #armapi, #why do i need to do this
Thu, 14 Sep 2017 10:23:44 +0100
This morning’s release improves the retry logic around the Azure Resource Management API. There have been issues with the API at Microsoft being unavailable or returning 500 errors so I’ve put a complete retry routine in at the lowest level possible.
Hopefully this should make jobs that are linked to Azure Subscriptions more reliable.
If you aren’t already you should really be using the rather wonderful Polly fault handling library.#earlymorning, #azure, #polly
Tue, 01 Aug 2017 08:16:09 +0100
So what should have been a simple change seems to have permeated through large swathes of the system. All to allow -secondary azure storage accounts as Azure Blob sync and Azure Table sources.
Sadly this doesn’t work for Azure Files or database restoration both of which seem to be Microsoft limitations.
I’ve also put some retry code in this release as occasionally fetching the storage account from an Azure subscription would fail on the Microsoft servers with a 500 error.
Lastly there is a little fix for removing glob matches when editing or cloning a sync job. Previously the only way this would work was to empty the text box even if the check box was unticked.
Enjoy, and let me know if you hit any issues.#geo redundancy, #azure, #release all the things
Tue, 25 Jul 2017 16:22:05 +0100
Just deployed a version with an added try/catch block. Because of the way Azure Files are listed (the checksum and last modified date have to be retrieved separately), a file could be returned in the list but removed before its checksum was retrieved, causing the job to fail.
This is a limitation of the Microsoft API for Azure Files. The only thing I can think of to improve listing speed is to offer a “file length only” comparison which would only compare files based on their length, copying them only when they differ. Seems like it may not be worth the effort, thoughts?
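The shape of the fix is roughly the following, sketched in Python rather than the real C#. `FileGone`, `collect_details` and the `fetch_details` callback are hypothetical names standing in for whatever the storage client actually throws and returns.

```python
from typing import Callable, Dict, Iterable, Tuple


class FileGone(Exception):
    """Hypothetical error for a file deleted after it was listed."""


def collect_details(
    names: Iterable[str],
    fetch_details: Callable[[str], Tuple[str, str]],
) -> Dict[str, Tuple[str, str]]:
    """Fetch (checksum, last_modified) for each listed file."""
    details: Dict[str, Tuple[str, str]] = {}
    for name in names:
        try:
            details[name] = fetch_details(name)
        except FileGone:
            # Listed, then removed before the second call: skip it
            # rather than failing the whole job.
            continue
    return details
```

Only the specific "file has gone" error is swallowed, so any other failure still stops the job.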
Also, the archiving is still progressing nicely so in another 20 days or so we should be caught up and I can look at removing old jobs completely as I mentioned before.#exceptions, #azure files, #azure, #teenytinyrelease
Fri, 07 Jul 2017 17:49:26 +0100
So another change to the behaviour of the database that I’ve just put live. This moves the updates for the job progress into a separate table - rather than the massive 4M row table.
This should mean that Azure SQL Server doesn’t want to regenerate the stats for the massive table anywhere near as frequently as it has been doing (twice or three times a day). It’s an interesting issue as although the table is very large the number of rows affected by updates is only in the hundreds. However SQL Server doesn’t seem clever enough to realise that this can’t really affect the stats and just keeps a count of the number of UPDATE operations performed. Moving all but one of those updates into another table I hope will make things much more stable.
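To put a number on it: Microsoft's documentation gives the automatic statistics update threshold for recent compatibility levels as MIN(500 + 0.20 * n, SQRT(1000 * n)) modifications for a table of n rows (and a flat 500 for tables of 500 rows or fewer). Whether that exact rule applied to this database is an assumption on my part, but it fits the behaviour described. A quick Python sketch:

```python
import math


def auto_stats_threshold(row_count: int) -> float:
    """Modifications needed before statistics are considered stale."""
    if row_count <= 500:
        return 500.0
    return min(500 + 0.20 * row_count, math.sqrt(1000 * row_count))


# For a 4M row table the square root term wins, at roughly 63,000.
print(round(auto_stats_threshold(4_000_000)))
```

Since the counter tracks modifications rather than distinct rows, a few hundred rows updated over and over can cross that line a couple of times a day.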
If this works I’ll re-instate the history archiver and start writing a new archiver to remove job progress information older than 2 years, which should decrease the size of my 4M row table as well.
Apart from that I also had to put a quick update live as FastSpring call backs started failing a couple of days ago. All the callbacks have been retried and succeeded so if you were missing your latest invoices or modifications to your subscription that should all be there now.#dtuhell, #sqlserver, #azure, #fastspring
Mon, 26 Jun 2017 13:23:10 +0100
This has not been a good few days under the hood. I still don’t understand how Azure decides its execution plans but a few days ago two scheduling queries suddenly started misbehaving. Instead of performing a WHERE clause on a 100 row table in the join it started applying it on the 4M row table instead.
I’ve deployed at least three different stored procs, each of which worked for a few hours and then went off the rails. I’ve now rush deployed a denormalization of the tables with a perfect index for the two queries that have been misbehaving.
I did watch one attempted fix last night and as things were all still working at 4am I decided that enough was enough. Of course Azure then changed the query execution plan at 5am, luckily I don’t sleep much due to stress currently and I had the denormalization ready to deploy just-in-case.
For now I’ve also shut down the archiver that is meant to be shrinking the database until I see a few days stability.
There would have been minimal visible effects of this as database queries are wrapped in retries so mostly as long as it eventually works it would be ok.
In the past few days I’ve deployed database changes about 4 times, I’ve redeployed the archiver 7 times after performance tuning to limit DTU effect (it doesn’t run in “busy” hours anyway), and finally I’ve deployed the scheduler, allocator, and job runner to perform the table denormalization.
It’s been a hard weekend, not helped by being really quite warm in the UK and we’re not used to it.
One question that I would like answered is what is a reasonable retention period for job history in the system? I’m thinking of removing any results older than two years. If you have opinions please let me know at email@example.com
No holiday since 2014 is starting to get to me.#azure, #dtu, #whatissleep, #stressed
Tue, 20 Jun 2017 10:04:41 +0100
It’s taken a while but I’ve finally re-instated the history archiving. This moves finished job history out of the database and into blob storage. I created this when I took over the system from Redgate to reduce the database size but never left it as an automated process, it now is.
There are a lot of jobs to process so it’s going to take a while, but it should ease pressure on the database gradually.
I’ve also put a little release of the micro run allocator live which should give me more useful logs - I’ve just discovered trends in loggly - pretty cool.#istybitsyrelease, #archiving, #lunchtime
Wed, 14 Jun 2017 13:00:23 +0100
This morning’s release should be almost invisible. I’ve modified the way job history details are loaded. Previously it would load all the log lines and then every refresh would reload all the lines. Now instead it loads only those lines added since the last load via ajax. This should ease some of the pressure on my DTU in the database.
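The idea is simply to remember the last line you have and ask only for what came after it. Here is a small Python sketch, assuming each log line carries an increasing sequence number; `LogView` and `lines_since` are hypothetical names and the in-memory list stands in for the database query.

```python
from typing import List, Tuple

LogLine = Tuple[int, str]  # (sequence number, text), assumed ascending


def lines_since(log: List[LogLine], last_seen: int) -> List[LogLine]:
    """Stand-in for the server side: only lines after last_seen."""
    return [line for line in log if line[0] > last_seen]


class LogView:
    """Stand-in for the page: keeps what it has, fetches only the rest."""

    def __init__(self) -> None:
        self.last_seen = 0
        self.lines: List[LogLine] = []

    def refresh(self, log: List[LogLine]) -> int:
        new_lines = lines_since(log, self.last_seen)
        if new_lines:
            self.lines.extend(new_lines)
            self.last_seen = new_lines[-1][0]
        return len(new_lines)
```

Each refresh then costs in proportion to the new lines rather than the whole log.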
This works for both running and finished jobs, for finished jobs the lines are only loaded when you scroll down. It does make it more awkward to get a complete log listing for a job but if that proves to be an issue I can introduce something else.
There is also an attempt to make “scheduled” jobs load more quickly but I’m unsure if what I’ve done will have any improvement. This should use a smaller index so I’m hopeful. The job history now doesn’t access a table that previously could have caused contention.
So all in all this is the first of a few releases I hope to do to improve underlying database performance.#dtu, #azure, #ajax, #ihadacold
Wed, 14 Jun 2017 09:53:35 +0100
So this is a wrap up of a few releases that I’ve done recently, a few of which were too small individually to comment upon.
Firstly today’s release adds better checking in the UI for when a job is switched to use an Azure Subscription. Previously if you were creating jobs it all worked as you’d expect. However when editing or cloning an existing job a lot of the error checks wouldn’t fire eg. your storage account isn’t in the subscription you selected.
Now if you select a subscription that doesn’t contain the storage account then the storage name is cleared, the same goes for Azure SQL Servers.
Similarly if you switched a job from a subscription to using access keys then it didn’t require you to re-enter the access key - it does now.
I’ve also added a timeout in table backup jobs so that if no new rows are received from Azure storage for a couple of hours the table backup will fail. This was introduced after a customer entered a table filter that didn’t return so the job would just hang. I can’t reproduce this myself so I’ve tested this as much as I can but should it fail it may not fail with the nice message I expect.
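One way to build this kind of "no progress" timeout is sketched below in Python. It assumes the source keeps handing back pages, some of them empty, rather than blocking forever inside a single call (a truly hung call would need a request timeout as well). `IdleTimeout`, `with_idle_timeout` and the two hour figure in the usage note are illustrative assumptions.

```python
import time
from typing import Callable, Iterable, Iterator, List


class IdleTimeout(Exception):
    """Raised when no rows have arrived for too long."""


def with_idle_timeout(
    pages: Iterable[List],
    idle_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[List]:
    """Yield non-empty pages, failing if rows stop arriving."""
    last_progress = clock()
    for page in pages:
        now = clock()
        if page:
            last_progress = now  # rows arrived, reset the idle timer
            yield page
        elif now - last_progress > idle_seconds:
            raise IdleTimeout(f"no new rows for {idle_seconds} seconds")
```

Wrapping the page source with something like `with_idle_timeout(pages, idle_seconds=2 * 60 * 60)` leaves normal jobs untouched and only fails the ones that have genuinely stalled.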
Also a customer hit an issue when trying to back up a SQL Server (not Azure) using an IP address that didn’t have a reverse DNS entry: the UI would incorrectly report an invalid server name.
There was also an issue that I introduced when cleaning up the website recently that meant you couldn’t remove an azure subscription link. I simply had the wrong url but it was affecting users trying to link new subscriptions.
Lastly the system now allows storage accounts in core.usgovcloudapi.net, again this was just a UI issue as most of the work I did to support German datacenters already allowed arbitrary account names.#stabilityisking, #azure, #rainydays
Tue, 06 Jun 2017 11:45:54 +0100
So in this release there are a couple of fixes for poor behaviour when jobs were being cancelled.
The main change is in azure table backup jobs where cancelling during a large table would result in an error “Failure backing up table A task may only be disposed if it is in a completion state (RanToCompletion, Faulted or Canceled).” This was due to rather naive use of a task when buffering rows from the table. The cancellation should now be much cleaner, and also quicker.
There was also an incorrect error being returned when cancelling an Azure SQL Database restore as the cancellation exception was being thrown inside a try/catch block.
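The restore problem is a classic: a broad catch swallows the cancellation and reports it as a failure. A minimal Python sketch of the pattern and one way around it follows; `JobCancelled`, `RestoreFailed` and `restore` are hypothetical names, not the system's own.

```python
from typing import Callable


class JobCancelled(Exception):
    """Hypothetical signal that the user cancelled the job."""


class RestoreFailed(Exception):
    """Hypothetical wrapper for genuine restore errors."""


def restore(step: Callable[[], None]) -> str:
    try:
        step()
        return "restored"
    except JobCancelled:
        # Let cancellation pass through untouched so it is reported
        # as a cancellation, not as a restore failure.
        raise
    except Exception as error:
        raise RestoreFailed(f"restore failed: {error}") from error
```

Catching the cancellation first and re-raising it keeps the generic handler for the errors it was meant for.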
These were both spotted in the system logs rather than customer feedback.
Back to working on the job running engine after a lot of time spent on the website recently.#logs, #azure table backup, #azure
Wed, 03 May 2017 12:08:40 +0100
So this is the second release of the preview customer api. This should now be easy to discover for those who want to give it a try.
This authenticates via api keys which can be configured from the website settings.
There is example code in github for the api - https://github.com/cherrysafe/customer-api
The api itself is documented via swagger - https://www.cherrysafe.com/swagger/ui/index
This has taken a lot longer than I hoped as the interaction between web api/swashbuckle/swagger/autorest still seems a little hard. The web api itself is considerably smaller than the config for swashbuckle. A few of the issues I faced were…
- api-version parameters
- ReadOnly objects don’t support POST
- api-key parameter
- Examples for request conflict with response
- Namespace and class name clashes cause horrible names
Still, that’s behind me now, time to move it forward with your input as I hope this forming object model will move down into the database to solve some of the scalability issues I’m facing.#swashbuckle, #swagger, #webapi, #autorest, #thingsshouldntbethishard
Tue, 25 Apr 2017 14:51:59 +0100
This update changes the way help is offered in the tools, rather than via hover pop-up it now uses modal windows. This works much better on both smaller and larger screens so I’ve been able to spruce up the appearance of the help.
Also in this update is a customer API for listing scheduled jobs, creating new jobs (sql azure to azure only currently), cancelling jobs, and retrieving jobs. This is a v0 of the api and if you want access to it let me know and I’ll send some documentation.
The plan is to encompass all the job types and use the api in the webpages directly and maybe use these new objects in the database itself as there are some scaling issues I need to address and now is a good point to change the serialisation.
So again although it looks small the new api is many thousands of lines of code that are a parallel system currently.
That was pretty intense.#webapi, #inthezone, #sqlazure, #backups
Thu, 13 Apr 2017 13:31:36 +0100
Well that was nice, looks like Microsoft Azure decided to reboot my main worker virtual machines in the night, twice.
I suppose I shouldn’t complain too much as this is the first time I remember this happening on such a scale (it does happen very occasionally).
Oh well, this was just to let you know what happened.
Tue, 11 Apr 2017 05:53:28 +0100
Although you may not notice, I’ve just put a new website live. The only major difference you should notice is that the website is responsive for smaller screens. Should you notice any problems let me know and I can fix it up quickly.
The web technologies that the site was built on were about 4 years out of date so I’ve updated to the latest stable components where I can. This has involved changing 700 files and deleting several hundred - no small feat I think.
There is also enhanced logging on the website, which is something that has been sorely missed on occasion. I should now be able to be just as preemptive fixing issues on the website as I have been on the main services.
I have completely rewritten the less/css for the site as well so you may notice a few minor appearance changes, nothing too drastic I hope.
I’ve not done a huge amount of browser testing but as long as you’re using the latest version of your browser you should be ok. For example I’m now using svg sprites for the logos, which are both responsive and look good on a high dpi monitor.
It has reminded me what a mess current web development has become with so many competing “standards” that only seem to last a few years. So although I’m up to date today I imagine I’ll have to do something similar each year to keep up.
A few issues you may encounter on smaller screens.
- Interactive help on the tool forms may appear off page
- Create a new job is vertically stacked which is a bit odd
- Some text is very large
I still have some work to finish off (like the interactive help, and removing the last icon font in favour of svg sprites) but I decided I’ve been sitting on this release for too long already so it’s best to get it out there.
This has been a pretty intense month of work doing this, but at least the website has had a thorough cleanup and I’m used to the code base again.
Just to re-iterate, if you spot any issues please let me know.#azure websites, #hipster, #grunt, #bower, #npm, #bootstrap, #jquery, #less
Sat, 08 Apr 2017 11:23:33 +0100