Development blog

More website updates

This update changes the way help is offered in the tools: rather than via a hover pop-up it now uses modal windows. This works much better on both smaller and larger screens, so I’ve been able to spruce up the appearance of the help.

Also in this update is a customer API for listing scheduled jobs, creating new jobs (SQL Azure to Azure only currently), cancelling jobs, and retrieving jobs. This is a v0 of the API; if you want access to it let me know and I’ll send some documentation.
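
To give a flavour of what using it might look like, here is a purely hypothetical C# sketch - the base URL, route and bearer-token authentication are my assumptions for illustration, not the documented API:

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;

    // Hypothetical sketch of calling the v0 customer API to list scheduled jobs.
    // The endpoint path and auth scheme are assumptions; the real documentation
    // is only available on request, as mentioned above.
    class CustomerApiSketch
    {
        static async Task ListJobsAsync(string apiToken)
        {
            using (var http = new HttpClient { BaseAddress = new Uri("https://www.cherrysafe.com/") })
            {
                http.DefaultRequestHeaders.Authorization =
                    new AuthenticationHeaderValue("Bearer", apiToken);   // assumed auth scheme

                string json = await http.GetStringAsync("api/v0/jobs");  // assumed route
                Console.WriteLine(json);
            }
        }
    }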

The plan is to encompass all the job types, use the API in the web pages directly, and maybe use these new objects in the database itself, as there are some scaling issues I need to address and now is a good point to change the serialisation.

So although it looks small, the new API is many thousands of lines of code and currently runs as a parallel system.

That was pretty intense.

#webapi, #inthezone, #sqlazure, #backups
Thu, 13 Apr 2017 13:31:36 +0100

Servers rebooted...twice

Well, that was nice: it looks like Microsoft Azure decided to reboot my main worker virtual machines in the night, twice.

I suppose I shouldn’t complain too much as this is the first time I remember this happening on such a scale (it does happen very occasionally).

Oh well, this was just to let you know what happened.

Tue, 11 Apr 2017 05:53:28 +0100

New website

Although you may not notice, I’ve just put a new website live. The only major difference is that the website is now responsive for smaller screens. Should you notice any problems, let me know and I can fix them up quickly.

The web technologies that the site was built on were about 4 years out of date so I’ve updated to the latest stable components where I can. This has involved changing 700 files and deleting several hundred - no small feat I think.

There is also enhanced logging on the website, which is something that has been sorely missed on occasion. I should now be able to be just as preemptive in fixing issues on the website as I have been on the main services.

I have completely rewritten the LESS/CSS for the site as well, so you may notice a few minor appearance changes - nothing too drastic I hope.

I’ve not done a huge amount of browser testing, but as long as you’re using the latest version of your browser you should be OK. For example, I’m now using SVG sprites for the logos, which are both responsive and look good on a high-DPI monitor.

It has reminded me what a mess current web development has become with so many competing “standards” that only seem to last a few years. So although I’m up to date today I imagine I’ll have to do something similar each year to keep up.

A few issues you may encounter on smaller screens:

  • Interactive help on the tool forms may appear off page
  • The “Create a new job” page is vertically stacked, which is a bit odd
  • Some text is very large

I still have some work to finish off (like the interactive help, and removing the last icon font in favour of svg sprites) but I decided I’ve been sitting on this release for too long already so it’s best to get it out there.

This has been a pretty intense month of work, but at least the website has had a thorough cleanup and I’m used to the code base again.

Just to re-iterate, if you spot any issues please let me know.

#azure websites, #hipster, #grunt, #bower, #npm, #bootstrap, #jquery, #less
Sat, 08 Apr 2017 11:23:33 +0100

Partial Azure table backup

So this is a big one. I’ve added the ability to specify a table filter when backing up Azure table storage. This adds a lot of new potential use cases for the backup feature.

Firstly, this is an advanced feature, so if you’re not confident try playing around with queries using the rather excellent Storage Explorer from Microsoft.

You can, for example, back up only rows that have been modified within a certain timeframe based on UtcNow.

eg PartitionKey eq 'richard.mitchell@cherrysafe.com' and StartTime gt datetime'%%DateTime.UtcNow.AddDays(-30)%%'

You could also remove old rows from the table by specifying remove rows together with something like:

eg Timestamp lt datetime'%%DateTime.UtcNow.AddYears(-1)%%'

This is of course dangerous so be careful. Azure table queries are notorious for timing out and being case sensitive so be sure your queries return in a timely manner and are formatted correctly. For example an easy trap to fall into is to not include the type specifier before a datetime or other type.

Have a very good read of the MSDN documentation on Querying Tables and Entities and make sure you understand that certain queries are much more intensive than others.
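
For comparison, this is roughly how the same sort of filter is built with the classic storage SDK, which emits the correct quoting and type specifiers for you. This is just an illustration of the query syntax - not how the %%...%% substitution above is implemented:

    using System;
    using Microsoft.WindowsAzure.Storage.Table;

    class TableFilterSketch
    {
        static string BuildFilter()
        {
            // PartitionKey eq 'richard.mitchell@cherrysafe.com'
            string byPartition = TableQuery.GenerateFilterCondition(
                "PartitionKey", QueryComparisons.Equal, "richard.mitchell@cherrysafe.com");

            // StartTime gt datetime'...' - note the SDK adds the datetime type specifier for you
            string byDate = TableQuery.GenerateFilterConditionForDate(
                "StartTime", QueryComparisons.GreaterThan, DateTimeOffset.UtcNow.AddDays(-30));

            return TableQuery.CombineFilters(byPartition, TableOperators.And, byDate);
        }
    }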

Also included in this release is, hopefully, a fix for a deadlock that was happening very occasionally when backing up a table and a retry was requested - it’s something I’ve been keeping an eye on. There was also an issue where a cancelled table backup job would carry on and fail each remaining table separately, which I also believe I’ve fixed.

Lastly, the glob match feature and the new date Azure table filter preview are now performed on input rather than having to press a button, thanks to the rather excellent jquery.bindWithDelay.

#azure, #azure table backup, #filter, #retention, #scary
Thu, 16 Mar 2017 11:52:31 +0000

The little things

Tiny update this morning, mainly to test my deployment mechanisms for updating the website. This just moves a few callback APIs into proper Web APIs, including changing the URL of some Azure AD and FastSpring callbacks.

This starts my move to Web API, which is a precursor to adding a customer API.

My website deployments used to take 20 minutes and now they take under a minute (including swap time). Got to be happy with that - that’s the difference between classic web roles and Azure websites.

#azure, #azure websites, #thelittlethings, #pre breakfast
Tue, 14 Mar 2017 08:56:10 +0000

DNS Switchover

Just kicked off the DNS switchover to the new Azure Website; this should happen over the next few hours. If you experience any issues, flush your DNS cache or use cs-website.azurewebsites.net in the interim.

I’ll be shutting down the old web role later today. Still, what are weekends for?

Sat, 11 Mar 2017 06:23:56 +0000

Website to Azure websites

In case you didn’t realise, Cherry Safe in its current and previous incarnations has been around a while. As a consequence, sometimes I need to move sideways. Today is one of those days, as I’m moving the website from an Azure Web Role to an Azure Website.

Luckily I can feed some of the learning from recent work on the monitoring website into this move.

So tomorrow morning (UK time) I’ll actually move the DNS for the website over to the new location. It does all seem to be working now that I’ve found the appropriate documentation - https://docs.microsoft.com/en-us/azure/app-service-web/app-service-custom-domain-name-migrate

One of my pain points has been updating the web role, as it takes about 20 minutes with a small amount of downtime every time (even though this shouldn’t happen, it does). This should make things much simpler, quicker and more reliable.

You can try out the website now if you want by visiting https://cs-website.azurewebsites.net/

It’s a bit scary, hence me doing the actual move over the weekend to make sure everything works out. Once I’m sure it’s all ok I’ll turn off the old website.

#azure, #nail biting
Fri, 10 Mar 2017 10:53:05 +0000

Scaling for restore

I’ve decided to take a break from sorting out some of the code issues for now to develop a new feature.

This one has been waiting for a while: it allows Azure SQL Database restores to run at one tier and then drop to a lower tier once the restore has completed.

Just configure your restore tier and your final tier. Once the restore has completed, the system sends the scale request to Azure and the job completes immediately, without waiting for the scaling operation to finish.
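
For the curious, a change of tier boils down to a single ARM request against the database resource. The sketch below is not the code the system uses - the resource path, api-version, target objective and token handling are placeholders - but it shows the fire-and-forget nature of the call:

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Text;
    using System.Threading.Tasks;

    class ScaleAfterRestoreSketch
    {
        static async Task RequestScaleAsync(string armToken)
        {
            // Placeholder resource path - subscription, resource group, server and database.
            const string databaseUri =
                "https://management.azure.com/subscriptions/<sub>/resourceGroups/<rg>" +
                "/providers/Microsoft.Sql/servers/<server>/databases/<db>?api-version=2014-04-01";

            using (var http = new HttpClient())
            {
                http.DefaultRequestHeaders.Authorization =
                    new AuthenticationHeaderValue("Bearer", armToken);

                // Ask Azure to move the restored database to its final, cheaper tier.
                var body = new StringContent(
                    "{\"properties\":{\"requestedServiceObjectiveName\":\"S0\"}}",
                    Encoding.UTF8, "application/json");

                // The request is accepted straight away; the scaling itself happens
                // asynchronously, which is why the job can finish without waiting for it.
                var request = new HttpRequestMessage(new HttpMethod("PATCH"), databaseUri) { Content = body };
                var response = await http.SendAsync(request);
                response.EnsureSuccessStatusCode();
            }
        }
    }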

Hope this is of some use.

#quickwins, #azure, #sqlazure, #restore
Mon, 06 Mar 2017 14:17:50 +0000

Cleanup #3

The “last for now” cleanup has just gone live. This has removed all the non-microrun mechanisms for running jobs, which were just making the code harder to understand. I’ve also taken the time to rewrite the monitoring website for the system, which also had several thousand lines of unused code.

There shouldn’t really be any visible changes as a result of this release as it was mainly internal housekeeping, more tests, cleaner code, and simpler deployment. There was however a very intermittent error (a couple of times a month) where a job would fail to start due to a poor query - this has also been changed for a hopefully more efficient query.

I’ve also taken this opportunity to update the Windows services on the microrun machines; although there wasn’t an issue with them, it’s nice to run newer code - the previous release of that component was June last year.

I did get a little distracted trying to sort out the history table - a mere 60 GB and rising fast - but I’ll save that for another time as I really need to change the clustered index of the table to be able to get it under control.

As an aside, I’ve also started playing with my 3D printers again. It’s nice to feel I have time for hobbies now things are a bit more stable. The microrun system is really the best idea for maintenance and development that has been done for the system since its inception.

#azure, #cleanup, #3d printing, #morecoffee
Tue, 28 Feb 2017 10:36:22 +0000

Cleanup #2

Makes a change to do a morning deployment as I normally wait for the quieter period in the afternoon to do these things.

This cleanup is a large-scale change of the namespaces of the live source files (I’d already changed the tests a while back). Mainly this was an automated update to over 500 files, but there were a few special fixes required to keep serialization and type loading backwards compatible. Another large set of C# fixups also went into this release, such as enforcing { } around all blocks.

About the only effect that is visible from this update is that the stack traces will properly report CherrySafe :)

Next cleanup is to remove the old (now unused) way to run jobs. I may also look into changing the website deployment, which I believe would lose its fixed IP in the process.

Fri, 17 Feb 2017 10:07:49 +0000

Cleanup #1

This is going to be the first of about three releases that are just a lot of housekeeping. I finally caved and bought JetBrains ReSharper and have cleaned up a lot of code. No real issues were found; mainly it was removing unreachable code, replacing a lot of string.Format() calls with string interpolation, and making use of var.
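
For anyone unfamiliar with the C# 6 feature, the string.Format cleanup is the mechanical kind of change shown below (the names here are made up for illustration):

    class InterpolationExample
    {
        static string Describe(int jobId, double seconds)
        {
            // Before: return string.Format("Job {0} finished in {1}s", jobId, seconds);
            // After - string interpolation keeps the values in place and reads the same:
            return $"Job {jobId} finished in {seconds}s";
        }
    }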

I also fixed an issue seen in testing where, if the system crashed during an update, it would leave behind a partially extracted new executable which would stop jobs from running. My alerting would have informed me if this ever happened live, and it’s a very small time window, but I thought it best to fix now that I can release so quickly.

Lastly, there is an attempt to fix non-critical errors I’m seeing using the new ARM API; these only show up in the system logs and don’t affect the operation of jobs. I believe this is due to information missing from the Microsoft documentation.

Actually, lastly, I rewrote my deployment scripts, as there was a lot of cut-and-paste in there and, with the new ability to deploy schedulers, I thought it was time to consolidate.

Next up is some work to rename the namespaces and clean out the old (pre-microrun) job running code. This is all working towards a big rewrite of the job history code, which is overly complicated and not ideal for many remote micro-runners.

#springcleaning, #azure, #resharper, #feelingill
Mon, 13 Feb 2017 10:52:12 +0000

Microrun v2

Nothing should be user visible in this release as this was mainly working on the micro-run system that I introduced last year.

This introduces a micro-scheduler, which makes it much easier to update more of the system and should allow me to remove a whole section of complex code that is now unused. Easier deployment should also allow me to start fixing up the database, as some design choices made early on make it larger than necessary and, for example, make it hard to change user login addresses (don’t hold your breath for that - if you do want your login changed, just email support and I can do it for you for now).

It also adds the ability to run jobs in more data centers as they are much easier to deploy - I’m still testing this but I’m thinking West US as the next location.

I also updated a few nuget dependencies but nothing visible.

So in effect there should be no visible changes in this update.

#microservices, #azure, #boyscoutrule
Mon, 06 Feb 2017 15:00:09 +0000

Canada, UK, etc for databases

This has taken far too long to get live, primarily because I wanted to be as sure I could be that I didn’t break anything.

But finally the system can perform Azure SQL backups in Canada, the UK and other newer data centers. You will have to link an Azure subscription, as it uses the Azure ARM API to do the deed (the same way the new portal works). This can also be used in older data centers if you want to, although for now it isn’t required there (Microsoft may change that at some point in the future).

There were a lot of underlying code changes to support this, as you can imagine; a lot of the groundwork was done when I added subscription support for storage accounts a few releases ago.

The reason that it didn’t work before is that Microsoft declined to make public the import/export endpoints in the new data centers so this work had to happen.

There have also been a couple of minor fixes in the intervening time for issues with syncing storage between Azure and S3, around files with square brackets and some metadata name translation issues.

#azure, #scary
Tue, 24 Jan 2017 13:27:59 +0000

A little bit of everything

Happy New Year, thanks for listening :)

Today’s update brings a little something for everybody. It’s another fairly big update even though I took a bit of a break over the Christmas/New Year period which I hope you’ll forgive me for.

Firstly, I’ve done some work on Restore database jobs to hopefully improve their robustness in the event of external failures. A lot of this code was borrowed from the backup job and it follows the tried and true cloud programming paradigm - “everything fails”.

In Sync storage there are a couple of new features. There’s a little bit of UI that allows you to test the behaviour of glob matching - thanks to a customer for suggesting it.

I’ve also added the ability to fail sync jobs should any copies be skipped. This can happen if the source file is deleted between when the file was listed and the time it comes to copy it - the option is “Treat skipped files as failure”. Also, the email alerts now include skipped file counts, which was previously an oversight.

Azure Subscription links are now marked as (optional) rather than (new) in the UI as there was a little confusion for new users.

Customers in the German data centers can now connect to storage accounts. Previously this was blocked by the separate endpoints in Germany (core.cloudapi.de vs core.windows.net). To get this to work, you specify the full account endpoint for German accounts, eg account.blob.core.cloudapi.de. The default behaviour doesn’t change for the more common case of core.windows.net.
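
Under the hood the difference is just the endpoint suffix on the storage connection, something like the following (illustrative only - the account name and key are made up, and this isn’t necessarily how the service builds its clients):

    using Microsoft.WindowsAzure.Storage;

    class GermanEndpointSketch
    {
        static void Main()
        {
            // A German-cloud account uses the core.cloudapi.de suffix instead of core.windows.net.
            var account = CloudStorageAccount.Parse(
                "DefaultEndpointsProtocol=https;" +
                "AccountName=myaccount;" +              // hypothetical account name
                "AccountKey=<base64 account key>;" +    // placeholder key
                "EndpointSuffix=core.cloudapi.de");

            var blobClient = account.CreateCloudBlobClient();
            // blobClient.BaseUri is now https://myaccount.blob.core.cloudapi.de/
        }
    }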

Lastly I’ve made a few internal changes to reduce my error count where these are customer configuration issues - for example missing containers in Snapshot jobs.

#happynewyear, #germany, #azure, #glob, #restore database, #sync
Wed, 04 Jan 2017 13:46:19 +0000

Azure AD integration for Storage

The largest update in a long while, but now you can integrate Cherry Safe Backup directly into your Azure AD. This means that it no longer stores Access Keys but retrieves them as needed instead. You can also turn off access by removing the app from your AD or removing the linked subscription.

This has taken a lot longer than I thought - mainly due to testing and ensuring the integration was as smooth as possible. If you want to modify existing jobs you can simply link the subscription, edit each job selecting the subscription for the storage, and re-schedule. Really simple - I’m quite proud of it.

If you want to further restrict the role you can; the error messages from the system should make it obvious what resources were requested.

Azure AD is a complex beast so if you hit any issues please let me know.

Also in this update…

  • There were issues when setting up Azure Files sync where it would retrieve containers in the UI rather than file shares.
  • URLs in the UI for storage account names should now be blob/table/file depending on which page you’re on.
  • Performing retention policies for Azure SQL Database and SQL Server now restricts file listing, easing memory pressure.
  • Extra example for the glob file limitation - 2016*/**/* - only list and backup files in folders starting with ‘2016’, which includes directories like ‘2016-10’ (see the sketch after this list).
  • SQL Server restores were assuming block blob, which was breaking; they now assume page blob.
  • SQL Server retention would fail if a lease was held on a blob.
  • Failing a SQL Server backup now attempts to break the lease and remove the failed page blob .bak file.
  • SQL Server backup now supports Express edition (this was primarily for my testing).
  • General wording cleanup on the support and features pages.
  • Huge underlying refactor to support the new storage creation mechanism.
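
Since glob patterns trip people up, here is a rough sketch of how a pattern like 2016*/**/* can be read: ‘*’ stays within one path segment while ‘**’ spans segments. This is my own approximation for illustration, not the matcher the service actually uses:

    using System;
    using System.Text.RegularExpressions;

    class GlobSketch
    {
        // Rough glob-to-regex translation: '**' may cross '/' separators, '*' may not.
        static Regex GlobToRegex(string glob)
        {
            string pattern = Regex.Escape(glob)
                .Replace(@"\*\*", "\0")      // protect '**' before handling single '*'
                .Replace(@"\*", "[^/]*")     // '*' stays within one path segment
                .Replace("\0", ".*");        // '**' spans path segments
            return new Regex("^" + pattern + "$");
        }

        static void Main()
        {
            var matcher = GlobToRegex("2016*/**/*");
            Console.WriteLine(matcher.IsMatch("2016-10/logs/db.bak"));   // True - folder starts with 2016
            Console.WriteLine(matcher.IsMatch("2015-12/logs/db.bak"));   // False
        }
    }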

Time for a drink.

#azure, #azure active directory, #azuread, #ineedadrink
Fri, 16 Dec 2016 10:57:20 +0000

Azure SAS Token

This is the first large update for a while but hopefully it’s been worth waiting for. 

This adds support for Azure SAS Tokens for storage backup jobs, tables, blobs, files, and snapshots.

The system recognises SAS Tokens as generated by the new portal - they must start with ?sv= to distinguish them from access keys. There is full documentation available on MSDN https://docs.microsoft.com/en-us/rest/api/storageservices/fileservices/Constructing-an-Account-SAS?redirectedfrom=MSDN

For permissions you can restrict to HTTPS (something the system has only ever used). For source you will generally need the service (eg blob), resource types (service, container, object), and permissions (read, list). Target storage additionally needs mostly full permissions (delete only if necessary, eg snapshot removal). The errors may not be ideal when the permissions are not permissive enough - 403 being the common failure.
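
As a rough guide, a read/list source token of the kind described above can be generated with the classic storage SDK like this - an illustration of the permissions involved rather than a recommendation, and most people will simply copy the token from the portal:

    using System;
    using Microsoft.WindowsAzure.Storage;

    class AccountSasSketch
    {
        static string CreateSourceSas(CloudStorageAccount account)
        {
            var policy = new SharedAccessAccountPolicy
            {
                Services = SharedAccessAccountServices.Blob,                    // service: blob
                ResourceTypes = SharedAccessAccountResourceTypes.Service
                              | SharedAccessAccountResourceTypes.Container
                              | SharedAccessAccountResourceTypes.Object,        // service, container, object
                Permissions = SharedAccessAccountPermissions.Read
                            | SharedAccessAccountPermissions.List,              // read + list for a source
                Protocols = SharedAccessProtocol.HttpsOnly,                     // HTTPS only
                SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddMonths(6)
            };

            // Returns a query string containing the sv, ss, srt, sp and sig parameters.
            return account.GetSharedAccessSignature(policy);
        }
    }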

This update was also compiled against a newer version of the Azure SDK, after my main development machine decided to die about a month ago and I couldn’t download and install the older SDK. It’s taken quite some testing to ensure I’m happy with it, as I remember all the pain points last time I updated the SDK. The only gripe I have this time is the “default on” switch for Application Insights - something to be careful of if it’s not something you want.

Been a stressful time getting this release ready, hope you like it.

#azure, #token, #update, #stressed
Thu, 17 Nov 2016 14:59:36 +0000

The pain of DTU

This week has not been a good week. I’ve broken two computers, had to re-install one when an SDK update went bad (be very careful with the latest 2.9.5 SDK as it doesn’t play nicely side-by-side), and hit numerous issues - the worst of which was a failure in monitoring.

Due to a query for a particular user causing high DTU on the backing database, there were occasional issues when starting jobs. Normally the user would be notified of that failure; however, there was a bug if a certain database call failed. This would manifest as the silent failure of a job run - it would fail to start the schedule properly. The fix was two-fold: adding indexes to the database (the Azure portal got the indexes spot on), and ensuring a failure message gets sent if a job fails to start.

A related issue that only affected two jobs, from what I can tell, is that a job would start and then a different call would fail, causing the system not to enqueue the work. This had the effect of stopping schedules for the job. From now on, if a job has a null or old heartbeat it is deemed to have failed and the schedule will run as expected.
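
The heartbeat rule itself is simple enough that a sketch says it best - something along these lines (not the actual service code, and the threshold is made up):

    using System;

    class HeartbeatSketch
    {
        // A job that has never reported a heartbeat, or whose last heartbeat is older
        // than the allowed silence, is deemed to have failed so its schedule can continue.
        static bool HasJobFailed(DateTime? lastHeartbeatUtc, TimeSpan maxSilence)
        {
            if (lastHeartbeatUtc == null)
                return true;
            return DateTime.UtcNow - lastHeartbeatUtc.Value > maxSilence;
        }
    }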

I’ve also updated the Azure SDK to hopefully fix issues seen with table backup.

Lastly, a fix went in a couple of weeks ago for an issue where the incorrect database could be restored if multiple backups were stored in the same location with dashes in the name. Now the latest backup is determined by the current timestamp in settings.

So why didn’t I spot the issue sooner? It turns out that I was missing a type of failure in my monitoring system. This has now been fixed so I should get alerted to anything like this happening in the future.

Now to calm down a bit, as you can imagine, it’s been a pretty stressful week.

#azure, #monitoring, #dtu
Thu, 20 Oct 2016 15:59:25 +0100

Azure alive

Looks like Azure is mostly alive again. I believe all running jobs may have failed during the outage (not being able to see a database will do that to something).

Let me know if you see anything else.

#im still alive
Thu, 15 Sep 2016 15:48:45 +0100

Azure DNS issues

Just to let you all know that Azure DNS is currently having issues that are having a knock-on effect on the system.

If you’re experiencing problems please keep an eye on the Azure Status page to see what the situation is https://azure.microsoft.com/en-gb/status/

#azure, #problems
Thu, 15 Sep 2016 14:19:33 +0100

West Europe taking hold

The West Europe deployment has been such a success that it has started to hit the limits of the deployed resources. So this morning I increased the resources available, which should reduce the delay in jobs getting dequeued that has been plaguing morning runs this week.

There are a few new Azure features I’m interested in exploring to improve the system’s ability to scale. We’ll see where that takes things, or whether I should just continue scaling in the current way.

#azure, #scaling
Sun, 04 Sep 2016 08:56:02 +0100