Friday 22 September 2017

On the burden of unjustifiable human interfaces

"If something doesn't have an API, it doesn't exist!"

For me, that quote form Mitchell Hashimoto represents one idea: If you don't automate access to resources, you'll be extinct.

It's not about cloud, containers, microservices, serverless technologies, or any other 2017 buzzword that you can come up with.

Those are not the aim. Those are simply tools to help you free up access to resources, to help things move faster.

Below is simply an example (rather specific, albeit) of how inaccessibility of resources can reduce productivity.

1. The problem

 

1.1. The context


A "Trade" data warehouse, running on Oracle 11g database. Moved to Exadata recently.

Multiple types of financial instruments, multiple types of trades, years of data [sub]partitioned daily.

Example of oddities: One of the (badly designed, in my opinion) "dimension" tables holds tens of millions of rows!

Countless number of hints in every SQL statement to make hundreds (if not thousands) of queries per hour work on the legacy (i.e. non-Exadata) system.

Prospects of changing the data model: almost zero.

State of Preprod/UAT DB: much less volume of data, not receiving as much data as the Prod by a large margin.

State of Dev DB: Almost non-existent.

The "DBA" sits in a completely different environment. Might vaguely know about the data model. It's -supposedly- a big corporation.

1.2. The daily workflow

 

Customer complains about queries running slow. The complaint is sent to Dev.

Dev are the people who have written the SQL statements. They know about the data model, with good knowledge of Oracle internals. Dev has no access to Cloud Control or DBA workflow; for example no idea when the backups run last, or if it was running at the time of an incident.

DBA has no visibility over the complaints or DB performance. And there's no point in sending the complaint to the DBA: He can't qualify a SQL statement as "problematic" or "OK," if the query is taking too long to complete, because he doesn't know the data model.

1.3. The example scenario

 

An issue is raised. Dev checks the SQL statement.

As a matter of principle/process runs the statement on PreProd. PreProd doesn't have the same characteristics of the Prod, so the issue doesn't always reappear.

Dev is aware of the qualities of the PreProd DB. Also the customer's not delusional; something IS running slow.

Just imagine that when it comes to executing a join on (e.g.) 10 different row sources, CBO had (twice) chosen to do Nested Loops and tens of thousands of Range Index Scans on one of the massive tables because of incorrect stats on the driving table. Can't get clearer than that, can we?

Trying to establish whether the behavior can re-occur (are we actually dealing with a systematic issue? or just a one-off anomaly?) Dev re-runs the statement on Prod. CBO chooses a "bad" plan, query runs like a dog, can't cancel the query on SQL Developer.

Obviously something's wrong. Any "fix" should be re-examined.

Including the re-examination (if successful, that is) the query has run 3 times on Prod.

SQL Developer is still hanging.

Just a reminder that breaching SLAs in financial sector can lead to fines of millions of pounds by the regulator, at least!

This has to be fixed quickly.

Dev (the same person who understands the data model, knows what's running on his SQL Developer session, and most probably knows why it's taking too long) asks the DBA to kill the session.

The DBA (not familiar with the data model, unaware of the activity): "Could you raise a ticket, please?"

Tick, Tock, Tick, Tock, Tick, Tock...

2. Red Lines and questions

 

2.1. Prod != PreProd

 

In my opinion that's the root cause of the whole issue.

If you're big enough to be in a sector with SLAs which puts you in danger of loosing millions of pounds in fines, you should be big enough to have the EXACT SAME environment in PreProd, as in Prod. If you don't, you're embarrassing yourself.

If you're not a software vendor and your organisation effectively owns the production data, perhaps you should use Data Virtualisation tools such as Delphix.

However, this factor is simply an easy target to criticize; especially if you don't have to deal with it from a pragmatic angel (because you'll not suffer the consequences of missing the SLA, for starters).

If you're not an architect of this system, or you're brought it long after the date that system has become operational, you have to make lemonade using the lemons that life has given you.

2.2. Auditing for the sake of auditing

 

  • Why are "kill this session, please" tickets raised? Are we simply trying to serialize DBA's workload and avoid him getting overwhelmed by a flood of requests every minute? (If yes, perhaps he's controlling too many resources.)

  • Has a "regulator" asked you to do this? If not, is there a clear process of review?
  • What is the value in raising those tickets? 
  • What's the action plan for reviewing those ticket? 
  • Can you set up a down-stream automated process which processes the raised tickets for later analysis? 
  • Is the underlying data in your ticketing system accessible in a programmable way? 
  • Would you be able to chart/categorize your tickets easily? 
  • Can you use your ticketing system to produce quantifiable indexes to measure trends in your organization?
  • How long do you keep those tickets in your system? Why? 
  • How do you ensure that such ticket won't have the same fate as billions of tourist photos which are taken every day which won't be EVER revisited again?

2.3. Role of DBA

 

  • What kind of DBA deals with such tickets? Is he/she a DR/HA DBA? If they're not application DBAs (who's familiar with data/business model), why not?
  • What is the process of communicating with DBA? How many DBs he's responsible for? How many teams/products/services is he responsible for? Does Dev know about his responsibilities? Is he part of the team?
  • How does he contribute to [increasing] productivity? Has he turned to a bottleneck? If yes, why? Too much responsibilities? Is he not eager to be part of the product?

2.4. The value of involving DBA in menial tasks

 

  • Why should the DBA deal with such tickets? What's the difference, if Dev does it and it's logged?

If you are a DBA who knows that menial tasks can be automated AND logged in a way that satisfies "process" requirements and you still choose to receive tickets, perhaps you SHOULD feel threatened by the notion of "autonomous DBs"...

The world has changed, everybody is automating menial tasks, anyway!

2.5. The value of being involved in "non-menial" tasks

 

  • If you're the only person who's privileged to (e.g.) deploy a package, would you be held responsible, if the package causes data loss? If Dev will eventually be held responsible, then why Dev is not deploying the code?
  • Do you -the DBA- know how use tools such as Jenkins to enable the Dev do the deployments in a safe manner? Do you want to enable Dev to move things faster in a safe manner? or you feel too proud of being the only person who does the deployments?
  • Do you -the DBA- do code reviews? Do you clearly know what does the code that you're deploying do? If not, then why are you deploying it,  anyway?!

Conclusion

 

I'm sure all of us can see the problems hidden in the answers to some of the questions above. Obviously the elephant in the room is the segregation of DBA and the development process.

But if you can't embed the DBA in your development team, my solution is "Provide an API for your resources!"

I'm not sure how many of us can justify the delays caused by having a human interface for a resource which doesn't need one. In my opinion the notion of "raising a ticket will only take two minutes" is simply not true, at least because if the DBA is not busy to be able deal with his incoming notifications immediately (is he/she EVER?!) it would require another two minutes to deal with the issue, and another two minutes to reflect the outcome in the ticket, which is subject to all the questions I mentioned in 2.2.

This means your Dev has had an opportunity to distract himself either with another task, or Instagram! Context/task switching in humans is an extremely costly factor in production, and it's by no means as efficient as context switching in CPUs.

That applies to DBAs too (surprise!) since they're not sent from above. When DBA receives tens of notifications every minute for requests which can be automated, he/she have to deal with context switching as well.

Are your reasons for keeping the human interfaces to your resources justified? When was the last time you did a soul searching and self examination to re-examine your reasons?