AIOps & Machine Learning - Extending Integrations and Event Management into Automation and Actionable Analytics

Highlights

  • What is AIOps? (aka AI ops)
  • Evanios alignment with AIOps
  • Adoption challenges
  • Live demo: centralized monitoring, common event format, event correlation, analysis scoring (trait, context & predictive analysis), visualizations (event forecast, root cause, service health status)

During this 30-minute demo you will learn:

  • What AIOps is (and isn’t)
  • Gotchas: Which architectural limitations to watch out for
  • Why automated root cause analysis, event scoring and predictive analytics are “must-haves”

Why choose Evanios?

  • Evanios is the only solution that provides robust event management, monitoring, and operations analytics (AIOps and AI ops) directly on the ServiceNow platform.
  • Packaged and supported integrations drastically reduce short- and long-term system integration costs, providing quick time to value.
  • Automated scoring identifies the true severity of an event based on tunable machine learning algorithms. This automates the detective work around events and guides operations teams to the most critical issues.
  • Quickly identify top ranked root causes for every event, and identify common root causes.

 

“We have the challenge of making sure that we become much more proactive. AI and the capabilities of Evanios are absolutely critical.”

Eveline Oehrlich, Vice President and Research Director, Forrester Research

 

“Built upon the ServiceNow platform, the Evanios approach is to fully integrate event management, automated incident resolution, predictive analytics and monitoring directly into a unified workflow. The result is an intuitive and seamless model that enables IT organizations to move past the arbitrary separation between event and incident management and instead focus on the management of services from a business value perspective.”

Charles Araujo, Analyst, Intellyx

 

[Gartner has subsequently changed their definition of AIOps or “AI ops” from “Algorithmic IT Operations” to “Artificial Intelligence for IT Operations”. See http://blogs.gartner.com/andrew-lerner/2017/08/09/aiops-platforms/ for details].


Transcript

Kai: Today we’re going to be addressing the topic of AIOps: what it is, a little bit of industry background, and specifically how Evanios fits into the category. Our presenters for today are Andy Ray and Tod Patton. My name is Kai; I’ll be moderating, and I’ll join you again at the end of the session. And now I’ll turn it over to Andy to introduce.

Andy: I’m just going to jump right into it. I’m a partner and co-founder with Evanios, and Tod is a senior architect. I’m going to walk you through how we see AIOps, where we feel we fit with AIOps, and what the alignment is. Then Tod’s going to go a little deeper with a technology demo at the end and make some of these concepts come alive.

Let me get started with what it isn’t. When you see AIOps, the initial gut reaction is to say that AI stands for artificial intelligence. Unfortunately, or fortunately, however you want to look at it, that’s not what it stands for; it actually stands for algorithmic IT operations. I think the creators of the term, Gartner, are okay with this confusion. AI is definitely a hot topic, but it’s not the same thing. At the same time, there really are parts of AIOps that start bleeding into artificial intelligence concepts, mainly in the machine learning space. But really, we’re defining it to mean something a little bit different.

So, let’s talk about what it is, with the definition directly from Gartner. AIOps stands for algorithmic IT operations. It was coined by Gartner Research in recent papers, originally by Colin Fletcher. It’s been adopted by the industry as a category definition and co-opted by other analysts. AIOps platforms really represent the evolving and expanded use of technologies previously categorized as ITOA, or IT operations analytics. Really, when we talk about AIOps, we’re talking about a category of software: a way to group things, a way to look at similar pieces of software and similar capabilities from a comparison perspective. The category usually includes things like vendor-agnostic data ingestion (the ability to take things from multiple sources), data storage (big data storage), an analytical engine (typically a machine learning concept), and also visualization. There are some definitions from Gartner around how those should work, but at a high level, it’s a category with individual pieces.

Here’s how the Evanios architecture aligns with that category. Starting at the top, we have the system of record and the visualization. It’s a big belief at Evanios, part of the core tenets of our design, that your system of record, where you do ITSM, where you keep your system of engagement, really needs to be inseparable from your ITOM, or IT operations, world. AIOps is the same concept, and Gartner writes a little bit about how the interaction needs to happen. We believe that needs to be incredibly seamless. If you currently use ServiceNow, for example (our application is focused on ServiceNow), the idea is that your visualization would be presented right next to your ITSM data, directly within your system of record. You also have the opportunity to exchange data with the system of record, which I’ll talk about in a bit, to get some of that change data and impact data out of your system of record.

Automation: we have the ability to trigger multiple types of automation as a response to events or insights, either through the framework of your choice, such as ServiceNow orchestration or IBM dynamic ops, or through things like automatically triggering incidents and auto-closing or auto-resolving them using internal systems.

The next part is the really exciting, big part that you hear a lot about: machine learning algorithms. Our algorithms are a reflection of our experience. We have unique algorithms to score events, which means identifying the event’s true priority, as well as to predict events, providing proactive notifications for critical events where possible. These algorithms can be extended by the client to meet your unique requirements, or the machine learning can actually expand the algorithms itself. There’s a lot of capability as far as machine learning goes, and it’s all built on top of the system of record and integrated with the other core components.

Data storage is a critical part of an AIOps system. We have a big data back end that allows us to store a large history of metric data and prediction data.

We have a robust integration API. This was really important to us as we were building. We have a very open API to tie in any monitoring system or any system that generates data, and we also have packaged out-of-the-box integrations. We have about 40 or more today for the popular monitoring solutions, which makes it very easy to get that sort of acceleration into the platform.

The last part is that we also have monitoring built directly into the platform. While we can tie directly into the monitoring ecosystem that already exists, we don’t want you to go backwards as far as maturity. We can tie into what you have, and we can also augment what you don’t have, fill in some gaps, or even ingest metrics from other monitoring systems, pulling metric data and event data into the same system to start drawing conclusions and insights.

At a high level, this is our alignment with the AIOps category. These are the core pieces that Gartner says an AIOps system or platform needs to have, and this is how we address them. By now you’re probably asking yourself, “Great, another marketing term from Gartner. My CTO is probably going to ask me about this. How does it actually affect my life? Why should I care about this? My job is already pretty difficult.” I’m going to go through a couple of different reasons why I think AIOps is important to you.

I think the first reason is you need it. Things are more complex than ever before. If you think about the evolution of IT (virtualization, SaaS, cloud, elastic, IoT, you name it), there’s a massive amount of data being generated from all these new technologies, leading to an exponential number of events, alerts, and metrics to process. At the same time, user expectations for availability and performance couldn’t be any higher. It’s the consumerization of IT: our users use Google, they see that it’s up 24/7, and they expect their internal services to have the same quality of service. Our take on it is that the old tools and approaches aren’t up to the challenge of managing this new world. From an AIOps perspective, this is something you definitely need to look at, because I think it’s something we need as an industry.

No. 2: Chuck Norris says there’s real tangible benefit, and if you guys know Chuck, he knows this stuff, right? Per Gartner, one benefit is making the Holy Grail, the single pane of glass, a reality across multiple technology stacks. I thought it was interesting that Gartner called the single pane of glass ‘the Holy Grail’. It is difficult to get into place in a lot of environments, but that’s a key value for AIOps.

Secondly, we talked about all the data coming out of the modern IT organization. You need to drastically reduce the alert noise coming out of that. To deal with that glut of alerts and data, you’re going to need to go beyond manual methods: more automation and more machine learning to help you. And you’re going to need more tooling to help you understand where your team should be focusing. Another tangible benefit is faster diagnosis and response. Forrester talks a lot about ‘mean time to know’ as the longest pole of MTTR. I love that term; I feel like it makes a whole lot of sense. It’s an easy way to represent that chaos when you’re trying to figure out what the issue actually is. I’m going to put labor on this to fix it, I’m going to call somebody, but which guy do I call? My genius guy is on vacation, and it’s not reasonable to call everybody. The idea is, ‘What’s the actual issue? Who should I call to get this thing fixed?’ That ‘mean time to know’ is really where AIOps is squarely focused: on making that easier and faster. There’s definitely real tangible benefit behind AIOps and the methodologies behind it.

And lastly, it’s a trend in the industry. It’s definitely making waves; all the cool kids are doing it. Gartner’s estimate is that 25% of global enterprises will have some sort of strategy around AIOps, and will be implementing it in some fashion, by 2019, and by 2020 they estimate that at 50%. I don’t know if those numbers make sense, but the good news is that you likely already have a head start. You are probably already doing parts of AIOps, like monitoring or event management. But maybe you’re not connecting all the dots in an AIOps context. You’re not actually using machine learning or actually putting the pieces together. That’s where AIOps really comes in, as the newer technologies help you do that.

Alright, we talked about why you should care; now we’ll talk about what some of the challenges might be around adoption. It’s always interesting to put up here how you can fail. The idea is that we need to get ahead of these issues and mitigate them. I’m going to go through a couple of the issues that we’ve seen, but this is not an exhaustive list. There are many ways you can be challenged with this within large, complex organizations. At the same time, these are things you definitely need to keep an eye on.

The first thing is poor integration; that’s one of your biggest challenges. The phrase ‘garbage in, garbage out’ works pretty well. If we have bad data coming in, it’s hard to produce any reasonable insights out of it. But it also holds that if you get nothing in, you get nothing out. If, at the end of the day, you’re not tying into the critical systems that you actually need that data from, you don’t have a complete picture. There’s no possible way that, on the top end, the analytics can make sense for you or deliver the value you’re expecting. I think that integration level is absolutely key. I put up here ‘you need good data, lots of it’. It’s also important that when you get that data in, you normalize it and set a quality level where it’s usable.

The next piece is misaligned expectations. This actually came from Gartner, which I thought was great. They listed it as the number one AIOps implementation challenge: ‘separating the marketing jargon from actual capability’. I thought that was kind of interesting. As a consumer, it’s important for you to actually do your homework. You need to test your vendors. The good news is this is done fairly easily. It’s not like the old days, where we had to allocate $20k worth of hardware, get it all racked and stacked, and then get a consultant out to install the software and all those pieces. Most of these products, like Evanios, have a cloud side. You can actually be up in a few hours. The reality is most of these platforms have some sort of SaaS component. You should be able to try this out and check it out within a day. I would say if you can’t get it done within a day, maybe you should look at another product; maybe it’s a little more complex than needed, probably not something kosher. We welcome the challenge at Evanios for you to test us, and all the other vendors should expect to be tested as well.

Last part here: misplaced fear. The idea that maybe a solution will eliminate my job, or maybe the last tool that we bought that made these kinds of promises didn’t work very well, or just general reluctance to change. These are the kinds of things you always hit with these types of projects. The job one is actually kind of interesting; we’ve done a lot of research on our site about that. Our take on it is that as AIOps solutions and more automation come into the IT operations center, the operator’s job will become more cognitively demanding. AIOps is a power tool they’ll be using, and it actually raises the value of the organization. The idea is, instead of maybe doing simple manual labor, they’re actually stepping up and having to do more troubleshooting and tying together multiple correlation and pattern insights. I think it’s actually a great thing for labor. There’s an awesome TED talk on a similar subject by David Autor called ‘why are there still so many jobs’. He’s basically talking about the impact of automatic tellers on the banking industry, and how there are still plenty of banking employees, many more now than there were before the automatic teller was implemented. It talks about what those people do and how the jobs changed. It’s a good analogy for how we work in IT.

As far as the other ones, ‘the previous product didn’t perform’ and ‘this would never work here’: POCs, demonstrations, and talking to reference customers and analysts all help. The only thing I can advise there is to put your stakeholders, those fearful people, in the boat with you really early, so they’re involved in that alignment and they can see and test what they’re going to get. I think that helps everybody get on the same page and set reasonable expectations for these types of projects. I’m a big believer in momentum: start with a small project and have success with that, then move to the next one, and the next one, etc. Get them on board for that first project where they say, ‘yes, this works great, it can’t possibly fail’. Do that and then go to the next level.

Alright, let’s talk about how Evanios accelerates AIOps implementation: specifically about us, what we do differently, and how we see things. AIOps is built on the familiar foundations of integration, monitoring, deduplication, and correlation that have been around for a long time. In order to get that higher-level functionality, you have to make sure that you have those basics right and that it’s architected well from the bottom up. Some of the ways our platform helps you do that include:

  • Easy integration with just about any tool you can think of
  • Metric and event ingestion capability
  • Robust normalization to a common event format, while still retaining the raw data for drill-down
  • Mature deduplication and event correlation. We have some unique scoring technology that is really interesting; it’s a way to present insights in a very simple way.
  • A very tight integration with the system of record, which allows AIOps and ITSM to be inseparable and to work interchangeably.
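To make the normalization and deduplication bullets above concrete, here is a minimal Python sketch. It is an illustration only, not the Evanios implementation: the `CommonEvent` fields, the severity scale, the field maps, and the two sample payloads from "tool A" and "tool B" are all assumptions invented for this example.

```python
from dataclasses import dataclass, field

# Assumed severity scale for the sketch: 1 = critical ... 5 = informational.
SEVERITY_MAP = {"critical": 1, "error": 2, "major": 2,
                "warning": 3, "minor": 4, "info": 5}


@dataclass
class CommonEvent:
    """Hypothetical common event format shared by every source."""
    source: str
    node: str
    resource: str
    severity: int
    message: str
    raw: dict = field(default_factory=dict, repr=False)  # raw payload kept for drill-down
    count: int = 1  # how many duplicates were folded into this event


def normalize(source, payload, field_map):
    """Map one tool-specific payload onto the common format.

    field_map maps common field names to that tool's field names.
    """
    def pick(name):
        return payload.get(field_map[name], "")

    severity = SEVERITY_MAP.get(str(pick("severity")).lower(), 5)
    return CommonEvent(
        source=source,
        node=str(pick("node")).lower(),
        resource=str(pick("resource")),
        severity=severity,
        message=str(pick("message")),
        raw=dict(payload),
    )


def deduplicate(events):
    """Fold events with the same node, resource, and severity into one."""
    unique = {}
    for ev in events:
        key = (ev.node, ev.resource, ev.severity)
        if key in unique:
            unique[key].count += 1
        else:
            unique[key] = ev
    return list(unique.values())


# Two made-up payload shapes describing the same problem.
tool_a = {"host": "DB01", "service": "replication",
          "state": "CRITICAL", "output": "Replication failed"}
tool_b = {"object": "db01", "component": "replication",
          "level": "Critical", "text": "Replication stopped"}

events = [
    normalize("tool_a", tool_a, {"node": "host", "resource": "service",
                                 "severity": "state", "message": "output"}),
    normalize("tool_b", tool_b, {"node": "object", "resource": "component",
                                 "severity": "level", "message": "text"}),
]
active = deduplicate(events)
```

Keeping the untouched payload in `raw` mirrors the point about retaining raw data for drill-down; the choice of deduplication key is a design decision that a real deployment would tune.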

We have a lot of ways we can help you do that, and all of this helps you create a single pane of glass (which Gartner referred to as a holy grail) very quickly. At that point, it becomes very easy to layer on our analytics and create meaningful correlation insights like root cause analysis, impact analysis, and prediction. That’s kind of it for me as far as laying out the context of Evanios: what we do and how we accelerate AIOps. I’m going to turn it over to Tod to show you some of this in a quick demo.

Tod: All right, Andy mentioned our robust integrations. This is just a list of our packaged integrations. We have integrations for all the major monitoring tools; whether they’re on premise or cloud-based doesn’t really matter. We have open APIs for any custom monitoring tools. These are our packaged ones; they are downloadable and configurable. They would fit into any POC and are very easily installed and configured.

This is our demonstration ServiceNow instance, and this is our application in a dashboard view on the ServiceNow platform. Our application is installed and lives inside of ServiceNow, and this dashboard is showing lots of different data from lots of different sources; some of them are us, and some of them are ServiceNow directly. The news widget, for example, is a standard ServiceNow offering. This prediction view in the upper left is our prediction of events: events by application and events by host. We are also pulling in raw monitoring data, which these two graphs represent, and I also have a list of problems and a list of incidents from the problem management system and the incident management system. I can bring all this together in one dashboard because we live inside of ServiceNow. I have access to the CMDB, I have the ability to create incidents, and I have access to the change management system to potentially look for blackouts. These are all just features and functionality that are available to us because we live inside of ServiceNow. And that’s kind of event management 101.

For the remainder of the time, I want to show some of those higher-end features that Andy talked about in our application. We monitor Windows, Linux, Oracle, and SQL, and we pull all that through an agentless solution. As we live inside ServiceNow, we look like everything else in ServiceNow, where this is just a table of records. We put a very fancy GUI in front of it to make sure that it’s real time, and we’ve added a lot of features to this. These are events that I’ve received from all of my different monitoring tools; I have all these different tools integrated into my environment. You can see that we have criticality and repeated events, and the severity is all color-coded.

The first thing I want to talk about is the analysis scoring; this column here is showing the analysis score. That’s a key differentiator for us and something that our customers use on a regular basis to drive their operations. The higher the analysis score, the more relevant the event should be, and while we are doing this analysis behind the scenes, we make it transparent and configurable. We’re transparent in that all of the traits that we use for the analysis are in a table, and you can see the records, adjust a record, turn different traits on or off, or add new traits based on your use cases. The other place where we’re transparent is in what traits were attached to this event. These three traits matched, and the descriptions are very readable in plain English: multiple business services affected (how many? 51), causes incident creation 30% of the time (okay, that sounds relevant), and it’s a critical event. Here’s the score adjustment for each of these matching traits. We have twenty out-of-the-box traits that we provide to you, and only three of them were applicable in this case. This gives us a score of 380, but the overall score was 30,000, so there must be something else going on.

That something else is right at your fingertips as well: predictive analysis. Our prediction engine is very easy to explain, and while it is dependent on machine learning, it’s not voodoo, it’s not magic. I can explain it to you in a few sentences, and you’ll understand the concept. This is a good example. I have my event; this information up on top is in the common event format, and the SCOM details are available to me here as well. But back to the prediction. How we predicted this event is that we have a data store of event signatures. We take a signature of every event we receive, and we build relationships between those events based on time. If we see an event happen and then see another event happen after it, it links those together. As we see those linkages happen more often, the probability goes up, and when the probability gets high enough, we know that those two events are related. In this case, replication fails, and when replication fails, the SQL database has an issue. We’ve seen this happen 12 times in the past. We know that this has happened in the past with regularity, so we can predict that it is going to happen in the future: when we get the replication issue, we’re going to get the SQL database issue. We have a visualization for that in our forecast view. These are all the events we’ve predicted. There’s a grid of time versus probability, and if you click on any particular dot, you’ll see the leading indicator, which is an actual event that we received. Then you’ll see the predicted event that we think is going to happen because we’ve seen it in the past. There’s some kind of relationship in our database.
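The signature-linking idea Tod describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the Evanios prediction engine: the `SignatureLinker` class, the five-minute window, and the probability and occurrence thresholds are all invented for the example. The probability here is simply how often a follower signature was seen within the window after a leading signature, divided by how often the leading signature was seen.

```python
from collections import defaultdict


class SignatureLinker:
    """Toy time-based linker between event signatures (hypothetical)."""

    def __init__(self, window_seconds=300, min_probability=0.8, min_occurrences=5):
        self.window = window_seconds
        self.min_probability = min_probability
        self.min_occurrences = min_occurrences
        self.seen = defaultdict(int)      # signature -> times observed
        self.followed = defaultdict(int)  # (leader, follower) -> linked occurrences
        self.recent = []                  # recent events still inside the window

    def observe(self, timestamp, signature):
        """Record an event and link it to earlier events inside the window."""
        self.recent = [e for e in self.recent if timestamp - e["t"] <= self.window]
        for earlier in self.recent:
            # Count each follower signature at most once per leading event.
            if earlier["sig"] != signature and signature not in earlier["linked"]:
                self.followed[(earlier["sig"], signature)] += 1
                earlier["linked"].add(signature)
        self.seen[signature] += 1
        self.recent.append({"t": timestamp, "sig": signature, "linked": set()})

    def predict(self, signature):
        """Return (follower, probability) pairs likely to come after signature."""
        total = self.seen.get(signature, 0)
        predictions = []
        for (leader, follower), linked in self.followed.items():
            if leader != signature or linked < self.min_occurrences:
                continue
            probability = linked / total
            if probability >= self.min_probability:
                predictions.append((follower, probability))
        return sorted(predictions, key=lambda p: p[1], reverse=True)


# Replay a history like the one in the demo: replication fails, and a minute
# later the SQL database reports an issue, twelve times over.
linker = SignatureLinker()
for i in range(12):
    start = i * 3600
    linker.observe(start, "replication_failed")
    linker.observe(start + 60, "sql_database_issue")

# A new replication failure arrives; ask what is likely to follow.
linker.observe(12 * 3600, "replication_failed")
forecast = linker.predict("replication_failed")
```

After the thirteenth replication failure, the follower has been linked 12 times out of 13 leader sightings, which clears the assumed 0.8 threshold, so the database issue appears in `forecast`. A production engine would also need to persist signatures, age out stale links, and choose the window per signature.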

You can see those relationships with this interactive view; this is just a view of those relationships. While we’re predicting events that are going to happen in the future, we can also show you real-time events in your environment that are potential root causes, with cause events and effect events, and graphically display these using the same information. The same signatures that we store in our database drive this particular view as well.
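Returning to the analysis score from earlier in the demo: the transparent, configurable trait table can be pictured as a list of rules, each with a score adjustment, that can be switched on or off. The Python sketch below is an assumption-laden illustration, not the product's scoring logic. The `Trait` class, the event fields, and the individual adjustments (200, 100, and 80) are invented so that the three matching traits add up to the 380 mentioned in the demo.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trait:
    """One row of a hypothetical trait table."""
    name: str
    adjustment: int                 # score added when the trait matches
    matches: Callable[[dict], bool]
    enabled: bool = True            # traits can be turned on or off


def score_event(event, traits):
    """Sum the adjustments of every enabled trait that matches the event."""
    matched = [t for t in traits if t.enabled and t.matches(event)]
    return sum(t.adjustment for t in matched), [t.name for t in matched]


# Invented adjustments; only the trait descriptions come from the demo.
traits = [
    Trait("Multiple business services affected", 200,
          lambda e: e.get("services_affected", 0) > 1),
    Trait("Causes incident creation 30% of the time", 100,
          lambda e: e.get("incident_rate", 0.0) >= 0.30),
    Trait("Critical event", 80,
          lambda e: e.get("severity") == "critical"),
    Trait("Occurred during a change blackout", 50,
          lambda e: e.get("in_blackout", False)),
]

event = {"severity": "critical", "services_affected": 51, "incident_rate": 0.30}
trait_score, matched_names = score_event(event, traits)
```

Because each trait is plain data plus a predicate, adding a trait for a new use case or disabling an existing one does not require touching the scoring function, which is the kind of transparency the demo emphasizes.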

I have a few more things to show, and then we’ll open up for questions. Because we have access to the CMDB, a lot of our customers do service mapping. These are business services that are defined in my CMDB; most of these are just canned, out of the box. But we can represent these, attach them to events, and show where there are potential issues. In this example, I have 82 total business services, but 21 of them have a critical event attached. And we actually do business service mapping: we draw the business services based on information that’s in the CMDB, and we do this all on our own. ServiceNow has its own maps, but we didn’t want to be dependent on that, so we built this technology to draw the maps ourselves, and therefore we can color-code them based on the severity of the events that are attached to specific CIs. That explains a couple of things here: we attach CIs to events, and we can also represent these business maps and show the criticality of the events that we received. This is a real-time picture of what’s going on with this particular business service. Solid red boxes have critical events attached to them. Things with red outlines are perceived degradation: we know something downstream is having a problem, so we perceive that it is having an issue.
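The coloring rule for the service map (solid red for a CI with a critical event, red outline for a CI that depends on something critical) can be sketched as a small graph walk. This Python example is an illustration only: the dependency map, the CI names, and the status labels are assumptions, and a real map would be drawn from CMDB relationships.

```python
def downstream(depends_on, start):
    """Return every CI that start depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(start, ()))
    while stack:
        ci = stack.pop()
        if ci in seen:
            continue  # already visited; also guards against cycles
        seen.add(ci)
        stack.extend(depends_on.get(ci, ()))
    return seen


def service_health(depends_on, critical_cis):
    """Classify each CI as critical, degraded (perceived), or ok."""
    critical = set(critical_cis)
    status = {}
    for ci in depends_on:
        if ci in critical:
            status[ci] = "critical"   # solid red box in the demo
        elif downstream(depends_on, ci) & critical:
            status[ci] = "degraded"   # red outline: something downstream is critical
        else:
            status[ci] = "ok"
    return status


# Hypothetical business service: portal -> app server -> database.
depends_on = {
    "web_portal": ["app_server"],
    "app_server": ["sql_db"],
    "sql_db": [],
    "mail": [],
}
health = service_health(depends_on, critical_cis={"sql_db"})
```

Here the database is marked critical, the app server and portal that sit on top of it are marked degraded, and the unrelated mail CI stays ok.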

And the final piece: let’s go back to our all active events console and show what we call correlated events. If you can get to this level of maturity, you’re really doing things right. Correlation requires lots of data, as Andy mentioned, and it takes a lot of niche events from lots of different tools. Most of our customers have five or more monitoring tools; one of them is monitoring the network, and another one is monitoring their servers. Well, if you have a network outage of any sort, it’s going to look like all of your servers have gone down. You can take those disparate events and correlate them so that you have a single event. That’s the Holy Grail Andy mentioned, where we have a single pane of glass that’s receiving events from multiple different tools, and we are then able to correlate them with our correlation engine, link them together, attach them all to the same incident, and treat them as one entity in a parent-child relationship. This particular parent has five children. This is the correlation view component of our console, and that concludes the demonstration.
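The network-outage example can be sketched as a simple parent-child grouping. This is a minimal illustration under stated assumptions, not the Evanios correlation engine: the event dictionaries, the `upstream` topology map, and the two-minute window are all invented for the example.

```python
def correlate(events, upstream, window_seconds=120):
    """Group server events under a network outage on their upstream device.

    Returns (groups, standalone): groups maps a parent event id to the ids
    of its child events; standalone lists events with no parent.
    """
    outages = [e for e in events if e["type"] == "network_outage"]
    groups = {o["id"]: [] for o in outages}
    standalone = []
    for ev in events:
        if ev["type"] == "network_outage":
            continue
        parent = next(
            (o for o in outages
             if upstream.get(ev["node"]) == o["node"]
             and abs(ev["time"] - o["time"]) <= window_seconds),
            None,
        )
        if parent is None:
            standalone.append(ev["id"])
        else:
            groups[parent["id"]].append(ev["id"])
    return groups, standalone


# Hypothetical topology: five servers sit behind one switch.
upstream = {f"server{i}": "switch1" for i in range(1, 6)}
upstream["db9"] = "switch2"

events = [{"id": "net-1", "type": "network_outage", "node": "switch1", "time": 1000}]
events += [
    {"id": f"srv-{i}", "type": "server_down", "node": f"server{i}", "time": 1000 + 10 * i}
    for i in range(1, 6)
]
events.append({"id": "db-9", "type": "disk_full", "node": "db9", "time": 1020})

groups, standalone = correlate(events, upstream)
```

The single switch outage becomes the parent of the five server-down events, which could then be attached to one incident, while the unrelated disk event on another switch stays on its own.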

Andy: So, I love action items, and here’s what I’d recommend as action items for you as the attendee. I would say:

  1. Develop a strategy around AIOps. You need to figure out whether this is something you can get value out of. Is it something you need? We’re happy to help you with that. There are consultants who can help you with that, and there are analysts who can help you with that as far as a strategy. But at a high level you need to look at it, and in general, if you’re a medium or large organization, I don’t feel that you can afford not to have some sort of strategy around AIOps.
  2. Survey your current tools and capabilities, and identify gaps. Can you get there from here using your current tooling? If you can, great; make sure that you’re chartering the project and the appropriate people to get there. If you can’t, as we mentioned before, start engaging some vendors for proofs of concept and see what they can do and how they can do it. You may also want to ask, if you have current tooling in house, whether you can cost-effectively get where you want to be. Do you want to put a lot of money into an older system that maybe hasn’t performed for you in the past, or do you want to look at something new and transfer over now, before you invest further in the old one?
  3. Get started. I love the quote ‘version one is better than version none’. I love momentum. Start a small project that says: for this application, we’re going to focus on prediction. Or, for this particular business service, we’re going to focus on correlation. This is the value we think we need, or this is the use case that we need. When that goes well, you can take that momentum, present it to the executives, and drive to the next level. That’s really the way to run projects in 2017. So I recommend starting small and getting started sooner. Kai says we’ve got some questions, so, Kai, if you want to take over.

Kai: Yeah, we do. Andy, the first one is: is AIOps a rip-and-replace strategy?

Andy: The answer is definitely no. If you already have monitoring tools you like, you should be able to leverage those. One of the key components of an AIOps platform, even as defined by Gartner, is vendor-agnostic data ingestion. If you’re dealing with a vendor who’s flying the AIOps flag and is telling you to rip everything out and replace it with their tool, they’ve missed something. I would question anyone who asks you to go backwards in maturity in order to go forward. Don’t get me wrong: if you do replace something, you just have to make sure it has sound financial and business justification. But the short answer is no, it’s not a rip-and-replace strategy. It needs to augment the maturity you already have.

Kai: This one is actually similar, Andy, and it’s specific to Evanios, since it’s about integrating with home-grown monitoring tools. We hear that quite a bit.

Andy: Yeah, sure, we do it all the time: home-grown monitoring tools, one-off monitoring tools, industry-specific monitoring tools, custom-built applications. We have a very robust API that can tie into any data source. Really, the history of that API is our team doing consulting over the years around event management, event correlation, and monitoring. We basically took every API we had ever seen used to integrate a monitoring tool into an event tool or data aggregation tool, and we built that into our API. So we have everything from SMTP, TCP, UDP, log file, and Windows Event Log to REST and SOAP. You name it: if you can get the tool to expose data in some way, we can definitely collect it. That’s the API that we built our packaged integrations on top of. Those same power tools that allow us to build integrations very rapidly are available to you, and to our consulting team, to work with.

Kai: Thanks. I’m going to fit one more in, because I think that this applies to basically everybody, if I speak based on my own experience. This person wrote in asking how useful AIOps, or even analytics, would be if you don’t have a mature CMDB.

Andy: Yeah, this is kind of an interesting thing. Realistically, there are plenty of Fortune 500 companies that operate really highly available environments without a CMDB. I know that sounds bad to say, but the reality is that it’s just another area where you need to work on maturity. It’s very helpful to have a CMDB, and it’s very helpful to have a quality CMDB. It’s definitely something you want to strive for, but I don’t believe it’s something that you want to stop all other forward momentum to get. It’s not ‘do the CMDB first, and then let’s do everything else after that; once it’s perfect, we’ll move on’. The way you want to approach it is either to go application by application or to run parallel efforts. How can we move all the bars up at once? We need to move the CMDB maturity up, the event management maturity up, and the analytics maturity up, all at one time. Because if not, I think you’re going to be caught in a sort of chicken-and-egg situation where you’re just constantly waiting for the CMDB.

Kai: Fantastic, thank you
