# Using DensityFunction for Outlier Detection in Splunk

### Introduction to Outlier Detection

In our previous blog we covered some common methods of finding outliers, from fixed thresholds to moving thresholds based on averages and standard deviation. These methods form the basis for finding data points that deviate from their norm. Standard methods of outlier detection have their pros and cons. On one hand, they are very easy to implement and make it easy to explain why a data point was flagged. On the other hand, they are too simple.

### Key Concepts In this Blog

The terminology we will use will not differ significantly from our previous blog except with the introduction of the following terms:

**Density Function Command:** The density function command is housed within the MLTK application. Most mathematicians think of the Z-score or normal distribution when a density function is mentioned. However, Splunk’s MLTK command supports 4 distributions that it attempts to fit to the data: Normal (Z-score), Exponential, Gaussian Kernel Density and Beta. The command attempts to find the best-fitting distribution for the data within each group of the split-by field.

More can be read about this command in the Splunk blog post here.

**Ensemble Methods:** Ensemble methods are a technique within the Machine Learning space that combines multiple algorithms to create an optimal model. My favourite blog channel, ‘towards data science’, explains it here.

## Using MLTK Algorithms for Outlier Detection

Splunk’s Machine Learning Toolkit hosts 3 algorithms for anomaly detection. Whilst they each provide value, some may be better suited to specific scenarios. Most if not all of the anomaly detection algorithms can be used on numerical and categorical fields. To illustrate their features, we picked the Density Function to deep-dive into here.

### DensityFunction

The density function is a popular and relatively simple method of finding outliers, assuming no changes to the default parameters. Under the hood, though, it contains many configuration options for users to fine-tune their outlier detection. Let’s cover some of the parameters I’ve come across tuning in the field:

| Parameter | Default value | Description | When to adjust |
| --- | --- | --- | --- |
| sample | false | When set to ‘true’, samples are taken from the inlier region of the distribution. | When you have a large volume of data points and need to remove initial outliers from your model. This is beneficial when you have more than a million events, as the density function has a soft limit on events. |
| full_sample | false | When set to ‘true’, samples are taken from both the inlier and outlier regions of the distribution. | When you have a large volume of data points and would like to consider outlier samples in your outlier detection model. This is beneficial when you have more than a million events, as the density function has a soft limit on events. |
| dist | auto | Identifies which distribution you would like to run. | Our suggestion is to keep this as ‘auto’. It allows MLTK to determine the best distribution based on the values of each group. Set a specific distribution only when every group is expected to follow that single distribution. |
| random_state | N/A | Seed for the random number generator, so that repeated runs are reproducible. | Use this parameter when developing your model on a fixed data set. Using the same random_state with all other configurations unchanged will return the same result. This is helpful when two or more users are working off the same data. |
| partial_fit | false | Incrementally ‘trains’ and updates the model based on new data. This is useful when you have a large volume of data and are unable to train on it in one pass. | Partial fit is helpful when you want to train your function over a period, such as 7 days or 1 month, where running a single fit command would reach an MLTK limit. |
| threshold | 0.01 | The area under the density curve that is treated as normal is 1 minus the threshold value. | Adjust this parameter to ease or restrict the criteria for detecting outliers. The smaller the value, the fewer outliers are detected (more restrictive). |
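As a rough sketch of how a few of these parameters look on the command line, the search below forces a normal distribution, relaxes the threshold and fixes the random seed. It assumes the `call_center.csv` lookup that ships with MLTK and that `norm` is the value your MLTK version accepts for the normal distribution. The model name `densityfunction_params_example` and the seed value `42` are made up for illustration.

```
| inputlookup call_center.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| eval DayOfWeek=strftime(_time, "%A")
| fit DensityFunction count by DayOfWeek into densityfunction_params_example dist=norm threshold=0.05 random_state=42
```

Incremental training would presumably look similar, with `partial_fit=true` added to the `fit` command each time a new batch of data is run against the same model name.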

In addition to these parameters, there is a ‘by’ clause within the Density Function. It gives the user the ability to group data based on a categorical field (grouping field). This grouping field can be location, intranet zone, host type, asset type, user type and much more. Let’s start building some searches to utilize the density function:

#### Base Query

The base query we will use to understand the Density Function a bit better is shown below. We are using an out-of-the-box lookup to run our query. The file ‘call_center.csv’ contains call record counts over a 2 month period, from September 2017 to November 2017. We then run eval functions to convert the timestamp and extract the HourOfDay and DayOfWeek from the data.

```
| inputlookup call_center.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| eval HourOfDay=strftime(_time, "%H")
| eval DayOfWeek=strftime(_time, "%A")
```

#### Using The Density Function – Univariate Outlier

Univariate in this context means that we will analyze outliers based on a single field of interest. For this section we picked ‘DayOfWeek’ as the grouping field. We append any new queries to our base query. This search runs the density function on the total count of activity, grouped by DayOfWeek. It gives us a list of outliers, the boundary ranges the outliers fall in and the probability density of each value. The two columns to focus on are “IsOutlier(count)” and “ProbabilityDensity(count)”. The lower the probability density, the higher the chance that the value of count on that DayOfWeek is an outlier.

`... | fit DensityFunction count by DayOfWeek into densityfunction_testmodel show_density=true`

Once we run the fit command, we can see 3 new columns created, named “IsOutlier(count)”, “BoundaryRanges” and “ProbabilityDensity(count)”.
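To look only at the rows flagged as outliers, one option is to rename the generated columns and filter on them. The sketch below is one way to do that, not the only one. The short names `is_outlier` and `density` are chosen here purely for readability, and `show_density=true` is included so that the density column is present in the output.

```
| inputlookup call_center.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| eval DayOfWeek=strftime(_time, "%A")
| fit DensityFunction count by DayOfWeek into densityfunction_testmodel show_density=true
| rename "IsOutlier(count)" as is_outlier, "ProbabilityDensity(count)" as density
| where is_outlier=1
| sort density
| table _time DayOfWeek count BoundaryRanges density
```

Sorting by `density` in ascending order puts the least likely values at the top, which are usually the first ones worth reviewing.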

Using the density function is as simple as using the fit command provided by MLTK. To use the DensityFunction algorithm effectively, I recommend testing and verifying outliers with different parameter configurations. One such parameter that you can modify is ‘threshold’. To test its impact on the result, I created the table below to analyze the outlier count for different values.

| Threshold | Outlier Count |
| --- | --- |
| 0.01 (default) | 116 |
| 0.05 | 590 |
| 0.09 | 1027 |
| 0.005 | 59 |
| 0.001 | 11 |

The outlier count was calculated with `| stats count(eval('IsOutlier(count)'=1)) as outlier_count`.

After creating the table, it becomes clear that as the threshold value decreases, the DensityFunction becomes more restrictive in finding outliers. As we increase the threshold value, the number of outliers increases because the criteria become less restrictive.
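A comparison like this does not necessarily require refitting the model for every value. As a sketch, and assuming your MLTK version allows `threshold` to be overridden when applying a saved DensityFunction model, each row of the table could be reproduced by applying the model and counting the flagged rows. The split of the count by DayOfWeek is an addition here to show where the outliers land.

```
| inputlookup call_center.csv
| eval _time=strptime(_time, "%Y-%m-%dT%H:%M:%S")
| eval DayOfWeek=strftime(_time, "%A")
| apply densityfunction_testmodel threshold=0.005
| stats count(eval('IsOutlier(count)'=1)) as outlier_count by DayOfWeek
```

Swapping `0.005` for the other threshold values and comparing the totals gives a quick feel for how sensitive the results are before settling on a value.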

### Visualizing Changing Parameters with DensityFunction

To visualize the impact of the different threshold levels, we add the ‘show_density’ and ‘show_options’ parameters to the density function as shown below:

`... | fit DensityFunction count by DayOfWeek into densityfunction_testmodel threshold=0.005 show_density=true show_options="feature_variables, split_by, params"`

After that, we use the MLTK ‘distribution plot’ visual to look at the distribution plot of the data.

In the above plot, each line represents one day of the week, Monday through Sunday. To view this with more clarity, let’s filter the visual down to a single day, as in the picture below.

```
... | fit DensityFunction count by DayOfWeek into densityfunction_testmodel threshold=0.005 show_density=true show_options="feature_variables, split_by, params"
| search DayOfWeek="Monday"
```

In the above distribution plot, we can see that on Monday, call volumes of around 900, 1100, 1250 and 1300 were detected as outliers. It might appear that these are outliers simply because they sit on the right side of the graph, as extremely high values compared to the rest of the call volumes. However, they are detected as outliers because of the low probability of these high values occurring.

Looking at the same distribution plot for Monday, the range of values between 300 and 500 has a low count as well. Notice how more of these low-count values are selected when we increase the threshold to 0.05. This is due to their lower probability of occurring in the dataset, combined with us relaxing the threshold. This concept ties back directly to the table we built in the ‘Using The Density Function – Univariate Outlier’ section.

```
... | fit DensityFunction count by DayOfWeek into densityfunction_testmodel threshold=0.05 show_density=true show_options="feature_variables, split_by, params"
| search DayOfWeek="Monday"
```
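Because the searches above leave the distribution on ‘auto’, it can be useful to check which distribution was actually fitted for each day. A minimal sketch, assuming the model was saved as `densityfunction_testmodel` as in the searches above, is to inspect it with the MLTK `summary` command:

```
| summary densityfunction_testmodel
```

The exact columns returned depend on the MLTK version, so treat the output as a starting point for understanding the shape of each group rather than a fixed schema.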

## Conclusion

The DensityFunction is a great, easy-to-use algorithm available with MLTK for finding outliers. It’s very well documented in the Splunk documentation and commonly used in the community. Some of the scenarios I’ve found it to be helpful in are listed below:

- Quick and easy use of ML to find outliers
- Use cases where anomalies in aggregate counts are used to find outliers
- Finding unexpected counts at times and hours when activity is not expected

However, one drawback (and feature) of the DensityFunction is that it does not have historic or outside context. It finds outliers based only on counts and their deviations. This can result in the same entities being selected as outliers repeatedly across multiple runs. One way to tackle this is by adding context and using scores in Splunk. We talk about this method in our blog https://discoveredintelligence.ca/reducing-outlier-noise-in-splunk/.