How to Filter Unwanted Data without adding to Splunk Daily Indexing Volume

Splunk is a great tool for consolidating, processing and analysing voluminous data of all sorts, including syslog, Windows events/WMI, etc. It’s pretty much a Google search engine equivalent for your IT environment, where you may have GBs or even TBs of raw logs and events to contend with daily. Furthermore, you’re likely to find free Splunk apps to support and analyse the events of your favourite IT appliances and applications – sort of like iOS and Android apps to entertain and educate your kids (or your bosses, in this case).
It’s great in almost every aspect except one – the daily indexing volume, which is limited by your Splunk licence. And it doesn’t come cheap – typically a high five-digit figure for a meagre single-digit GB of daily indexing volume. If you have a limited budget and an “unlimited” amount of data, you’ll have to start “rationing” and decide which types of data are of interest and value.
Of course, you can let Splunk do the auto-filtering, but unwanted data still counts toward the daily volume, as that filtering is performed after indexing. To “save” the daily volume limit, you have to filter out unwanted data before it reaches your Splunk indexer. Generally, there are two approaches: i) filter at the source devices (e.g. more specific and stringent ACL logging on Cisco IOS devices); ii) filter using regular expressions (REGEX) at a Splunk heavy forwarder (i.e. before the Splunk indexer) installed on the data source. Unwanted data is sent to the nullQueue for discard, while wanted data is sent to the index queue. I will be elaborating on the second method here. You may also want to check out this Splunk article as well.
How Splunk Moves Data through the Data Pipelines
First, we need to understand how data is consumed, processed and moved about in Splunk. You can read the full Splunk article; here I will just briefly summarise it in sequential order:

  1. Input : Data is fed into Splunk. No data processing here.
  2. Parsing : Analyse and transform data according to regex transform rules.
  3. Indexing : Splunk takes the parsed events and writes them to the search index on disk.
  4. Search :  Search through the indexed events. This is what you would see eventually.

Since both the input and parsing segments occur before indexing, and regex filtering is performed in the parsing stage, that is the stage we will focus on.
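To make the ordering concrete, here is a minimal Python sketch of how an event flows through these stages. This is purely illustrative, not Splunk code; the `parse` rule below is a stand-in for the regex transforms discussed later:

```python
# Conceptual sketch (not Splunk internals) of the four pipeline stages.
# Only events that reach the index queue are written to disk, which is
# why filtering must happen in the parsing stage, before indexing.

def parse(event):
    # placeholder parsing rule: keep only events containing "error"
    return "indexQueue" if "error" in event else "nullQueue"

def pipeline(raw_events):
    indexed = []
    for event in raw_events:          # 1. Input: data is fed in, unprocessed
        queue = parse(event)          # 2. Parsing: transforms pick a queue
        if queue == "indexQueue":     # 3. Indexing: only indexQueue events are written
            indexed.append(event)
    return indexed                    # 4. Search: queries run over indexed events

print(pipeline(["link up", "disk error", "config saved"]))  # ['disk error']
```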

Distributed Deployment
The next thing we need to understand is distributed deployment. We have to segregate the parsing transformation from the indexer to ensure that filtered-out data does not count toward the indexing volume. For scalability reasons, three different Splunk roles can be set up across two or more machines:

  1. Forwarder : The input segment occurs here. On a heavy forwarder, parsing and partial indexing can also be performed before data is sent to the Indexer.
  2. Indexer : Indexing activities are processed here. Indexed data are written to disk.
  3. Search Head : The search GUI for users. It searches across the various indexers to present the search results.

Minimally, we have to set up one heavy forwarder to handle input and parsing regex before sending the data to the indexer.

Step-by-Step Distributed Setup for Pre-Indexing Filter
One prerequisite is a distributed setup whereby the Splunk forwarder is separated from the Splunk indexer. You may want to set up two separate virtual machines for testing purposes: one designated as a dedicated input Heavy Forwarder, and the other as Receiver cum Indexer cum Search Head.
Step 1: Setup Receiver and Heavy Forwarder
On the receiving indexer, under Splunk path “etc/system/local”, add the following lines to inputs.conf (the stanza header is confirmed by the listener stanza referenced in Step 2):

[splunktcp://9997]
# you may substitute “9997” with other TCP ports
# ensure that the listener is enabled
disabled = 0

When you run “netstat -ano” on a Windows system, you should see TCP 9997 as a listening port.
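Besides netstat on the receiver itself, you can confirm that the port is reachable from the forwarder machine with a quick script. This Python sketch is just an illustration; the host address is an assumption you should replace with your receiver’s IP:

```python
import socket

# Illustrative connectivity check (not a Splunk tool): try to open a
# TCP connection to the receiver's splunktcp listening port.
def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# replace "192.0.2.10" with your receiver's actual IP address
print(port_is_open("192.0.2.10", 9997))
```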

On the Heavy Forwarder, add the following lines to inputs.conf (the monitor stanza matches the “C:\TestLog” folder used throughout this walkthrough):

[monitor://C:\TestLog]
# Specify a data source. Monitor files in the “C:\TestLog” folder, which is empty at this moment.
# specify a source type for later identification
sourcetype = cisco_syslog
disabled = 0

Step 2: Ensure communication between Heavy Forwarder and Receiver is working
You should not rely on a network ping alone! Check the splunkd log under Splunk path “var/log/splunk”. You should see this line:
01-28-2012 21:48:18.502 +0800 INFO  TcpOutputProc – Connected to idx=
Otherwise, you will see Warning or Error messages; rectify them accordingly. If you see these messages:

01-28-2012 21:42:49.338 +0800 WARN  TcpOutputFd – Connect to failed. No connection could be made because the target machine actively refused it.
01-28-2012 21:42:49.338 +0800 ERROR TcpOutputFd – Connection to host= failed

Ensure that the correct forwarder IP address is specified in the listening stanza [splunktcp://forwarder_IP:9997]. Otherwise, use [splunktcp://:9997] instead to allow inputs from any Splunk forwarder.
Step 3: Configure Data Input on Heavy Forwarder 
You’ll have to specify the data source on the heavy forwarder. For testing purposes, I created a sample syslog file with just 4 lines of sample data and copied it to the “C:\TestLog” folder created earlier.
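If you want to reproduce this test, a sample file can be generated with a short script. The message contents below are made up for illustration; all that matters is that exactly one of the four lines contains the keyword “error”:

```python
from pathlib import Path

# Illustrative only: create a 4-line sample syslog-style file like the
# one used in this walkthrough. Exactly one line contains "error".
lines = [
    "Jan 28 21:40:01 router1 %SYS-5-CONFIG_I: Configured from console",
    "Jan 28 21:40:05 router1 %LINK-3-UPDOWN: Interface up",
    "Jan 28 21:40:09 router1 %SYS-2-MALLOCFAIL: memory error detected",
    "Jan 28 21:40:12 router1 %SEC-6-IPACCESSLOGP: list 101 permitted tcp",
]
Path("sample_syslog.log").write_text("\n".join(lines) + "\n")
```

The script writes sample_syslog.log to the current directory; copy it into “C:\TestLog” for the monitor input to pick it up.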
Step 4: Test Distributed Processing is working
On the forwarder’s Splunk Web, go to Manager -> Data Inputs -> Files and Directories. Check that the “Number of Files” processed on “C:\TestLog” is incremented by one.
On the indexer/receiver’s Splunk Search app, check that all 4 lines have been indexed in distributed mode. No filtering has been done yet.
Step 5: Setup Data Filtering during Parsing Phase
You only need to work on the heavy forwarder in this step. Out of the four syslog lines, only one contains the keyword “error”, which will be used as the REGEX keyword here. To force the Splunk forwarder to send data to the parsing queue (for REGEX filtering) instead of directly into the indexing queue, add the queue = parsingQueue line to the earlier inputs.conf stanza:

[monitor://C:\TestLog]
sourcetype = cisco_syslog
queue = parsingQueue
disabled = 0

The queue = parsingQueue line causes Splunk to look up props.conf under the Splunk folder “etc/system/local”. Create this new file and add the following lines:

[cisco_syslog]
# you may replace this stanza spec with something else. In this example, I’m using sourcetype “cisco_syslog”
# to match the “C:\TestLog” input in the earlier inputs.conf
TRANSFORMS-set = setnull,setparsing

Create another file transforms.conf under the same folder and add the following lines:
[setnull]
# match anything with a single dot “.”
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue

[setparsing]
REGEX = error
DEST_KEY = queue
FORMAT = indexQueue

In this example, all lines except those containing the keyword “error” are discarded via the null queue. The order is important: make sure [setnull] is always on top, otherwise all data will be discarded via the null queue. If you want the logic to be reversed (discard all lines containing “error”), you would just need the following lines instead:

[setnull]
REGEX = error
DEST_KEY = queue
FORMAT = nullQueue
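The effect of transform ordering can be sanity-checked outside Splunk. This Python sketch is an illustration, not how Splunk is implemented: it mimics the behaviour where transforms run in the listed order and a later match overrides the queue chosen by an earlier one, which is why [setnull] must come first:

```python
import re

# Illustrative model of ordered transforms: each transform that matches
# sets the "queue" key, so the last matching transform wins.
transforms = [
    ("setnull",    re.compile("."),     "nullQueue"),   # matches any non-empty line
    ("setparsing", re.compile("error"), "indexQueue"),  # rescues lines with "error"
]

def route(event: str) -> str:
    queue = "indexQueue"  # default destination
    for _name, regex, dest in transforms:
        if regex.search(event):
            queue = dest  # a later match overrides an earlier one
    return queue

print(route("link up"))             # nullQueue
print(route("memory error found"))  # indexQueue
```

Swapping the two entries in the list (i.e. putting setnull last) sends every line to the null queue, mirroring the warning above about stanza order.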

Restart Splunk for the new configuration to take effect. One quick way is to run the CLI command “splunk restart” from the Splunk “bin” folder, or click the “Restart” button in Splunk Web manager.

Step 6: Test Pre-Indexing Filter

I copied another file (with a new file name) containing the same 4 lines. The file is consumed and sent over to the receiver. On the receiver’s Search app, examine the contents of the indexed data by searching on the source field, or simply click the link in the search summary. Only the line containing the keyword “error” is indexed; the other 3 lines were discarded via the null queue during the parsing segment at the heavy forwarder.