
How We Leveraged Splunk to Solve Real Network Challenges


May is Observability Month—the perfect time to learn about Splunk and Observability. Find out more in our latest episode of “What’s new with Cisco U.?” (Scroll to the end of the blog to watch now!)


As part of the Cisco Infrastructure Operations team, we provide the interactive labs that users run on Cisco U. and use in instructor-led courses through Cisco and Cisco Learning Partners. We currently run two data centers that contain the delivery systems for all those labs, and we deliver thousands of labs daily.

We aim to deliver a reliable and efficient lab environment to every student. A lot is going on behind the scenes to make this happen, including monitoring. One important way we track the health of our infrastructure is by analyzing logs.

When picking infrastructure and tools, our philosophy is to “eat our own dog food” (or “drink our own champagne,” if you prefer). That means we use Cisco products everywhere possible: Cisco routers, switches, and servers; Cisco Prime Network Registrar; Cisco Umbrella for DNS management; and Cisco Identity Services Engine (ISE) for authentication and authorization. You get the picture.

For some of our log analysis, specifically tracking lab delivery, we had been using third-party software. Our lab delivery systems (LDS) are internally developed and emit log messages that are entirely unique to them. We started using Elasticsearch several years ago, with almost zero prior experience, and it took many months to get our system up and running.

Then Cisco bought Splunk, and Splunk was suddenly our champagne! That’s when we made the call to migrate to Splunk.

Money played a role, too. Our internal IT at Cisco had begun offering Splunk Enterprise as a Service (EaaS) at a price much lower than our externally sourced Elasticsearch cloud instances. With Elasticsearch, we had to architect and manage all the VMs that made up a full Elastic Stack, but using Splunk EaaS saved us a lot of time. (By the way, anyone can develop on Splunk Enterprise for six months free by registering at splunk>dev.) The catch: we started with very little prior Splunk training.

We had several months to transition, so learning Splunk was our first goal. We didn’t focus on just that single use case. Instead, we sent all our logs, not just our LDS logs, to Splunk. We configured routers, switches, ISE nodes, ASAs, Linux servers, load balancers (nginx), web servers (Ruby on Rails), and more to send their logs. (See the Appendix for more details on how we got the data into Splunk Enterprise.)

We were basically collecting a kitchen sink of logs and using them to learn more about Splunk. We needed basic development skills like using the Splunk Search Processing Language (SPL), building alarms, and creating dashboards. (See Resources for a list of the learning resources we relied on.)

Network equipment monitoring

We use SNMP to monitor our network devices, but we still have many systems from the configure-every-device-by-hand era, so the configurations are all over the place, and the old NMS UI is clunky. With Splunk, we built an alternate, more up-to-date system with straightforward logging configurations on the devices. We used Splunk Connect for Syslog (SC4S) as a pre-processor for the syslog-style logs. (See the Appendix for more details on SC4S.)
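
On the device side, the configuration really is minimal. As a rough sketch, pointing a Cisco IOS or IOS XE device at an SC4S instance looks something like the following (the SC4S address, logging level, and source interface are placeholders, not our actual values):

  ! Hypothetical example: send syslog to an SC4S instance at 192.0.2.10
  logging host 192.0.2.10 transport udp port 514
  logging trap informational
  logging source-interface Loopback0
  service timestamps log datetime msec show-timezone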

Once our router and switch logs arrived in Splunk Enterprise, we started learning and experimenting with Splunk’s Search Processing Language. We were off and running after mastering a few basic syntax rules and commands. The Appendix lists every SPL command we needed to complete the projects described in this blog.

We quickly learned to build alerts; this was intuitive and required little training. Almost immediately, we received an alert about a power supply: someone in the lab had accidentally disconnected a power cable. The time between receiving the first logs in Splunk and having a working alarm was very short.
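
To give a sense of how little SPL such an alert needs, here is a sketch of the kind of scheduled search involved. The index name, sourcetype, and message keywords are assumptions for the example; the exact syslog mnemonics vary by platform:

  index=netops sourcetype=cisco:ios ("PWR" OR "POWER_SUPPLY" OR "PLATFORM_ENV")
  | stats count AS events, latest(_raw) AS latest_message BY host

Saved as an alert that triggers whenever the search returns results, this is enough to catch an unplugged power supply on the next scheduled run.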

Attacks on our public-facing systems

Over the summer, we had a suspicious meltdown on the web interface for our scheduling system. After a tedious time poring over logs, we found a large script-kiddie attack on the load balancer (the public-facing side of our scheduler). We solved the immediate issue by throttling connections from the load balancer to the internal systems.

Then we investigated further by uploading archived nginx logs from the load balancer to Splunk. This was remarkably easy with the Universal Forwarder (see Appendix). Using those logs, we built a simple dashboard, which revealed that small-scale, script-kiddie attacks were happening all the time, so we decided to use Splunk to proactively shut these bad actors down. We mastered the valuable stats command in SPL and set up some new alerts. Today, we have an alert system that detects these attacks and a rapid-response process for blocking the sources.
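
The heart of that dashboard is a stats aggregation over the nginx access logs. A minimal sketch, assuming the access-log fields have been extracted as clientip and uri (the index, sourcetype, and thresholds below are illustrative):

  index=web sourcetype=nginx:access
  | stats count AS hits, dc(uri) AS distinct_uris BY clientip
  | where hits > 500 OR distinct_uris > 200
  | sort - hits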

Out-of-control automation

We looked into our ISE logs and turned to our new SPL and dashboard skills to help us quickly assemble charts of login successes and failures. We immediately noticed a suspicious pattern of login failures by one particular user account that was used by backup automation for our network devices. A bit of digging revealed the automation was misconfigured. With a simple tweak to the configs, the noise was gone.
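
Charts like these take only a line or two of SPL apiece. As a sketch (the index name is an assumption, as is the assumption that the ISE user field is extracted as UserName):

  index=ise ("Passed-Authentication" OR "Failed-Attempt")
  | eval outcome=if(searchmatch("Failed-Attempt"), "failure", "success")
  | timechart span=1h count BY outcome

Splitting the failures by UserName instead of outcome is the kind of view that makes a single misbehaving account stand out.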

Human slip-ups

As part of our data center management, we use NetBox, a tool purpose-built for network documentation. NetBox models dozens of object types, including hardware devices, virtual machines, and network constructs like VLANs, and it keeps a change log for every object in its database. In the NetBox UI, you can view these change logs and do some simple searches, but we wanted more insight into how the database was being changed. Splunk happily ingested the JSON-formatted data from NetBox, with some identifying metadata added.

We built a dashboard showing the kinds of changes happening and who is making the changes. We also set an alarm to go off if many changes occurred quickly. Within a few weeks, the alarm had sounded. We saw a bunch of deletions, so we went looking for an explanation. We discovered a temporary worker had deleted some devices and replaced them. Some careful checking revealed incomplete replacements (some interfaces and IP addresses had been left off). After a word with the worker, the devices were updated correctly. And the monitoring continues.
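
The change-rate alarm is a good example of how short these searches stay. A sketch, assuming each NetBox change record lands in an index named netbox with the changelog’s action and user fields extracted (names and thresholds are illustrative):

  index=netbox sourcetype=netbox:changelog
  | bin _time span=15m
  | stats count AS changes, count(eval(action="delete")) AS deletions BY _time, user
  | where changes > 50 OR deletions > 10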

Replacing Elasticsearch

Having learned quite a few basic Splunk skills, we were ready to work on replacing Elasticsearch for our lab delivery monitoring and statistics.

First, we needed to get the data in, so we configured Splunk’s Universal Forwarder to monitor the application-specific logs on all parts of our delivery system. We chose custom sourcetype values for the logs and then had to develop field extractions to get the data we were looking for. The learning time for this step was very short! Basic Splunk field extractions are just regular expressions applied to events based on the given sourcetype, source, or host, and they are evaluated at search time. The Splunk Enterprise GUI provides a handy tool for developing those regular expressions. We also used regex101.com to develop and test them. We built extractions that helped us track events and categorize them based on lab and student identifiers.
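
For a sense of what this looks like in practice, here is a hypothetical search-time extraction. The sourcetype, log format, and field names are invented for the example, not our real ones:

  # props.conf (search-time field extraction for a custom sourcetype)
  [lds:scheduler]
  EXTRACT-lab_fields = lab_id=(?<lab_id>[A-Za-z0-9_-]+)\s+student=(?<student_id>\S+)

The same regular expression can be prototyped interactively with the rex command before being saved as an extraction.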

We sometimes encounter issues related to equipment availability. Suppose a Cisco U. user launches a lab that requires a particular set of equipment (for example, a set of Nexus switches for DC-related training), and there is no available equipment. In that case, they get a message that says, “Sorry, come back later,” and we get a log message. In Splunk, we built an alarm to track when this happens so we can proactively investigate. We can also use this data for capacity planning.
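
The alert behind this is another short scheduled search. A sketch, assuming the application logs a recognizable phrase and that a lab identifier field has been extracted (both names are placeholders):

  index=lds sourcetype=lds:scheduler "no available equipment"
  | stats count AS misses BY lab_id

Set to trigger on any result, it tells us about shortages as they happen; run over a longer time range, the same search doubles as a capacity-planning report.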

We needed to enrich our logs with more details about labs (like lab title and description) and more information about the students launching those labs (reservation number, for example). We quickly learned to use lookup tables. We only had to provide some CSV files with lab data and reservation information. In fact, the reservation lookup table is dynamically updated in Splunk using a scheduled report that searches the logs for new reservations and appends them to the CSV lookup table. With lookups in place, we rebuilt all the dashboards we had in Elasticsearch, and more. Building dashboards that link to one another and to reports was particularly easy. Our dashboards are much more integrated now and allow for perusing lab stats seamlessly.
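
Two pieces of SPL carry most of the weight here; the lookup name, field names, and match phrase below are illustrative. The first enriches lab events from a CSV lookup:

  index=lds sourcetype=lds:scheduler
  | lookup lab_catalog lab_id OUTPUT lab_title, lab_description
  | stats count BY lab_title

The second is the scheduled report that appends newly seen reservations to the lookup file:

  index=lds sourcetype=lds:scheduler "reservation created"
  | dedup reservation_id
  | table reservation_id, student_id, lab_id
  | outputlookup append=true reservations.csv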

As a result of our approach, we’ve got some useful new dashboards for monitoring our systems, and we replaced Elasticsearch, lowering our costs. We caught and resolved several issues while learning Splunk.

But we’ve barely scratched the surface. For example, our ISE log analysis could go much deeper by using the Splunk App and Add-on for Cisco Identity Services, which is covered in the Cisco U. tutorial, “Network Access Control Monitoring Using Cisco Identity Services Engine and Splunk.” We are also considering deploying our own instance of Splunk Enterprise to gain greater control over how and where the logs are stored.

We look forward to continuing the learning journey.


Splunk learning resources

We relied on three main resources to learn Splunk:

  • Splunk’s Free Online Training, especially these seven short courses:
    • Intro to Splunk
    • Using Fields
    • Scheduling Reports & Alerts
    • Search Under the Hood
    • Intro to Knowledge Objects
    • Introduction to Dashboards
    • Getting Data into Splunk
  • Splunk Documentation
  • Cisco U.
  • Searching
    • Searches on the Internet will often lead you to answers on Splunk’s Community boards, or you can go straight there. We also found useful information in blogs and other help sites.

NetBox:  https://github.com/netbox-community/netbox and https://netboxlabs.com

Elasticsearch: https://github.com/elastic/elasticsearch and https://www.elastic.co

Appendix

Getting data in: Metadata matters

It all begins at the source. Splunk stores logs as events and sets metadata fields for every event: time, source, sourcetype, and host. Splunk’s architecture allows searches using metadata fields to be speedy. Metadata must come from the source. Be sure to verify that the correct metadata is coming in from all your sources.

Getting data in: Splunk Universal Forwarder

The Splunk Universal Forwarder can be installed on Linux, Windows, and other standard platforms. We configured a few systems by hand and used Ansible for the rest. For many systems, we were just monitoring existing log files, so the default configurations were sufficient. We use custom sourcetypes for our LDS logs, so setting those properly was the key to building the field extractions described above.
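
A minimal forwarder input for one of our delivery hosts looks roughly like this (the path, index, and sourcetype names are placeholders):

  # inputs.conf on a lab delivery host
  [monitor:///var/log/lds/scheduler.log]
  index = lds
  sourcetype = lds:scheduler
  disabled = 0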

Getting data in: Splunk Connect for Syslog

SC4S is purpose-built free software from Splunk that collects syslog data and forwards it to Splunk with metadata added. The underlying software is syslog-ng, but SC4S has its own configuration paradigm. We set up one SC4S per data center (and added a cold standby using keepalived). For us, getting SC4S set up appropriately was a non-trivial part of the project. If you need to use SC4S, allow for some time to set it up and tinker to get the settings right.
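
Most of the SC4S settings live in an environment file read by the container. A stripped-down sketch of the kind of entries involved (the URL and token are placeholders):

  # /opt/sc4s/env_file
  SC4S_DEST_SPLUNK_HEC_DEFAULT_URL=https://splunk.example.com:8088
  SC4S_DEST_SPLUNK_HEC_DEFAULT_TOKEN=00000000-0000-0000-0000-000000000000
  SC4S_DEST_SPLUNK_HEC_DEFAULT_TLS_VERIFY=no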

Searching with Splunk Search Processing Language

The following is a complete list of the SPL commands we used (a short example chaining a few of them follows the list):

  • eval
  • fields
  • top
  • stats
  • rename
  • timechart
  • table
  • append
  • dedup
  • lookup
  • inputlookup
  • iplocation
  • geostats
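
To give a feel for how these chain together, here is one hypothetical pipeline in the spirit of the load-balancer investigation: it geolocates client IPs from denied nginx requests and plots them on a map. The index, sourcetype, and field names are assumptions:

  index=web sourcetype=nginx:access status=403
  | iplocation clientip
  | geostats latfield=lat longfield=lon count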

Permissions, permissions, permissions

Every object created in Splunk (every report, alarm, field extraction, lookup table, and so on) has a set of permissions assigned to it. Take care when setting these; they can trip you up. For example, you might build a dashboard with permissions that allow other users to view it, but dashboards typically depend on lots of other objects, such as indexes, field extractions, and reports. If the permissions on those objects are not set correctly, your users will see lots of empty panels. It’s a pain, but details matter here.

Dive into Splunk, Observability, and more this month on Cisco U. Learn more

Sign up for Cisco U. | Join the Cisco Learning Network today for free.

Follow Cisco Learning & Certifications

X | Threads | Facebook | LinkedIn | Instagram | YouTube

Use #CiscoU and #CiscoCert to join the conversation.
