System Health dashboard
The System Health dashboard provides a large collection of charts that enable you to make sure that your ExtraHop system is running as expected, to troubleshoot issues, and to assess areas that are affecting performance. For example, you can monitor the number of packets processed by the ExtraHop system to ensure that packets are continuously captured.
Each chart in the System Health dashboard contains visualizations of system performance data that have been generated over the selected time interval, organized by region.
The System Health dashboard is a built-in, system dashboard that you cannot edit, delete, or add to a shared collection. However, you can copy a chart from the System Health dashboard and add it to a custom dashboard, or you can make a copy of the dashboard and edit it to monitor metrics that are relevant to you.
Note: The Administration settings page also provides status information and diagnostic tools for all ExtraHop systems.
Navigate the System Health dashboard
Access the System Health page by clicking the System Settings icon or by clicking Dashboards from the top of the page. The System Health dashboard automatically displays information about the ExtraHop system you are connected to. If you are viewing the System Health dashboard from a console, you can click the site selector at the top of the page to view data for a specific site or for all sites in your environment.
Charts on the System Health dashboard are divided into the following sections:
- Device Discovery
- View the total number of devices on your network. See which devices have been discovered and how many of those devices are currently active.
- Data Feed
- Assess the efficiency of the wire data collection process with charts related to throughput, packet rate, desyncs, and capture drops.
- Records
- View the total number of records that are being sent to an attached recordstore.
- Triggers
- Monitor the impact of triggers on your ExtraHop system. See how often triggers are running, how often they are failing, and which triggers are placing the largest load on your CPU.
- Open Data Stream and Recordstore
- Follow the activity of open data stream (ODS) transmissions to and from your system. View the total number of remote connections, message throughput, and details related to specific remote targets.
- SSL Certificates
- Review the status information for all SSL certificates on your ExtraHop system.
- Remote Packet Capture (RPCAP)
- View the number of packets and frames that are sent and received by RPCAP peers.
- Advanced Health Metrics
- Track heap allocation related to data capture, the system datastore, triggers, and remote transmissions. Monitor write throughput, working set size, and trigger activity on the system datastore.
Device Discovery
The Device Discovery section of the System Health dashboard provides a view of the total number of devices on your network. See which types of devices are connected and how many of those devices are currently active.
The Device Discovery section provides the following charts:
Active Devices
An area chart that displays the number of L2, L3, gateway, and custom devices that have been actively communicating on the network over the selected time interval. Next to the area chart, a value chart displays the number of L2, L3, gateway, and custom devices that were active over the selected time interval.
How this information can help you
Monitor this chart after making SPAN configuration changes to ensure that there were no unintended consequences that could put the ExtraHop system in a bad state. For example, accidentally including a network can strain the capacity of the ExtraHop system by consuming more resources and requiring more packet handling, which results in poor performance. Check that the ExtraHop system is monitoring the expected number of active devices.
Total Devices
A line chart that displays the total number of L3 and custom devices monitored by the ExtraHop system, whether active or inactive, over the selected time interval. Next to the line chart, a value chart displays the total number of L3 and custom devices that are currently being monitored by the ExtraHop system.
How this information can help you
Monitor this chart after making SPAN configuration changes to ensure that there were no unintended consequences that could put the ExtraHop system in a bad state. For example, accidentally including a network can strain the capacity of the ExtraHop system by consuming more resources and requiring more packet handling, which results in poor performance. Check that the ExtraHop system contains the expected number of total devices.
Data Feed
The Data Feed section of the System Health dashboard allows you to observe the efficiency of the wire data collection process with charts related to throughput, packet rate, desyncs, and capture drops.
The Data Feed section provides the following charts:
Throughput
An area chart depicting the throughput of incoming packets over the selected time interval, expressed in bytes per second. The chart displays throughput information for analyzed and filtered packets, as well as L2 and L3 duplicates.
How this information can help you
Exceeding product thresholds might result in data loss. For example, a high throughput rate might result in packets dropped at the SPAN source or at a SPAN aggregator. Similarly, a large number of L2 or L3 duplicates can also indicate an issue at the SPAN source or SPAN aggregator and might result in skewed or incorrect metrics.
The acceptable rate of bytes per second depends on your product. Refer to the ExtraHop Sensors datasheet to discover what the limits are for your ExtraHop system and determine if the rate of bytes per second is too high.
Packet Rate
An area chart that displays the rate of incoming packets, expressed in packets per second. The chart displays packet rate information for analyzed and filtered packets, as well as L2 and L3 duplicates.
How this information can help you
Exceeding product thresholds might result in data loss. For example, a high packet rate might result in packets dropped at the SPAN source or at a SPAN aggregator. Similarly, a large number of L2 or L3 duplicates can also indicate an issue at the SPAN source or SPAN aggregator and might result in skewed or incorrect metrics.
The acceptable rate of packets per second depends on your product. Refer to the ExtraHop Sensors datasheet to discover what the limits are for your ExtraHop system and determine if the rate of packets per second is too high.
Analyzed Flows
A line chart that displays the number of flows that the ExtraHop system analyzed over the selected time interval. The chart also displays how many unidirectional flows occurred over the same time period. Next to the line chart, a value chart displays the total number of analyzed and unidirectional flows that occurred over the selected time interval. A flow is a set of packets that are part of a transaction between two endpoints over a protocol such as TCP, UDP, or ICMP.
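The flow definition above can be illustrated with a minimal sketch. It assumes a hypothetical packet record of (protocol, source IP, source port, destination IP, destination port) and treats a flow as unidirectional when only one endpoint ever sends. This is an illustration of the concept only; the ExtraHop system might classify flows differently.

```python
from collections import defaultdict


def summarize_flows(packets):
    """Group packets into flows and count the unidirectional ones.

    Each packet is a (protocol, src_ip, src_port, dst_ip, dst_port) tuple.
    This record format is hypothetical and chosen only for illustration.
    """
    senders_by_flow = defaultdict(set)
    for proto, src_ip, src_port, dst_ip, dst_port in packets:
        src = (src_ip, src_port)
        dst = (dst_ip, dst_port)
        # Sort the endpoints so that both directions map to the same flow key.
        key = (proto,) + tuple(sorted([src, dst]))
        # Remember which endpoint sent this packet.
        senders_by_flow[key].add(src)
    total = len(senders_by_flow)
    unidirectional = sum(1 for s in senders_by_flow.values() if len(s) == 1)
    return total, unidirectional


packets = [
    ("TCP", "10.0.0.5", 51000, "10.0.0.9", 443),   # client to server
    ("TCP", "10.0.0.9", 443, "10.0.0.5", 51000),   # server reply, same flow
    ("UDP", "10.0.0.7", 5353, "10.0.0.255", 5353), # no reply seen
]
total, unidirectional = summarize_flows(packets)
print(total, unidirectional)  # prints "2 1"
```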
Desyncs
A line chart that displays occurrences of system-wide desyncs on the ExtraHop system over the selected time interval. Next to the line chart, a value chart displays the total number of desyncs that occurred over the selected time interval. A desync is when the ExtraHop data feed drops a TCP packet and, as a result, is no longer synchronized with a TCP connection.
How this information can help you
Large numbers of desyncs might indicate dropped packets on the monitoring interface, SPAN, or network tap. If adjustments to your SPAN do not reduce a large number of desyncs, contact ExtraHop Support.
Truncated Packets
A line chart that displays occurrences of truncated packets on the ExtraHop system over the selected time interval. Next to the line chart, a value chart displays the total number of truncated packets that occurred over the selected time interval. A truncated packet occurs when the actual total length of the packet is less than the total length that is indicated in the IP header.
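The truncation check described above can be sketched for IPv4 as follows. The sketch compares the number of bytes available with the Total Length field, which is the 16-bit big-endian value at byte offset 2 of the IPv4 header. The function and helper names are hypothetical, and the sketch does not represent how a sensor implements this check.

```python
import struct


def is_truncated_ipv4(ip_bytes):
    """Return True if fewer bytes are available than the IPv4 header claims.

    ip_bytes must start at the first byte of the IPv4 header
    (any link-layer header is assumed to be already removed).
    """
    if len(ip_bytes) < 20:
        # Too short to hold even a minimal IPv4 header.
        return True
    # Total Length: 16-bit big-endian field at byte offset 2.
    (total_length,) = struct.unpack_from("!H", ip_bytes, 2)
    return len(ip_bytes) < total_length


def build_header(total_length):
    """Build a minimal 20-byte IPv4 header for the example (checksum left at 0)."""
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, total_length,  # version/IHL, DSCP/ECN, total length
        0, 0,                   # identification, flags/fragment offset
        64, 6, 0,               # TTL, protocol (TCP), header checksum
        bytes([10, 0, 0, 5]),   # source address
        bytes([10, 0, 0, 9]),   # destination address
    )


full = build_header(40) + bytes(20)  # header claims 40 bytes, 40 present
sliced = full[:30]                   # simulates packet slicing at 30 bytes

print(is_truncated_ipv4(full))    # prints "False"
print(is_truncated_ipv4(sliced))  # prints "True"
```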
How this information can help you
Truncated packets might indicate packet slicing. A sensor discards all truncated packets it receives, which might cause desyncs to occur.
Capture Drop Rate
A line chart that displays the percentage of packets dropped at the network card interface on an ExtraHop system over the selected time interval.
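As a rough illustration of what a drop percentage means, the following sketch divides dropped packets by all packets that reached the interface. The choice of denominator (dropped plus captured) and the function name are assumptions made for this example; the chart might be computed differently.

```python
def capture_drop_rate(dropped, captured):
    """Percentage of packets dropped, out of all packets that reached the interface.

    Assumes that every packet is either dropped or captured.
    """
    seen = dropped + captured
    if seen == 0:
        # No traffic in the interval, so nothing was dropped.
        return 0.0
    return 100.0 * dropped / seen


print(capture_drop_rate(dropped=25, captured=9975))  # prints "0.25"
```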
How this information can help you
Packet drops often result when sensor thresholds are exceeded. Refer to the ExtraHop Sensors datasheet to discover what the limits are for your ExtraHop system.
Capture Load
A line chart that displays the percentage of cycles on the ExtraHop system that are consumed by active capture threads over the selected time interval, based on the total capture thread time. Click the associated Average Capture Load chart to drill down by thread and determine which threads are consuming the most resources.
How this information can help you
Look for spikes or upward growth of the capture load to monitor whether you are approaching sensor limits. Refer to the ExtraHop Sensors datasheet to discover the limits for your ExtraHop system.
Metrics Written to Disk (Log Scale)
A line chart that displays the amount of space consumed by metrics that were written to disk over the selected time interval, expressed in bytes per second. Because there is a large range between data points, the disk usage is displayed in logarithmic scale.
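To see why a logarithmic scale suits data with a large range, consider this small sketch. The sample values are hypothetical. On a linear axis the smallest value would be invisible next to the largest, while their base-10 logarithms fall within a few units of each other.

```python
import math


def log_positions(values):
    """Return the base-10 logarithm of each value, as plotted on a log axis."""
    return [math.log10(v) for v in values]


# Hypothetical write rates in bytes per second, spanning several orders of magnitude.
samples = [1_200, 48_000, 3_500_000, 910_000_000]

for value, position in zip(samples, log_positions(samples)):
    print(f"{value:>12,} B/s -> log10 = {position:.2f}")
```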
How this information can help you
It is important to stay aware of the amount of space that metrics are consuming on your datastore. The amount of space in your datastore will affect the amount of available lookback. If some metrics are consuming too much space, you can investigate associated triggers to see if you can modify the trigger to make it more efficient.
Records
The Records section of the System Health dashboard enables you to observe the efficiency of the wire data collection process with charts related to record counts and throughput.
The Records section provides the following charts:
Record Count
A line chart that displays the number of records sent to a recordstore over the selected time interval. Next to the line chart, a value chart displays the total number of records sent over the selected time interval.
How this information can help you
An extremely high number of records sent to a recordstore can lead to long message queue lengths and dropped messages at the recordstore. View charts in the Open Data Stream and Recordstore section of the System Health dashboard for more information about recordstore transmissions.
Record Throughput
A line chart that displays the amount of records in bytes sent to a recordstore. Next to the line chart, a value chart displays the total amount of records sent in bytes over the selected time interval.
How this information can help you
This chart does not reflect size adjustments based on compression or deduplication and should not be referenced to estimate recordstore costs. An extremely high record throughput can lead to long message queue lengths and dropped messages at the recordstore. View charts in the Open Data Stream and Recordstore section of the System Health dashboard for more information about recordstore transmissions.
Triggers
The Triggers section of the System Health dashboard allows you to monitor the impact of triggers on your system. See how often triggers are running, how often they are failing, and which triggers are placing the largest load on your CPU.
The Triggers section provides the following charts:
Trigger Load
A line chart that displays the percentage of CPU cycles allocated for trigger processes that have been consumed by triggers during the selected time interval.
How this information can help you
Look for spikes or upward growth of the trigger load, especially after creating a new trigger or modifying an existing trigger. If you notice either condition, view the Trigger Load by Trigger chart to see which triggers are consuming the most resources.
Trigger Delay
A column chart that displays the maximum trigger delays that occurred over the selected time interval in milliseconds. Next to the column chart, a value chart displays the single longest trigger delay that occurred over the selected time interval. A trigger delay is the amount of time between when a trigger event is captured and a trigger thread is created for the event.
How this information can help you
Long trigger delays might indicate processing issues. View the Trigger Exceptions by Trigger and Trigger Load by Trigger charts to see which triggers are generating the most unhandled exceptions and which are consuming the most resources.
Trigger Executes and Drops
A line and column chart where the line chart displays the number of times triggers were run, and the accompanying column chart displays the number of times triggers were dropped, over the selected time interval. Next to the line and column chart, a value chart displays the total number of trigger executes and drops that occurred over the selected time interval. These charts provide an overall snapshot of all triggers currently running on the ExtraHop system.
How this information can help you
Look for spikes in the line and column chart and investigate any triggers that have resulted in the surge. For example, you might notice increased activity if a trigger has been modified or a new trigger has been enabled. View the Trigger Executes by Trigger chart to see which triggers are running most frequently.
Trigger Details
A list chart that displays individual triggers and the number of cycles, executes, and exceptions attributed to each over the selected time interval. By default, the list of triggers is sorted in descending order by trigger cycles.
How this information can help you
Identify which triggers are consuming the most cycles. Triggers that execute too frequently or otherwise consume more cycles than they should might be assigned to more sources than necessary. Make sure that any overactive trigger is only assigned to the specific source that you need to collect data from.
Trigger Load by Trigger
A line chart that displays the percentage of CPU cycles allocated for trigger processes that have been consumed by triggers during the selected time interval, listed by trigger name.
How this information can help you
Identify which triggers are consuming the most cycles. Triggers that consume more cycles than they should might be assigned to more sources than necessary. Make sure that any overactive trigger is only assigned to the specific source that you need to collect data from.
Trigger Executes by Trigger
A line chart that displays the number of times each active trigger ran over the selected time interval.
How this information can help you
Look for triggers that are running more frequently than you would expect, which might indicate that the trigger is assigned too broadly. A trigger assigned to all applications or all devices might have a heavy performance cost. A trigger assigned to a device group that has been expanded might collect metrics you do not want. To minimize performance impact, a trigger should be assigned only to the specific sources that you need to collect data from.
High activity might also indicate that a trigger is working harder than it needs to. For example, a trigger might run on multiple events where it would be more efficient to create separate triggers, or a trigger script might not adhere to recommended scripting guidelines as described in the Triggers Best Practices Guide.
Trigger Exceptions by Trigger
A line chart that displays the number of unhandled exceptions, sorted by trigger, that occurred on the ExtraHop system over the selected time interval.
Open Data Stream and Recordstore
The Open Data Stream (ODS) and Recordstore section of the System Health dashboard enables you to follow the activity of ODS and recordstore transmissions to and from your system. You can also view the total number of remote connections, message throughput, and details related to specific remote targets.
The Open Data Stream (ODS) and Recordstore section provides the following charts:
Message Throughput
A line chart that displays the throughput of remote message data, expressed in bytes. Next to the line chart, a value chart displays the average throughput rate of remote message data over the selected time interval. Remote messages are transmissions sent to a recordstore or to third-party systems from the ExtraHop system through an open data stream (ODS).
How this information can help you
Monitor this chart to make sure that bytes are being transferred as expected. If you are seeing low throughput numbers, there might be an issue with the configuration of an ODS or attached recordstore. Significant dips in throughput might indicate problems with your data streams.
Messages Sent
A line chart that displays the average rate that remote messages were sent from the ExtraHop system to a recordstore or open data stream (ODS) target. Next to the line chart, a value chart displays the total number of messages sent out over the selected time interval.
Messages Dropped by Remote Type
A line chart that displays the average rate of remote messages that were dropped before they reached a recordstore or ODS target.
Message Send Errors
A line chart that displays the number of errors that occurred while sending a remote message to a recordstore or ODS target. Monitor this chart to make sure that packets are sent as expected. Transmission errors might involve the following:
- Target Server Errors
- The number of errors that are returned to the ExtraHop system by recordstores or ODS targets. These errors occurred on the target server and do not indicate an issue with the ExtraHop system.
- Full Queue Dropped Messages
- The number of messages sent to recordstores and ODS targets that were dropped because the message queue at the target server was full. A high number of dropped messages might indicate that message throughput is too high to be processed by the ExtraHop system or the target server. Look at the Exremote Message Queue Length by Target and the Target Details charts to see if your transmission errors might be related to a long message queue length.
- Target Mismatch Dropped Messages
- The number of remote messages dropped because the remote system specified in the Open Data Stream (ODS) trigger script does not match the name configured on the Open Data Streams page in Administration settings. Make sure that the names of remote systems are consistent in trigger scripts and Administration settings.
- Decoding Errors Dropped Messages
- The number of messages dropped as a result of internal encoding issues between ExtraHop Capture (excap) and ExtraHop Remote (exremote).
Connections
A line and column chart where the line chart displays the number of attempts the system made to connect to a remote target server and the accompanying column chart displays the number of errors that occurred as a result of those attempts. Next to the line and column chart, a value chart displays the total number of connection attempts and connection errors that occurred over the selected time interval.
Exremote Message Queue Length by Target
A line chart that displays the number of messages in the ExtraHop Remote (exremote) queue waiting to be processed by the ExtraHop system.
Excap Message Queue Length by Remote Type
A line chart that displays the number of remote target messages in the ExtraHop Capture (excap) queue waiting to be processed by the ExtraHop system.
How this information can help you
A high number of messages in the queue might indicate that message throughput is too high to be processed by the ExtraHop system or the target server. Refer to the Messages Dropped by Remote Type chart to determine if message drops have occurred.
Target Details
A list chart that displays the following metrics related to recordstore or ODS remote targets over the selected time interval: target name, target message bytes out, target messages sent, target server errors, full queue dropped messages, decoding errors dropped messages, target server connection attempts, and target server connection errors.
How this information can help you
If you are seeing message errors reported in the Messages Sent chart, the details in this chart can help you determine the root cause of remote message errors.
SSL Certificates
The SSL Certificates section of the System Health dashboard allows you to review the status information for all SSL certificates on your system.
The SSL Certificates section provides the following chart:
Certificate Details
A list chart that displays the following information for each certificate:
- Decrypted Sessions
- The number of sessions that were successfully decrypted.
- Unsupported Sessions
- The number of sessions that could not be decrypted with passive analysis, such as DHE key exchange.
- Detached Sessions
- The number of sessions that were not decrypted or only partially decrypted due to desyncs.
- Passthrough Sessions
- The number of sessions that were not decrypted due to hardware errors, such as those caused by exceeding the specifications of SSL acceleration hardware.
- Sessions Decrypted with Shared Secret
- The number of sessions that were decrypted through a shared secret key.
Remote Packet Capture (RPCAP)
The Remote Packet Capture (RPCAP) section of the System Health dashboard enables you to view the number of packets and frames that were sent from RPCAP peers and received by the ExtraHop system.
The Remote Packet Capture (RPCAP) section provides the following charts:
Forwarded by Peer
A list chart that displays the following information regarding packets and frames that are forwarded by an RPCAP peer:
- Forwarded Packets
- The number of packets that an RPCAP peer attempted to forward to an ExtraHop system.
- Forwarder Interface Packets
- The total number of packets that were viewed by the forwarder. Forwarders on RPCAP devices will coordinate with each other to keep multiple devices from sending the same packet. This is the number of packets that were viewed before any frames were removed to reduce forwarded traffic, and before frames were removed by user-defined filters.
- Forwarder Kernel Frame Drops
- The number of frames that were dropped because the kernel of the RPCAP peer was overloaded with the stream of unfiltered frames. Unfiltered frames have not been filtered by the kernel to remove duplicate packets or packets that should not be forwarded because of user-defined rules.
- Forwarder Interface Drops
- The number of packets that were dropped because the RPCAP forwarder was overloaded with the stream of unfiltered frames. Unfiltered frames have not been filtered to remove duplicate packets or packets that should not be forwarded because of user-defined rules.
Received by the ExtraHop system
A list chart that displays the following information regarding packets and frames that are received by an ExtraHop system from a Remote Packet Capture (RPCAP) peer:
- Encapsulated Bytes
- The total size of all packets related to the UDP flow from the RPCAP device to the ExtraHop system, in bytes. This information shows you how much traffic the RPCAP forwarder is adding to your network.
- Encapsulated Packets
- The number of packets related to the UDP flow from the RPCAP device to the ExtraHop system.
- Tunnel Bytes
- The total size of packets, not including encapsulation headers, that the ExtraHop system received from an RPCAP device, in bytes.
- Tunnel Packets
- The number of packets that the ExtraHop system received from an RPCAP peer. This number should be very close to the Forwarded Packets number in the Forwarded by Peer chart. If there is a big gap between these two numbers, then packets are dropping between the RPCAP device and the ExtraHop system.
How this information can help you
Tracking the encapsulated packets and bytes is a good way to make sure that RPCAP forwarders are not placing an unnecessary load on your network. You can monitor tunnel packets and bytes to make sure that the ExtraHop system is receiving everything that the RPCAP device is sending.
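The comparisons described in this section can be sketched with two small calculations. The function names and sample counts are hypothetical. The overhead calculation assumes that the difference between encapsulated bytes and tunnel bytes approximates the encapsulation headers added by the forwarder, which is an inference from the metric descriptions above.

```python
def rpcap_loss_percent(forwarded_packets, tunnel_packets):
    """Percentage of forwarded packets that did not arrive as tunnel packets."""
    if forwarded_packets == 0:
        return 0.0
    # Clamp at zero in case counters were sampled at slightly different times.
    lost = max(forwarded_packets - tunnel_packets, 0)
    return 100.0 * lost / forwarded_packets


def encapsulation_overhead_bytes(encapsulated_bytes, tunnel_bytes):
    """Approximate bytes added to the network by encapsulation headers."""
    return max(encapsulated_bytes - tunnel_bytes, 0)


# Hypothetical values read from the two RPCAP charts for the same interval.
print(rpcap_loss_percent(forwarded_packets=200_000, tunnel_packets=199_000))  # prints "0.5"
print(encapsulation_overhead_bytes(encapsulated_bytes=1_050_000, tunnel_bytes=1_000_000))  # prints "50000"
```

A loss percentage that stays near zero suggests that the ExtraHop system is receiving what the RPCAP device sends; a sustained gap is the condition described under Tunnel Packets.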
Advanced Health Metrics
The Advanced Health Metrics section of the System Health dashboard allows you to track heap allocation related to data capture, the system datastore, triggers, and remote transmissions. Monitor write throughput, working set size, and trigger activity on the system datastore.
The Advanced Health Metrics section provides the following charts:
Capture and Datastore Heap Allocation
A line chart that displays the amount of memory that the ExtraHop system dedicates to network packet capture and to the datastore.
How this information can help you
The data in this chart is for internal purposes and might be requested by ExtraHop Support to help you diagnose an issue.
Trigger and Remote Heap Allocation
A line chart that displays the amount of memory, expressed in bytes, that the ExtraHop system dedicates to processing capture triggers and to open data streams (ODS).
How this information can help you
The data in this chart is for internal purposes and might be requested by ExtraHop Support to help you diagnose an issue.
Store Write Throughput
An area chart that displays the datastore write throughput, expressed in bytes, on the ExtraHop system. The chart displays data for the selected time interval and for 24 hour, 1 hour, 5 minute, and 30 second intervals.
How this information can help you
The data in this chart is for internal purposes and might be requested by ExtraHop Support to help you diagnose an issue.
Working Set Size
An area chart that displays the write cache working set size for metrics on the ExtraHop system. The working set size indicates how many metrics can be written to the cache for the selected time interval and for 24 hour, 1 hour, 5 minute, and 30 second intervals.
Datastore Trigger Load
A line chart that displays the percentage of cycles consumed by datastore-specific triggers on the ExtraHop system, based on the total capture thread time.
How this information can help you
Look for spikes or upward growth of the datastore trigger load, especially after creating a new datastore trigger or modifying an existing datastore trigger. If you notice either, click on the Trigger Load metric label to drill down and see which datastore triggers are consuming the most resources.
Datastore Trigger Executes and Drops
A line and column chart where the line chart displays the number of times datastore-specific triggers on the ExtraHop system were run during the selected time interval, and the accompanying column chart displays the number of datastore-specific triggers dropped from the queue of triggers waiting to run on the ExtraHop system during the selected time interval.
How this information can help you
A single datastore trigger that runs often might indicate that the trigger has been assigned to all sources, such as applications or devices. To minimize performance impact, a trigger should be assigned only to the specific sources that you need to collect data from.
From the Datastore Trigger Load chart, click on the Trigger Load metric label to drill down and see which datastore triggers are running most frequently.
Any drop data displayed on the column chart indicates that datastore trigger drops are occurring and that trigger queues are backed up.
The system queues trigger operations if a trigger thread is overloaded. If the datastore trigger queue grows too long, the system stops adding trigger operations to the queue and drops the triggers. Currently running triggers are unaffected.
The primary cause of long queues, and subsequent trigger drops, is a long-running datastore trigger.
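The queue-and-drop behavior described above can be modeled with a bounded queue. This is a conceptual sketch only: the class name, the queue limit, and the operation labels are hypothetical and do not reflect how the ExtraHop system is implemented.

```python
from collections import deque


class TriggerQueue:
    """Bounded queue that drops new operations when it is full."""

    def __init__(self, max_length):
        self.max_length = max_length
        self.pending = deque()
        self.drops = 0

    def submit(self, operation):
        """Queue an operation, or drop it if the queue is already full."""
        if len(self.pending) >= self.max_length:
            self.drops += 1  # counted as a trigger drop
            return False
        self.pending.append(operation)
        return True

    def run_next(self):
        """Remove and return the oldest queued operation, or None if empty."""
        return self.pending.popleft() if self.pending else None


queue = TriggerQueue(max_length=3)
results = [queue.submit(f"op-{i}") for i in range(5)]
print(results)      # prints "[True, True, True, False, False]"
print(queue.drops)  # prints "2"

# Once a queued operation runs, there is room for new operations again.
queue.run_next()
print(queue.submit("op-5"))  # prints "True"
```

In this model, a long-running operation at the head of the queue keeps `run_next` from being called often enough, so the queue fills and later operations are dropped.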
Status and diagnostics tools in the Administration settings
The Administration settings are another source for system information and diagnostics.
For more metrics about the overall health of the ExtraHop system, and for diagnostic tools that enable ExtraHop Support to troubleshoot system errors, look at the Status and Diagnostics section of the Administration settings.