System Health dashboard

The System Health dashboard provides a large collection of charts that enable you to make sure that your ExtraHop system is running as expected, to troubleshoot issues, and to assess areas that are affecting performance. For example, you can monitor the number of packets processed by the ExtraHop system to ensure that packets are continuously captured.

Each chart in the System Health dashboard contains visualizations of system performance data that have been generated over the selected time interval, organized by region.

The System Health dashboard is a built-in, system dashboard that you cannot edit, delete, or add to a shared collection. However, you can copy a chart from the System Health dashboard and add it to a custom dashboard, or you can make a copy of the dashboard and edit it to monitor metrics that are relevant to you.

Note: The Administration settings page also provides status information and diagnostic tools for all ExtraHop systems.

Access the System Health page by clicking the System Settings icon or by clicking Dashboards from the top of the page. The System Health dashboard automatically displays information about the ExtraHop system you are connected to. If you are viewing the System Health dashboard from a console, you can click the site selector at the top of the page to view data for a specific site or for all sites in your environment.

Charts on the System Health dashboard are divided into the following sections:

Device Discovery
View the total number of devices on your network. See which devices have been discovered and how many of those devices are currently active.
Data Feed
Assess the efficiency of the wire data collection process with charts related to throughput, packet rate, desyncs, and capture drops.
Records
View the total number of records that are being sent to an attached recordstore.
Triggers
Monitor the impact of triggers on your ExtraHop system. See how often triggers are running, how often they are failing, and which triggers are placing the largest load on your CPU.
Open Data Stream and Recordstore
Follow the activity of open data stream (ODS) transmissions to and from your system. View the total number of remote connections, message throughput, and details related to specific remote targets.
SSL Certificates
Review the status information for all SSL certificates on your ExtraHop system.
Remote Packet Capture (RPCAP)
View the number of packets and frames that are sent and received by RPCAP peers.
Advanced Health Metrics
Track heap allocation related to data capture, the system datastore, triggers, and remote transmissions. Monitor write throughput, working set size, and trigger activity on the system datastore.

Device Discovery

The Device Discovery section of the System Health dashboard provides a view of the total number of devices on your network. See which types of devices are connected and how many of those devices are currently active.

The Device Discovery section provides the following charts:

Active Devices

An area chart that displays the number of L2, L3, gateway, and custom devices that have been actively communicating on the network over the selected time interval. Next to the area chart, a value chart displays the number of L2, L3, gateway, and custom devices that were active over the selected time interval.

How this information can help you

Monitor this chart after making SPAN configuration changes to ensure that there were no unintended consequences that could put the ExtraHop system in a bad state. For example, accidental inclusion of a network can strain the capacity of the ExtraHop system by consuming more resources and requiring more packet handling, which results in poor performance. Check that the ExtraHop system is monitoring the expected number of active devices.

Total Devices

A line chart that displays the total number of L3 and custom devices monitored by the ExtraHop system, whether active or inactive, over the selected time interval. Next to the line chart, a value chart displays the total number of L3 and custom devices that are currently being monitored by the ExtraHop system.

How this information can help you

Monitor this chart after making SPAN configuration changes to ensure that there were no unintended consequences that could put the ExtraHop system in a bad state. For example, accidental inclusion of a network can strain the capacity of the ExtraHop system by consuming more resources and requiring more packet handling, which results in poor performance. Check that the ExtraHop system contains the expected number of total devices.

Data Feed

The Data Feed section of the System Health dashboard allows you to observe the efficiency of the wire data collection process with charts related to throughput, packet rate, desyncs, and capture drops.

The Data Feed section provides the following charts:

Throughput

An area chart depicting the throughput of incoming packets over the selected time interval, expressed in bytes per second. The chart displays throughput information for analyzed and filtered packets, as well as L2 and L3 duplicates.

How this information can help you

Exceeding product thresholds might result in data loss. For example, a high throughput rate might result in packets dropped at the span source or at a span aggregator. Similarly, large numbers of L2 or L3 duplicates can also indicate an issue at the span source or span aggregator and might result in skewed or incorrect metrics.

The acceptable rate of bytes per second depends on your product. Refer to the ExtraHop Sensors datasheet to discover what the limits are for your ExtraHop system and determine if the rate of bytes per second is too high.

Packet Rate

An area chart that displays the rate of incoming packets, expressed in packets per second. The chart displays packet rate information for analyzed and filtered packets, as well as L2 and L3 duplicates.

How this information can help you

Exceeding product thresholds might result in data loss. For example, a high packet rate might result in packets dropped at the span source or at a span aggregator. Similarly, large amounts of L2 or L3 duplicates can also indicate an issue at the span source or span aggregator and might result in skewed or incorrect metrics.

The acceptable rate of packets per second depends on your product. Refer to the ExtraHop Sensors datasheet to discover what the limits are for your ExtraHop system and determine if the rate of packets per second is too high.

Analyzed Flows

A line chart that displays the number of flows that the ExtraHop system analyzed over the selected time interval. The chart also displays how many unidirectional flows occurred over the same time period. Next to the line chart, a value chart displays the total number of analyzed and unidirectional flows that occurred over the selected time interval. A flow is a set of packets that are part of a transaction between two endpoints over a protocol such as TCP, UDP, or ICMP.

How this information can help you

Exceeding product thresholds might result in data loss. For example, a high number of analyzed flows could result in packets dropped at the span source or at a span aggregator.
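
As an illustration of the flow definition above, the following Python sketch groups packets into flows and counts how many are unidirectional. It assumes a flow is keyed by protocol plus the two endpoint address and port pairs, which is a common convention rather than a documented detail of how the ExtraHop system tracks flows. The `Packet` type and the `summarize_flows` function are hypothetical names used only for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified packet record (not an ExtraHop data structure)."""
    proto: str
    src: str
    sport: int
    dst: str
    dport: int


def summarize_flows(packets):
    """Return (total_flows, unidirectional_flows) for an iterable of Packet."""
    senders_by_flow = defaultdict(set)
    for p in packets:
        a, b = (p.src, p.sport), (p.dst, p.dport)
        # Order the endpoints so that both directions map to the same flow key.
        key = (p.proto, min(a, b), max(a, b))
        senders_by_flow[key].add(a)  # remember which endpoint sent this packet
    # A flow is unidirectional here if only one endpoint was ever seen sending.
    unidirectional = sum(1 for s in senders_by_flow.values() if len(s) == 1)
    return len(senders_by_flow), unidirectional


packets = [
    Packet("TCP", "10.0.0.5", 51000, "10.0.0.9", 443),
    Packet("TCP", "10.0.0.9", 443, "10.0.0.5", 51000),  # reply: same flow
    Packet("UDP", "10.0.0.7", 5353, "10.0.0.9", 53),    # no reply seen
]
total, one_way = summarize_flows(packets)
print(f"{total} flows, {one_way} unidirectional")
```

In this sample, the TCP request and reply collapse into one bidirectional flow, while the UDP packet with no reply counts as a unidirectional flow.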

Desyncs

A line chart that displays occurrences of system-wide desyncs on the ExtraHop system over the selected time interval. Next to the line chart, a value chart displays the total number of desyncs that occurred over the selected time interval. A desync is when the ExtraHop data feed drops a TCP packet and, as a result, is no longer synchronized with a TCP connection.

How this information can help you

Large numbers of desyncs might indicate dropped packets on the monitoring interface, SPAN, or network tap.

If adjustments to your SPAN do not reduce a large number of desyncs, contact ExtraHop Support.

Truncated Packets

A line chart that displays occurrences of truncated packets on the ExtraHop system over the selected time interval. Next to the line chart, a value chart displays the total number of truncated packets that occurred over the selected time interval. A truncated packet occurs when the actual total length of the packet is less than the total length that is indicated in the IP header.

How this information can help you

Truncated packets might indicate packet slicing. A sensor discards all truncated packets it receives, which might cause desyncs to occur.
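
The truncation condition described above can be checked with a short Python sketch. It assumes IPv4 and reads the Total Length field (a 16-bit big-endian value at byte offset 2 of the IPv4 header) to compare against the number of bytes actually captured. The function names and the 64-byte slice length are hypothetical and chosen for illustration; this is not how the sensor itself is implemented.

```python
import struct


def is_truncated_ipv4(ip_bytes: bytes) -> bool:
    """Return True if fewer bytes were captured than the IPv4 header declares.

    ip_bytes is the captured data starting at the first byte of the IPv4 header.
    """
    if len(ip_bytes) < 20:
        # Not even a minimal IPv4 header was captured.
        return True
    # Total Length is a 16-bit big-endian field at byte offset 2.
    (total_length,) = struct.unpack_from("!H", ip_bytes, 2)
    return len(ip_bytes) < total_length


def build_ipv4_packet(payload_len: int) -> bytes:
    """Build a minimal IPv4 packet for the demo (checksum left as zero)."""
    total_length = 20 + payload_len
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, total_length,  # version/IHL, DSCP/ECN, total length
        0, 0,                   # identification, flags/fragment offset
        64, 6, 0,               # TTL, protocol (TCP), checksum
        bytes([10, 0, 0, 5]), bytes([10, 0, 0, 9]),
    )
    return header + bytes(payload_len)


full = build_ipv4_packet(100)
sliced = full[:64]  # simulate packet slicing at 64 bytes
print(is_truncated_ipv4(full), is_truncated_ipv4(sliced))
```

The sliced copy still declares 120 bytes in its header but carries only 64, which is the mismatch that packet slicing produces.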

Capture Drop Rate

A line chart that displays the percentage of packets dropped at the network card interface on an ExtraHop system over the selected time interval.

How this information can help you

Packet drops often result when sensor thresholds are exceeded. Refer to the ExtraHop Sensors datasheet to discover what the limits are for your ExtraHop system.

Capture Load

A line chart that displays the percentage of cycles on the ExtraHop system that are consumed by active capture threads over the selected time interval, based on the total capture thread time. Click the associated Average Capture Load chart to drill down by thread and determine which threads are consuming the most resources.

How this information can help you

Look for spikes or upward growth of the capture load to monitor whether you are approaching sensor limits. Refer to the ExtraHop Sensors datasheet to discover the limits for your ExtraHop system.

Metrics Written to Disk (Log Scale)

A line chart that displays the amount of space consumed by metrics that were written to disk over the selected time interval, expressed in bytes per second. Because there is a large range between data points, the disk usage is displayed in logarithmic scale.

How this information can help you

It is important to stay aware of the amount of space that metrics are consuming on your datastore. The amount of space in your datastore will affect the amount of available lookback. If some metrics are consuming too much space, you can investigate associated triggers to see if you can modify the trigger to make it more efficient.

Metric Data Lookback Estimates

Displays the estimated datastore lookback metrics on the ExtraHop system. Lookback metrics are available in 24 hour, 1 hour, 5 minute, and 30 second time intervals based on the write throughput rate, which is expressed in bytes per second.

How this information can help you

Refer to this chart to determine how far back you are able to look up historical data for given time intervals. For example, you might be able to look up 1 hour intervals of data as far back as 9 days.
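
The relationship between write throughput and lookback can be sketched with simple arithmetic. Assuming a fixed amount of datastore space is available for one time interval and the write rate stays steady, lookback is approximately the available space divided by the write rate. The capacity and throughput figures below are made-up values chosen for illustration, and the `estimate_lookback_days` function is hypothetical; the ExtraHop system computes its own estimates.

```python
SECONDS_PER_DAY = 86_400


def estimate_lookback_days(capacity_bytes: float, write_bytes_per_sec: float) -> float:
    """Estimate how many days of metrics fit at a steady write rate."""
    if write_bytes_per_sec <= 0:
        raise ValueError("write throughput must be positive")
    return capacity_bytes / write_bytes_per_sec / SECONDS_PER_DAY


# Hypothetical example: 200 GB available for 1-hour metrics, written at 250 KB/s.
days = estimate_lookback_days(200e9, 250e3)
print(f"Estimated lookback: {days:.1f} days")
```

Under these assumptions, doubling the write throughput halves the estimated lookback, which is why metrics that consume a lot of space shorten the available history.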

Records

The Records section of the System Health dashboard enables you to observe the efficiency of the wire data collection process with charts related to record counts and throughput.

The Records section provides the following charts:

Record Count

A line chart that displays the number of records sent to a recordstore over the selected time interval. Next to the line chart, a value chart displays the total number of records sent over the selected time interval.

How this information can help you

An extremely high number of records sent to a recordstore can lead to long message queue lengths and dropped messages at the recordstore. View charts in the Open Data Stream and Recordstore section of the System Health dashboard for more information about recordstore transmissions.

Record Throughput

A line chart that displays the amount of records in bytes sent to a recordstore. Next to the line chart, a value chart displays the total amount of records sent in bytes over the selected time interval.

How this information can help you

This chart does not reflect size adjustments based on compression or deduplication and should not be referenced to estimate recordstore costs. An extremely high record throughput can lead to long message queue lengths and dropped messages at the recordstore. View charts in the Open Data Stream and Recordstore section of the System Health dashboard for more information about recordstore transmissions.

Triggers

The Triggers section of the System Health dashboard allows you to monitor the impact of triggers on your system. See how often triggers are running, how often they are failing, and which triggers are placing the largest load on your CPU.

The Triggers section provides the following charts:

Trigger Load

A line chart that displays the percentage of CPU cycles allocated for trigger processes that have been consumed by triggers during the selected time interval.

How this information can help you

Look for spikes or upward growth of the trigger load, especially after creating a new trigger or modifying an existing trigger. If you notice either condition, view the Trigger Load by Trigger chart to see which triggers are consuming the most resources.

Trigger Delay

A column chart that displays the maximum trigger delays that occurred over the selected time interval in milliseconds. Next to the column chart, a value chart displays the single longest trigger delay that occurred over the selected time interval. A trigger delay is the amount of time between when a trigger event is captured and a trigger thread is created for the event.

How this information can help you

Long trigger delays might indicate processing issues. View the Trigger Exceptions by Trigger and Trigger Load by Trigger charts to see which triggers are generating the most unhandled exceptions and which ones are consuming the most resources.

Trigger Executes and Drops

A line and column chart where the line chart displays the number of times triggers were run, and the accompanying column chart displays the number of times triggers were dropped, over the selected time interval. Next to the line and column chart, a value chart displays the total number of trigger executes and drops that occurred over the selected time interval. These charts provide an overall snapshot of all triggers currently running on the ExtraHop system.

How this information can help you

Look for spikes in the line and column chart and investigate any triggers that have resulted in the surge. For example, you might notice increased activity if a trigger has been modified or a new trigger has been enabled. View the Trigger Executes by Trigger chart to see which triggers are running most frequently.

Trigger Details

A list chart that displays individual triggers and the number of cycles, executes, and exceptions attributed to each over the selected time interval. By default, the list of triggers is sorted in descending order by trigger cycles.

How this information can help you

Identify which triggers are consuming the most cycles. Triggers that execute too frequently or otherwise consume more cycles than they should might be assigned to more sources than necessary. Make sure that any overactive trigger is only assigned to the specific source that you need to collect data from.

Trigger Load by Trigger

A line chart that displays the percentage of CPU cycles allocated for trigger processes that have been consumed by triggers during the selected time interval, listed by trigger name.

How this information can help you

Identify which triggers are consuming the most cycles. Triggers that consume more cycles than they should might be assigned to more sources than necessary. Make sure that any overactive trigger is only assigned to the specific source that you need to collect data from.

Trigger Executes by Trigger

A line chart that displays the number of times each active trigger ran over the selected time interval.

How this information can help you

Look for triggers that are running more frequently than you would expect, which might indicate that the trigger is assigned too broadly. A trigger assigned to all applications or all devices might have a heavy performance cost. A trigger assigned to a device group that has been expanded might collect metrics you do not want. To minimize performance impact, a trigger should be assigned only to the specific sources that you need to collect data from.

High activity might also indicate that a trigger is working harder than it needs to. For example, a trigger might run on multiple events where it would be more efficient to create separate triggers, or a trigger script might not adhere to recommended scripting guidelines as described in the Triggers Best Practices Guide.

Trigger Exceptions by Trigger

A line chart that displays the number of unhandled exceptions, sorted by trigger, that occurred on the ExtraHop system over the selected time interval.

How this information can help you

Trigger exceptions are the primary cause of trigger performance issues. If this graph indicates a trigger exception has occurred, you should investigate the trigger immediately.

Trigger Cycles by Thread

A line chart that displays the number of trigger cycles consumed by triggers for a thread.

How this information can help you

Trigger drops might occur if the consumption of one thread is considerably higher than the others, even if the thread consumption is at a low percentage. Look for an even amount of cycle consumption among threads.
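
One way to quantify how evenly cycles are spread across threads is to compare the busiest thread with the average. The Python sketch below does this for values read manually from the chart. The thread names, cycle counts, and the threshold of 2.0 are hypothetical; ExtraHop does not document a specific imbalance threshold here.

```python
from typing import Mapping


def thread_imbalance(cycles_by_thread: Mapping[str, float]) -> float:
    """Return the ratio of the busiest thread's cycles to the mean across threads."""
    values = list(cycles_by_thread.values())
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return max(values) / mean if mean > 0 else 0.0


# Hypothetical cycle counts copied from the Trigger Cycles by Thread chart.
cycles = {"thread-0": 100, "thread-1": 110, "thread-2": 90, "thread-3": 500}
ratio = thread_imbalance(cycles)

IMBALANCE_THRESHOLD = 2.0  # assumed value for illustration only
if ratio > IMBALANCE_THRESHOLD:
    busiest = max(cycles, key=cycles.get)
    print(f"{busiest} consumes {ratio:.1f}x the average cycles")
```

A ratio close to 1.0 means that consumption is even. A much higher ratio points to a single thread doing most of the work, even when overall load is low.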

Open Data Stream and Recordstore

The Open Data Stream (ODS) and Recordstore section of the System Health dashboard enables you to follow the activity of ODS and recordstore transmissions to and from your system. You can also view the total number of remote connections, message throughput, and details related to specific remote targets.

The Open Data Stream (ODS) and Recordstore section provides the following charts:

Message Throughput

A line chart that displays the throughput of remote message data, expressed in bytes. Next to the line chart, a value chart displays the average throughput rate of remote message data over the selected time interval. Remote messages are transmissions sent to a recordstore or to third-party systems from the ExtraHop system through an open data stream (ODS).

How this information can help you

Monitor this chart to make sure that bytes are being transferred as expected. If you are seeing low throughput numbers, there might be an issue with the configuration of an ODS or attached recordstore. Significant dips in throughput might indicate problems with your data streams.

Messages Sent

A line chart that displays the average rate that remote messages were sent from the ExtraHop system to a recordstore or open data stream (ODS) target. Next to the line chart, a value chart displays the total number of messages sent out over the selected time interval.

How this information can help you

Monitor this chart to make sure that messages are sent as expected. If no messages are sent, there might be an issue with the configuration of an ODS or attached recordstore.

Messages Dropped by Remote Type

A line chart that displays the average rate of remote messages that were dropped before they reached a recordstore or ODS target.

How this information can help you

Dropped messages indicate connectivity issues with the remote target. A high number of drops could also indicate that message throughput is too high to be processed by the ExtraHop system or the target server.

Message Send Errors

A line chart that displays the number of errors that occurred while sending a remote message to a recordstore or ODS target. Monitor this chart to make sure that messages are sent as expected. Transmission errors might involve the following:

Target Server Errors
The number of errors that are returned to the ExtraHop system by recordstores or ODS targets. These errors occurred on the target server and do not indicate an issue with the ExtraHop system.
Full Queue Dropped Messages
The number of messages sent to recordstores and ODS targets that were dropped because the message queue at the target server was full. A high number of dropped messages might indicate that message throughput is too high to be processed by the ExtraHop system or the target server. Look at the Exremote Message Queue Length by Target and the Target Details charts to see if your transmission errors might be related to a long message queue length.
Target Mismatch Dropped Messages
The number of remote messages dropped because the remote system specified in the Open Data Stream (ODS) trigger script does not match the name configured on the Open Data Streams page in Administration settings. Make sure that the names of remote systems are consistent in trigger scripts and Administration settings.
Decoding Errors Dropped Messages
The number of messages dropped as a result of internal encoding issues between ExtraHop Capture (excap) and ExtraHop Remote (exremote).

Connections

A line and column chart where the line chart displays the number of attempts the system made to connect to a remote target server and the accompanying column chart displays the number of errors that occurred as a result of those attempts. Next to the line and column chart, a value chart displays the total number of connection attempts and connection errors that occurred over the selected time interval.

How this information can help you

Identify target servers that are requiring an unusual number of connection attempts or generating a disproportionate number of connection errors. A spike in connection attempts might indicate that the target server is unavailable.

Exremote Message Queue Length by Target

A line chart that displays the number of messages in the ExtraHop Remote (exremote) queue waiting to be processed by the ExtraHop system.

How this information can help you

A high number of messages in the queue might indicate that message throughput is too high to be processed by the ExtraHop system or the target server. Refer to the Exremote Full Queue Dropped Messages value in the Message Send Errors chart to determine if message drops have occurred.

Excap Message Queue Length by Remote Type

A line chart that displays the number of remote target messages in the ExtraHop Capture (excap) queue waiting to be processed by the ExtraHop system.

How this information can help you

A high number of messages in the queue might indicate that message throughput is too high to be processed by the ExtraHop system or the target server.

Refer to the Messages Dropped by Remote Type chart to determine if message drops have occurred.

Target Details

A list chart that displays the following metrics related to recordstore or ODS remote targets over the selected time interval: target name, target message bytes out, target messages sent, target server errors, full queue dropped messages, decoding errors dropped messages, target server connection attempts, and target server connection errors.

How this information can help you

If you are seeing message errors reported in the Message Send Errors chart, the details in this chart can help you determine the root cause of remote message errors.

SSL Certificates

The SSL Certificates section of the System Health dashboard allows you to review the status information for all SSL certificates on your system.

The SSL Certificates section provides the following chart:

Certificate Details

A list chart that displays the following information for each certificate:

Decrypted Sessions
The number of sessions that were successfully decrypted.
Unsupported Sessions
The number of sessions that could not be decrypted with passive analysis, such as sessions that use DHE key exchange.
Detached Sessions
The number of sessions that were not decrypted or only partially decrypted due to desyncs.
Passthrough Sessions
The number of sessions that were not decrypted due to hardware errors, such as those caused by exceeding the specifications of SSL acceleration hardware.
Sessions Decrypted with Shared Secret
The number of sessions that were decrypted through a shared secret key.

How this information can help you

Monitor this chart to make sure that the correct SSL certificates are installed on the ExtraHop system and are performing decryption as expected.

Remote Packet Capture (RPCAP)

The Remote Packet Capture (RPCAP) section of the System Health dashboard enables you to view the number of packets and frames that were sent from RPCAP peers and received by the ExtraHop system.

The Remote Packet Capture (RPCAP) section provides the following charts:

Forwarded by Peer

A list chart that displays the following information regarding packets and frames that are forwarded by an RPCAP peer:

Forwarded Packets
The number of packets that an RPCAP peer attempted to forward to an ExtraHop system.
Forwarder Interface Packets
The total number of packets that were viewed by the forwarder. Forwarders on RPCAP devices will coordinate with each other to keep multiple devices from sending the same packet. This is the number of packets that were viewed before any frames were removed to reduce forwarded traffic, and before frames were removed by user-defined filters.
Forwarder Kernel Frame Drops
The number of frames that were dropped because the kernel of the RPCAP peer was overloaded with the stream of unfiltered frames. Unfiltered frames have not been filtered by the kernel to remove duplicate packets or packets that should not be forwarded because of user-defined rules.
Forwarder Interface Drops
The number of packets that were dropped because the RPCAP forwarder was overloaded with the stream of unfiltered frames. Unfiltered frames have not been filtered to remove duplicate packets or packets that should not be forwarded because of user-defined rules.

How this information can help you

Any time you see packets dropped by the RPCAP peer, it indicates that there is an issue with the RPCAP software.

Received by the ExtraHop system

A list chart that displays the following information regarding packets and frames that are received by an ExtraHop system from a Remote Packet Capture (RPCAP) peer:

Encapsulated Bytes
The total size of all packets related to the UDP flow from the RPCAP device to the ExtraHop system, in bytes. This information shows you how much traffic the RPCAP forwarder is adding to your network.
Encapsulated Packets
The number of packets related to the UDP flow from the RPCAP device to the ExtraHop system.
Tunnel Bytes
The total size of packets, not including encapsulation headers, that the ExtraHop system received from an RPCAP device, in bytes.
Tunnel Packets
The number of packets that the ExtraHop system received from an RPCAP peer. This number should be very close to the Forwarded Packets number in the Forwarded by Peer chart. If there is a big gap between these two numbers, then packets are dropping between the RPCAP device and the ExtraHop system.

How this information can help you

Tracking the encapsulated packets and bytes is a good way to make sure that RPCAP forwarders are not placing an unnecessary load on your network. You can monitor tunnel packets and bytes to make sure that the ExtraHop system is receiving everything that the RPCAP device is sending.
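
The comparison between forwarded packets and tunnel packets can be expressed as a percentage. The following Python sketch assumes you have read both counters from the charts for the same time interval. The function name and the sample numbers are hypothetical.

```python
def forwarding_gap_percent(forwarded_packets: int, tunnel_packets: int) -> float:
    """Percentage of forwarded packets that were not received as tunnel packets."""
    if forwarded_packets <= 0:
        return 0.0
    # Clamp at zero in case the counters were sampled at slightly different times.
    missing = max(forwarded_packets - tunnel_packets, 0)
    return 100.0 * missing / forwarded_packets


# Hypothetical counters for one RPCAP peer over the same time interval.
gap = forwarding_gap_percent(forwarded_packets=1_000_000, tunnel_packets=985_000)
print(f"{gap:.2f}% of forwarded packets did not arrive")
```

A gap near zero suggests the ExtraHop system is receiving what the peer sends. What counts as an acceptable gap depends on your environment.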

Advanced Health Metrics

The Advanced Health Metrics section of the System Health dashboard allows you to track heap allocation related to data capture, the system datastore, triggers, and remote transmissions. Monitor write throughput, working set size, and trigger activity on the system datastore.

The Advanced Health Metrics section provides the following charts:

Capture and Datastore Heap Allocation

A line chart that displays the amount of memory that the ExtraHop system dedicates to network packet capture and to the datastore.

How this information can help you

The data in this chart is for internal purposes and might be requested by ExtraHop Support to help you diagnose an issue.

Trigger and Remote Heap Allocation

A line chart that displays the amount of memory, expressed in bytes, that the ExtraHop system dedicates to processing capture triggers and to open data streams (ODS).

How this information can help you

The data in this chart is for internal purposes and might be requested by ExtraHop Support to help you diagnose an issue.

Store Write Throughput

An area chart that displays the datastore write throughput, expressed in bytes, on the ExtraHop system. The chart displays data for the selected time interval and for 24 hour, 1 hour, 5 minute, and 30 second intervals.

How this information can help you

The data in this chart is for internal purposes and might be requested by ExtraHop Support to help you diagnose an issue.

Working Set Size

An area chart that displays the write cache working set size for metrics on the ExtraHop system. The working set size indicates how many metrics can be written to the cache for the selected time interval and for 24 hour, 1 hour, 5 minute, and 30 second intervals.

How this information can help you

The data on this chart might spike after trigger creation or trigger modification if the trigger script is not collecting metrics efficiently.

Datastore Trigger Load

A line chart that displays the percentage of cycles consumed by datastore-specific triggers on the ExtraHop system, based on the total capture thread time.

How this information can help you

Look for spikes or upward growth of the datastore trigger load, especially after creating a new datastore trigger or modifying an existing datastore trigger. If you notice either, click on the Trigger Load metric label to drill down and see which datastore triggers are consuming the most resources.

Datastore Trigger Executes and Drops

A line and column chart where the line chart displays the number of times datastore-specific triggers on the ExtraHop system were run during the selected time interval, and the accompanying column chart displays the number of datastore-specific triggers dropped from the queue of triggers waiting to run on the ExtraHop system during the selected time interval.

How this information can help you

A single datastore trigger that runs often might indicate that the trigger has been assigned to all sources, such as applications or devices. To minimize performance impact, a trigger should be assigned only to the specific sources that you need to collect data from.

From the Datastore Trigger Load chart, click on the Trigger Load metric label to drill down and see which datastore triggers are running most frequently.

Any drop data displayed on the column chart indicates that datastore trigger drops are occurring and that trigger queues are backed up.

The system queues trigger operations if a trigger thread is overloaded. If the datastore trigger queue grows too long, the system stops adding trigger operations to the queue and drops the triggers. Currently running triggers are unaffected.

The primary cause of long queues, and subsequent trigger drops, is a long-running datastore trigger.
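
The queue-and-drop behavior described above can be modeled with a bounded queue. The Python sketch below is a simplified, hypothetical model, not the ExtraHop implementation: the class name, the queue limit of 3, and the operation names are all assumptions made for illustration.

```python
from collections import deque


class BoundedTriggerQueue:
    """Hypothetical model of a trigger queue that drops work when full."""

    def __init__(self, max_length: int):
        self.max_length = max_length
        self.pending = deque()
        self.executes = 0
        self.drops = 0

    def enqueue(self, operation: str) -> bool:
        """Queue an operation, or count a drop if the queue is already full."""
        if len(self.pending) >= self.max_length:
            self.drops += 1  # queue is full: the operation is dropped, not queued
            return False
        self.pending.append(operation)
        return True

    def run_next(self) -> None:
        """Run the oldest queued operation, if there is one."""
        if self.pending:
            self.pending.popleft()
            self.executes += 1


queue = BoundedTriggerQueue(max_length=3)
# A long-running trigger keeps the thread busy, so nothing is dequeued
# while five new operations arrive.
for i in range(5):
    queue.enqueue(f"op-{i}")
print(queue.executes, queue.drops, len(queue.pending))
```

In this model, the two operations that arrive after the queue fills are counted as drops, while the three queued operations still run once the thread becomes free.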

Datastore Trigger Exceptions by Trigger

A list chart that displays the number of unhandled exceptions caused by datastore-specific triggers on the ExtraHop system.

How this information can help you

Datastore trigger exceptions are the primary cause of trigger performance issues. If this graph indicates a trigger exception has occurred, the datastore trigger should be corrected immediately.

Status and diagnostics tools in the Administration settings

The Administration settings are another source for system information and diagnostics.

For more metrics about the overall health of the ExtraHop system, and for diagnostic tools that enable ExtraHop Support to troubleshoot system errors, look at the Status and Diagnostics section of the Administration settings.

Last modified 2023-11-07