"IOCP/Socket: Connection reset (An existing connection was forcibly closed by the remote host." error when attempting to start_capture with Python API

Hi there,

We’ve started getting a mysterious “IOCP/Socket: Connection reset (An existing connection was forcibly closed by the remote host.” error when calling start_capture via the Python API. We run multiple automated builds a day from Jenkins, each executing tens of tests one after the other. The issue is sporadic, but we’ve hit it around 1-2 times a week for the last few weeks. It happens on different test machines and with different tests (the Logic 2 software version is the same on all the test machines, and the tests use the same underlying Python framework to connect to and interface with the analyser).

See the full trace here:

  File "C:\env\lib\site-packages\unified_modules\resources\resource_saleae\resource_saleae_logic2.py", line 199, in start_capture
    self.capture = self.manager.start_capture(device_configuration=self.device_configuration,
  File "C:\env\lib\site-packages\saleae\automation\manager.py", line 583, in start_capture
    reply: saleae_pb2.StartCaptureReply = self.stub.StartCapture(
  File "C:\env\lib\site-packages\grpc\_channel.py", line 1181, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "C:\env\lib\site-packages\grpc\_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "IOCP/Socket: Connection reset (An existing connection was forcibly closed by the remote host.

 -- 10054)"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"IOCP/Socket: Connection reset (An existing connection was forcibly closed by the remote host.\r\n -- 10054)", grpc_status:14, created_time:"2025-01-15T18:42:46.2128235+00:00"}"
>

We’ve not had much success in root-causing this. We’re putting together a small script, decoupled from our wider Python framework, that reproduces the issue. In the meantime, any insights you have would be much appreciated!

Here’s what our framework does (a condensed code sketch follows the list):

  1. Launches an instance of Logic2.exe (subprocess.Popen(r"C:\Program Files\Logic\Logic.exe")), closing any existing processes if open
  2. “Ping” the socket using the Python socket library, to make sure it’s active
  3. Connect to the analyser using automation.Manager.connect
  4. Check that a device is connected by verifying len(self.manager.get_devices()) != 0
  5. Set device configuration using automation.LogicDeviceConfiguration
  6. Set capture configuration using automation.CaptureConfiguration (default we set is automation.TimedCaptureMode)
  7. Start the capture using start_capture, passing in the device and capture configuration as arguments
  8. Stop the capture using stop_capture
  9. Export the CSV using export_raw_data_csv, export the data table using export_data_table, and save the SAL file using save_capture
  10. Close the application using close() and kill the Logic2.exe process

Note that if any issues occur during steps 1-9, a teardown method is called which executes step 10, to ensure that no latent process is left open before the next test begins.
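For reference, here’s a condensed, self-contained sketch of roughly what those ten steps look like against the saleae.automation API. The port, channels, sample rate, capture duration, and output paths below are illustrative placeholders rather than our real configuration:

    import subprocess
    import time

    from saleae import automation

    # Step 1: launch Logic 2 (our framework first kills any existing processes)
    logic_process = subprocess.Popen(r"C:\Program Files\Logic\Logic.exe")
    time.sleep(5)  # crude stand-in for step 2's socket "ping"

    # Steps 3-4: connect to the automation server, confirm a device is present
    with automation.Manager.connect(port=10430) as manager:
        devices = manager.get_devices()
        assert len(devices) != 0, "no Saleae device connected"

        # Steps 5-6: device and capture configuration (placeholder values)
        device_configuration = automation.LogicDeviceConfiguration(
            enabled_digital_channels=[0, 1],
            digital_sample_rate=10_000_000,
        )
        capture_configuration = automation.CaptureConfiguration(
            capture_mode=automation.TimedCaptureMode(duration_seconds=5.0),
        )

        # Step 7: the call that sporadically fails with StatusCode.UNAVAILABLE
        with manager.start_capture(
            device_id=devices[0].device_id,
            device_configuration=device_configuration,
            capture_configuration=capture_configuration,
        ) as capture:
            # Step 8: end the capture; with TimedCaptureMode, wait() blocks
            # until the timed capture completes
            capture.wait()

            # Step 9: exports (export_data_table also needs analyzer handles,
            # omitted here for brevity)
            capture.export_raw_data_csv(directory=r"C:\temp\capture_csv")
            capture.save_capture(filepath=r"C:\temp\capture.sal")

    # Step 10: the context managers close the capture and manager;
    # we then kill the Logic2.exe process
    logic_process.kill()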

It is at step 7 that we’re experiencing sporadic failures.

Here’s what we’ve tried so far (our port check is sketched after the list):

  1. When the exception is thrown, run netstat to check whether any existing, latent process is blocking the port (there isn’t!)
  2. Added a 5-second delay after data export, to ensure it has completed before the next test begins, in case an in-flight export was blocking the subsequent start_capture
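For what it’s worth, the port check we use for step 2’s “ping” (and re-run after a failure, alongside netstat) is essentially the following; a minimal sketch, assuming the default automation port 10430:

    import socket

    def automation_port_open(host: str = "127.0.0.1", port: int = 10430) -> bool:
        # True if something is accepting TCP connections on the automation port
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(1.0)
            return sock.connect_ex((host, port)) == 0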

We’re looking at re-launching the application if start_capture fails, as suggested in this related(?) post: RPC error - #7 by nick.smith. I note that the error in that post is slightly different from ours. Are there any further recommendations?
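To make that concrete, the relaunch-and-retry we have in mind looks roughly like this. relaunch_logic2() is a hypothetical hook standing in for our framework’s steps 1-3, and the UNAVAILABLE check mirrors the status code in the trace above:

    import grpc

    def start_capture_with_relaunch(manager, device_configuration,
                                    capture_configuration, max_attempts=3):
        # Retry start_capture, relaunching Logic 2 when the gRPC channel drops
        for attempt in range(1, max_attempts + 1):
            try:
                return manager.start_capture(
                    device_configuration=device_configuration,
                    capture_configuration=capture_configuration,
                )
            except grpc.RpcError as err:
                # Winsock 10054 surfaces as StatusCode.UNAVAILABLE; re-raise
                # anything else, and give up on the final attempt
                if err.code() != grpc.StatusCode.UNAVAILABLE or attempt == max_attempts:
                    raise
                manager = relaunch_logic2()  # hypothetical: re-run steps 1-3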

Another possible factor: the team updated to Logic 2 software v2.4.22 when it became available. We’re using the Python API v1.0.7, albeit a “hand-rolled” build with grpcio pinned to 1.65.5, as a workaround for the issue we reported on your GitHub: Relax grpcio & grpcio-tools version · Issue #49 · saleae/logic2-automation · GitHub. Could this be related?
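For concreteness, the effective pins in that environment amount to (requirements-style excerpt; versions as stated above):

    logic2-automation==1.0.7
    grpcio==1.65.5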

This is a critical issue for us, as it’s causing frequent failures in our automated builds. Any help would be great.

Let me know if I can provide more information.

Best,
Sophie

Hi @sophie.hunter, sorry for the trouble with this!

One thing that can close that socket is the Logic 2 software crashing.

Can you check to see if the Logic 2 software is still running after this occurs?
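If it’s easier to check programmatically in your teardown, something like this (a sketch using Windows’ tasklist; adjust the image name if your install differs) would tell you whether the process survived:

    import subprocess

    def logic2_running() -> bool:
        # Windows-only: list processes whose image name matches Logic.exe
        output = subprocess.run(
            ["tasklist", "/FI", "IMAGENAME eq Logic.exe"],
            capture_output=True, text=True,
        ).stdout
        return "Logic.exe" in output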

We’ll also want to collect the software logs and get the software’s machine ID, in case it uploaded any error reports to us.
Logs are located here: %APPDATA%\Logic\logs. Feel free to just send us a zip of all the logs from a range of time that you’re sure includes at least several failures.
I also recommend we move this issue over to a support ticket, and send the logs in there: https://contact.saleae.com/hc/en-us/requests/new

Instructions to get the machine ID here: Getting your Machine ID | Saleae Support

If you can’t run the software interactively on that machine, you can snag the config file which contains the machine ID and email that in too: %APPDATA%\Logic\config.json

I also have a few questions:

  • Have you seen this error outside of your CI environment?
  • What user account is your Jenkins CI runner using? (perhaps SYSTEM, or a regular user account?)
  • How many machines are running these tests? Just one?
  • How many Saleae devices do you have connected to the CI machine(s)? Just one? Which models?
  • Can you visually inspect the Saleae device after a failure? Is the LED green, white, or flashing red? (Green and white both mean the device should be operating normally; flashing red indicates a firmware assert, and may require a hardware reset.)
  • What, if anything, does it take to get things working again? Does the next CI run generally just work, or do you need to reset anything manually to get the tests working again?

Sorry again for the trouble, and I hope we can get this resolved soon.