Analyzer error causes API code to wait indefinitely

Hi,

I’ve run into an issue using the Logic2 Automation API in Python (v1.0.7). When I add a low-level analyzer using the Capture.add_analyzer(…) function, it works fine most of the time. Sometimes, however, there is an error with the analyzer, which causes the code to wait indefinitely on this function call without returning or raising an exception.
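For reference, this is roughly the shape of the code I’m running (the device ID, channels, trigger settings and analyzer settings here are simplified placeholders rather than my real configuration):

from saleae import automation

with automation.Manager.connect(port=10430) as manager:
    with manager.start_capture(
            device_id='REPLACE_WITH_DEVICE_ID',  # placeholder
            device_configuration=automation.LogicDeviceConfiguration(
                enabled_digital_channels=[0, 1, 2, 3],
                digital_sample_rate=10_000_000,
            ),
            capture_configuration=automation.CaptureConfiguration(
                capture_mode=automation.DigitalTriggerCaptureMode(
                    trigger_type=automation.DigitalTriggerType.RISING,
                    trigger_channel_index=2,
                    after_trigger_seconds=1.0,
                    trim_data_seconds=0.01,
                ),
            )) as capture:
        # ... the capture runs (and finishes) as part of a longer automated sequence ...

        # Usually this returns an AnalyzerHandle. When the analyzer errors out
        # ("Failed to keep up..."), the call below blocks forever instead of
        # returning or raising.
        spi = capture.add_analyzer('SPI', label='SPI', settings={
            'MISO': 0,
            'Clock': 1,
            'Enable': 2,
        })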

As you probably already know, this analyzer failure can happen for various reasons. For example, I’ve seen error messages in the GUI saying “Failed to keep up or no activity…”. It can happen, for example, when the trigger signal is lost, i.e. exactly the kind of hardware failure I’m trying to debug.

In the GUI, this is handled by restarting the analyzer, as described on the support page. However, there is no way to restart an analyzer from the API, or even to know that this error has happened (or if there is one, I didn’t find it). I know the code is still waiting for add_analyzer to return rather than having crashed, because if I restart the analyzer manually in the GUI, execution continues. Obviously this is not an option for a long automated process.

My suggestion would be to handle the error inside the API function and raise an exception. At the very least, a timeout could be added so the code doesn’t freeze.

I would love to hear your thoughts on what else I can try in the meantime.

Many thanks,
Daniel

@daniel.adorjan Sorry for the trouble with that! You’re absolutely right — we need to recover automatically from this. Even better, this error shouldn’t occur at all.

In the meantime, I’d like to gather some more information from you to help us debug this.

  1. Can you share your machine ID with us?
  2. Which low level analyzer are you configuring via add_analyzer()? Is it one of the pre-installed ones, or a custom one?
  3. Are any of the error messages linked below the ones you are seeing? Are you seeing any others as well?
    Error Message: ReadTimeout | Saleae Support
    Error: Capture stopped because backlog data exceeded 90% | Saleae Support

In case you’d like to open a private support ticket to discuss this topic as well, feel free to fill out the contact form below and reference this forum post, and your message will go directly to my inbox.

https://contact.saleae.com/hc/en-us/requests/new

Hope to get to the bottom of this for you as soon as possible!

Hi @timreyes,
Thank you for your reply, and for your patience while I investigated this. To answer your questions first:

  1. MachineID: 95d2ece5-c45f-455b-9955-a8f604ad7bcf
  2. I’m only using pre-installed analyzers (mostly SPI and I2C).
  3. I have certainly seen those errors, but I know they can mostly be resolved with lower sample rates. I believe the error here is different, though, because sometimes there is no error message in the GUI other than the analyzer error “Failed to keep up”.

After some trial-and-error I managed to write a cut-down version of the software to reproduce the error with a simulated device.

Observation 1: I can manually reproduce the error in the GUI with simulation: I add e.g. an SPI analyzer, THEN start a looping capture with “trim after stopping” set to 10 ms (this is the important part!), and stop immediately after starting. Interestingly, no error shows in the GUI apart from the analyzer error. Increasing the trim time helps: as long as the analyzer falls behind by LESS than the trim time during the capture (i.e. at the time of stopping, the tail of unprocessed data lies within the trimmed region), the analyzer completes. I believe this is the main cause.
However, this approach is almost impossible from the API, because I cannot add an analyzer before starting a capture (perhaps intentionally). This means I cannot simulate SPI data, because simulated protocol data only appears once the simulation has an analyzer. Without actual data to analyze, the SPI analyzer is faster and usually able to keep up.

Observation 2:
Adding an analyzer while the capture is still running causes the code to wait on the add_analyzer function until the capture stops. However, the analyzer IS added; the call simply waits for the data to be returned. This means that a long run spent waiting for data causes the analyzer to fall behind. Then, if a trigger/stop finally happens, the trimming is applied at the end and the analyzer shows the error. Again, there is no error in the GUI, and the code is still stuck on the add_analyzer function. You can try this with a relatively fast SPI channel.
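Roughly what I mean (a sketch only; the device ID is a placeholder):

# manager, device_config and capture_config are created as in my first post.
capture = manager.start_capture(
    device_id='REPLACE_WITH_DEVICE_ID',   # placeholder
    device_configuration=device_config,
    capture_configuration=capture_config,
)
# The capture is still running here. The call below only returns once the
# capture stops, even though the analyzer appears in the GUI immediately and
# keeps processing (and falling behind) in the meantime.
analyzer = capture.add_analyzer('SPI', settings={'MISO': 0, 'Clock': 1, 'Enable': 2})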

Observation 3:
I get the same behaviour if I stop at a time when there is no activity on any of the lines. You can try this by not connecting any SPI channels and just starting and stopping with the analyzer enabled. This is a different problem, but it sometimes happens with faulty hardware. In most cases, however, add_analyzer returns OK even though the “Failed to keep up” error is shown in the GUI.

Building on these observations, I managed to create a Python script that uses a simulated device. I have to set an intentionally short trim time (0.1 ms) to catch the analyzer in a bad state. Obviously this is not practical, but it is necessary for simulation. The alternative would be to somehow slow down the analyzer’s processing speed, which I couldn’t figure out how to do with simulation, or to try with real SPI data.

Note that I start the capture on a thread, just to be able to stop and trim it after a certain time even if the analyzer is stuck. Here’s the interesting part: sometimes add_analyzer returns and sometimes it gets stuck (on average every 5th call fails), so I added a loop to retry. I cannot figure out what the difference is; in both capture files there is an analyzer error. When it is stuck, I can manually go into the GUI and restart the last capture’s analyzer, and the loop then continues.
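In rough outline, the attached script does something like the following (this is a simplified re-sketch, not the file verbatim; the simulated device ID, timings and analyzer settings are just the values I happened to use, so treat them as placeholders):

import threading
import time
from saleae import automation

def run_one_capture(manager):
    capture = manager.start_capture(
        device_id='F4241',  # placeholder ID of a simulated device
        device_configuration=automation.LogicDeviceConfiguration(
            enabled_digital_channels=[0, 1, 2, 3],
            digital_sample_rate=10_000_000,
        ),
        capture_configuration=automation.CaptureConfiguration(
            # Intentionally short trim time to catch the analyzer in a bad state.
            capture_mode=automation.ManualCaptureMode(trim_data_seconds=0.0001),
        ),
    )

    # Stop (and therefore trim) the capture from a separate thread after a short
    # delay, so this thread stays free to call add_analyzer() on the running capture.
    def stop_later():
        time.sleep(0.5)
        capture.stop()
    threading.Thread(target=stop_later, daemon=True).start()

    # This is the call that sometimes never returns (roughly 1 in 5 for me).
    capture.add_analyzer('SPI', settings={'MISO': 0, 'Clock': 1, 'Enable': 2})
    capture.close()

with automation.Manager.connect() as manager:
    for i in range(20):
        print(f'iteration {i}')
        run_one_capture(manager)  # when add_analyzer hangs, this iteration never completes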

_saleae_analyzer_error.py (1.9 KB)

Hope this helps with debugging, and maybe you can help me understand this behaviour better, e.g. why it happens when it does and what I can change to avoid it.

Many thanks,
Daniel

@daniel.adorjan Thanks for sending in your detailed observations! After reading through your summary, my quick thoughts are below:

  • You are correct that, when using our Automation API, an analyzer cannot be added via add_analyzer() before the capture has finished. Because of this, protocol-specific simulation data (like SPI) currently cannot be generated using our Automation API.
  • It sounds like some kind of race condition is occurring where the analyzer is added before the capture has actually finished. This may be why you can’t reproduce the error consistently.
  • “I get the same behaviour if I stop at a time when there is no activity on any of the lines.” - this is an interesting observation! The reason I say that is because there was a time when a flat signal used more resources than an active signal. I’m not sure if that’s still the case today, but we’ll check if this might still be relevant.

In any case, I think we’ve got all of the clues we need to begin investigating this for now! We’ll let you know if we have any updates, or if we need to gather anything else from you.

Hi @daniel.adorjan ,

I am having trouble reproducing the issue over here. I’d like to collect a little more information.

  1. What version of the software are you using?
  2. Can you reproduce the problem if you add the analyzer after the capture has finished? e.g. after capture.stop() or capture.wait()
  3. For your use-case, do you need to add it before the capture has completed?

Thanks!

Ryan

Hi Ryan,

Thanks for looking into this issue, and sorry to hear about your difficulty reproducing it. I was hoping the simulated devices behaved identically. Perhaps, as Tim suggested, if it’s a race condition it might behave differently depending on the machine?

  1. I’m using Logic2 Automation API v1.0.7 in Python 3.11.9. When using a physical device, it is a Logic Pro 16. The Logic2 About page reports the following versions:
    {"Environment": "production",
    "Branch": "master",
    "Commit": "f7301b3b38816555bc63015a328a6c76985a2d78",
    "Version": "2.4.22",
    "AutomationVersion": "1.0.0",
    "MachineID": "95d2ece5-c45f-455b-9955-a8f604ad7bcf",
    "PID": 24984,
    "LaunchId": "e26e1a59-7786-415c-9957-072a593d2bc8",
    "Architecture": "x64"}

2-3) The short answer is no. If I add the analyzer after the capture has stopped, there is no problem. However, knowing when it has stopped is not always obvious. For example, with a trigger capture I don’t necessarily know when the trigger happens. I cannot call .wait() because it blocks indefinitely, which is not practical for long automation scripts (e.g. if the trigger never happens I need to recover somehow). Moreover, a trigger capture has an after_trigger_seconds parameter, so even once the trigger happens the recording is still running (which would be fine if I used wait). The .stop() function can only be used for manual captures, according to the docs. There is no is_running or is_done flag in the API (as far as I know).

(Note: my workaround for these limitations was to use a Python Thread, so I can block that thread with .wait() and set my own is_done flag. It’s not perfect however, if a trigger doesn’t happen I need to call .stop(), which is not advised in the docs (and it sometimes might still continue running). I assume the trigger calls stop internally, so it might be called twice if I stopped after the trigger while waiting out after_trigger_seconds (again not recommended). Lots of grey areas, which is why I sent you the simplified script instead. The bottom line is that I cannot use the blocking .wait() without a timeout, and the blocking .add_analyzer() has the same problem. A rough sketch of this workaround follows below.)
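The wait-on-a-thread workaround looks roughly like this (simplified; capture comes from manager.start_capture() as usual, and the timeout value is just an example):

import threading

capture_done = threading.Event()

def wait_for_capture():
    capture.wait()       # blocks until trigger + after_trigger_seconds have elapsed
    capture_done.set()   # my own "is_done" flag

threading.Thread(target=wait_for_capture, daemon=True).start()

# Elsewhere in the automation script:
if not capture_done.wait(timeout=60.0):
    # The trigger never happened within my own timeout; give up and stop manually,
    # even though the docs advise against mixing stop() and wait().
    capture.stop()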

I did find another, not-so-elegant workaround for the original problem:

There is no way in the API to remove an analyzer without the returned handle, and there is no way to restart one. So if I use a timeout to detect that add_analyzer() has been blocked for whatever reason, I can save the entire capture to a file, close it, and then re-open it. This effectively restarts the analyzer that was already added, although I cannot get a handle to it. I can, however, add a new analyzer, which will usually succeed.
It’s a slow process, especially for long captures, but it is at least a way to recover from a blocked function.
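Roughly, the recovery path looks like this (simplified; capture and manager exist as before, and the file path, timeout and analyzer settings are placeholders):

import threading

def try_add_analyzer(capture, timeout_s=30.0):
    result = {}

    def worker():
        result['analyzer'] = capture.add_analyzer(
            'SPI', settings={'MISO': 0, 'Clock': 1, 'Enable': 2})

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    return result.get('analyzer')   # None means add_analyzer is still stuck

analyzer = try_add_analyzer(capture)
if analyzer is None:
    # add_analyzer is blocked: save everything, close, and re-open the capture.
    capture.save_capture('recovered.sal')            # placeholder path
    capture.close()
    capture = manager.load_capture('recovered.sal')
    # Re-opening restarts the analyzer that was already added (no handle to it),
    # and adding a new analyzer now usually succeeds.
    analyzer = try_add_analyzer(capture)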

And what happens to the thread blocked on add_analyzer? ¯\_(ツ)_/¯ It remains blocked, with no exception thrown even after the capture instance is closed underneath it. So I just give up on it and hope the GC does its job…

I cannot call .wait() because it blocks indefinitely, which is not practical for long automation scripts (e.g. if the trigger never happens I need to recover somehow).

Good point. Would a max wait time here help?

while not capture.wait(max_wait_time_seconds=1.0):
    # Capture not finished
    do_other_work()

# Capture finished
capture.add_analyzer(...)

This might be what you were saying w.r.t. “I cannot use blocking .wait() without a timeout”

It’s not perfect however, if a trigger doesn’t happen I need to call .stop(), which is not advised in the docs (and it sometimes might still continue running).

Have you seen issues with it continuing to run after calling stop() while using a trigger? I think we should revise the API on this one so that you can stop it manually. Despite what the docs say, I would be surprised if this didn’t just work.

Hi Ryan,

Indeed a max wait time would help, but from what I’ve seen in the source code, it’s not part of the latest release yet?

As for the capture continuing to run after stop(), I haven’t been able to isolate this issue, so it’s just speculation at this point. If I see more evidence in the future, I’ll open a separate ticket.

Glad to hear, though, that you think stop() will work here. Actually, the docs say it

can be used with any capture mode, but this is recommended for use with ManualCaptureMode.

There is just the curious case where the trigger happens just before I call stop and the capture has already finished: what would happen then? Because this is what the docs recommend against:

stop() and wait() should never both be used for a single capture. Do not call stop() more than once.

Hi Daniel,

Indeed a max wait time would help, but from what I’ve seen in the source code, it’s not part of the latest release yet?

No, this would need to be added. I have added it to our internal list. I don’t have an estimate on when we would be implementing it, though.

There is just the curious case where the trigger happens just before I call stop and the capture has already finished: what would happen then?

I looked at both the current source code and the version first released 3 years ago, and I don’t see any reason stop() and wait() won’t both return immediately if the capture has already stopped. I can also add this to our list, to test and change the guarantees so that wait() and stop() will return immediately if the capture has already stopped.

Ryan