At some point during data taking you will get some kind of 'ssp error'. The following page will help you diagnose the nature of the SSP problems so that when you call me, you may be better able to describe the problem or better yet, fix it yourself.
One of the new features of the DAQ system is its split cable segment. Instead of having one cable segment, we have 3 (we may have 4 in the near future), so to diagnose any kind of problem or figure out how to fix it, one needs to understand the topology of this split cable segment. The basic point is this. An SSP starts to show signs of trouble. From the error messages on the screen, can you identify which SSP upstairs is giving you trouble? Do you know how to address the problem SSP using our diagnostic tools? Since we now read out the SSP's via a PPC/SFI, the rules of how to access a problem SSP are a bit complicated.
There are 3 PPC windows on the DAQ console titled 'ppc01',
'ppc02' and 'ppc03'. Each one of these windows 'talks' to its own cable
segment. Thus cable segment #1 is accessed during DAQ via ppc01, cable
segment #2 via ppc02 and so on. There is a second path into the SSP
cable segment system and that is from bnlku7 via branch bus. This is the
DAQ data path which was used during the 92-97 runs. All the diagnostic
software we have uses this branch bus and more specifically the Bob Hackenberg
cable segment addressing scheme. At this point you should refer to the
next figure. What you see there is this dual path cable segment architecture
which is now in place. The PPC/SFI modules talk directly to their own cable
segment, while the branch bus path which connects bnlku7 to the master
fastbus crate is this alternate path one uses for diagnostics.
You should now refer to the diagram above. There are 3 kinds of devices in this diagram: 1) workstations, 2) PPC/SFI modules, 3) SSPs. The workstations are bnlku7 and bnlku9. The PPC/SFI modules are the round objects labeled ppc01, ppc02 and ppc03, and the SSPs are the long rectangular objects, some of which have labels in them describing the fastbus crate they read out (i.e. TRIG, FERA, ADC1 etc.). Connecting all these SSPs together is what is called the cable segment. This is the line drawn in green. A cable segment is much like a SCSI bus, in that it is the conduit along which data flows to and from the SSPs connected to the same cable segment.
If you take a close look, you will notice that there are two paths from the workstations (bnlku7 and bnlku9) to the 20 SSP's used during DAQ readout. One is through the Internet from either bnlku9 or bnlku7, to the PPC/SFI modules and then onto the cable segment, and the other is from bnlku7, through the Branch Bus/FB crate, then through an SSP and then onto the cable segment. These two paths mean that there are two addressing schemes used to 'talk to' all the SSP's.
Addressing the SSP's through the Internet/PPC/SFI: The PPC/SFI module is basically a computer which sits in a fastbus module and is attached directly onto the cable segment. The PPC runs the VxWorks operating system, which means that you can, in effect, log into the PPC and type commands or run programs which will talk to the SSP's sitting on its cable segment. Since the PPC/SFI module sits in a fastbus crate, it can also talk to a fastbus board plugged into the same crate. The addressing scheme for the PPC/SFI modules is therefore rather simple. For fastbus boards plugged into the fastbus crate, the address of a board is just the number of the slot it's plugged into: a fastbus board plugged into slot 5 has an address of 5. For the SSP's on the cable segment, the addresses are set by the front panel dip switches: an SSP with a front panel dip switch setting of 3 has an address of 0x103. (All addresses are written in hex.) The 9th bit of the address (0x100) determines whether the PPC/SFI talks to the cable segment or to the crate. So the addresses for the SSP's from the PPC/SFI modules will be of the form 0x1XX, where XX is the front panel dip switch setting of the SSP.
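The PPC/SFI rule can be written out as a small sketch. This is illustrative Python only, not part of the DAQ or VxWorks software; the names `CABLE_SEGMENT_BIT`, `ppc_address` and `is_cable_segment_address` are made up for this example, and it assumes the dip switch setting fits in the two hex digits XX.

```python
# Hypothetical helper names; only the 0x1XX rule itself comes from the text.
CABLE_SEGMENT_BIT = 0x100  # 9th bit: set = cable segment, clear = local crate


def ppc_address(dip_switch: int) -> int:
    """Address of an SSP on the cable segment, as seen from its PPC/SFI."""
    if not 0 <= dip_switch <= 0xFF:
        raise ValueError("dip switch setting must fit in two hex digits")
    return CABLE_SEGMENT_BIT | dip_switch


def is_cable_segment_address(address: int) -> bool:
    """True if the address points onto the cable segment, not into the crate."""
    return bool(address & CABLE_SEGMENT_BIT)


print(hex(ppc_address(0x03)))         # 0x103, the dip switch 3 example
print(is_cable_segment_address(0x5))  # False: slot 5 is a board in the crate
```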
Addressing the SSP's through the Branch Bus/Fast Bus crate: The second path to the SSP's on the cable segments is through the Branch Bus/Fast Bus crate. What this means is that on bnlku7 there is a special interface which allows bnlku7 to talk directly to a fastbus crate. This allows bnlku7 to directly read and write to any module in the fastbus crate into which this Branch Bus interface is plugged. In our experiment, this fastbus crate is called the 'Master Fastbus Crate', or Master Crate for short. Sitting in this master crate are 3 SSPs: one in slot 6, another in slot 8 and the last in slot A. Each one of these SSPs is connected to one of the 3 cable segments as shown in the diagram above. To make things a bit more complicated, in order for bnlku7 to talk to one of the SSP's on any one of the cable segments, it needs to use one of these 3 SSP's as a gateway. In order to use one of these SSP's as a gateway, some special software written by Bob Hackenberg has to be loaded into these 'gateway' SSPs. Once everything is set up, bnlku7 can talk to any one of the SSPs on any one of the cable segments, and this is how the special diagnostic software is run. What you need to know now is how to address the SSP's on the cable segment through the 'gateway' SSP's. This is simple. The address is a hex number: the leading digit is the slot in which the gateway SSP is located, and the last two digits are the front panel dip switch address of the SSP on the cable segment. So if you want to address the 'TRIG' ssp, which sits on cable segment #1, from bnlku7, the address is 801 (gateway slot 8, dip switch setting 01). To address the SSP which sits in the TD04 fastbus crate, you use the address 614 (gateway slot 6, dip switch setting 14). When you run the diagnostic software described later on from bnlku7, you need to use this addressing scheme.
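As a rough illustration of the bnlku7 scheme, the sketch below builds and splits these addresses. It is not part of the diagnostic software: the function names are hypothetical, and the layout (gateway slot followed by a two-hex-digit dip switch setting) is inferred from the examples 801 and 614.

```python
# Hypothetical helpers; the address layout is inferred from the examples
# 801 (TRIG) and 614 (TD04) in the text.
def bnlku7_address(gateway_slot: int, dip_switch: int) -> str:
    """Gateway slot followed by the two-hex-digit dip switch setting."""
    if not 0 <= dip_switch <= 0xFF:
        raise ValueError("dip switch setting must fit in two hex digits")
    return f"{gateway_slot:X}{dip_switch:02X}"


def split_bnlku7_address(address: str) -> tuple[int, int]:
    """Return (gateway slot, dip switch setting) for an address like '801'."""
    value = int(address, 16)
    return value >> 8, value & 0xFF


print(bnlku7_address(0x8, 0x01))    # 801: TRIG ssp on cable segment #1
print(bnlku7_address(0x6, 0x14))    # 614: the SSP in the TD04 crate
print(split_bnlku7_address("80c"))  # (8, 12): slot 8, dip switch 0xC
```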
The following table gives a list of the addresses used by the
PPC/SFI modules and bnlku7, as a quick reference in case you are totally
confused by the discussion in the previous section.
So how does this table help me? This is what will happen when something goes wrong with the SSP's. You do an SSPINIT from DUI and the DUI window will report that there were problems during SSPINIT. You then have to scan the three PPC windows. One of them will have some error messages with memory dumps of the SSPs in error. This output will also say which SSP gave the problem, referring to it by its PPC/SFI address. If you want to diagnose the problem further, you need to log onto bnlku7, run the diagnostic program we have, and use the bnlku7 address to talk to the SSP in error. These are two different addresses which talk to the same SSP. The table above is a quick reference guide. For example, if the trigger ssp has gone wacko, you will get an error in PPC window 'ppc01'. The SSP address it will report as having problems is 0x101. (The PPC software likes to put a 0x in front of all hex numbers.) When you log onto bnlku7 and run the ssp diagnostic program, you need to use the address 801.
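The translation from a PPC window plus PPC/SFI address to the bnlku7 address can be sketched as below. This is a hypothetical helper, not an existing tool. The gateway slots (8, A and 6 for cable segments #1, #2 and #3) are the ones given on this page, and the reference table remains the authority if the two ever disagree.

```python
# Hypothetical helper. Gateway slots per PPC window follow the text:
# ppc01 -> cable segment #1 (slot 8), ppc02 -> #2 (slot A), ppc03 -> #3 (slot 6).
GATEWAY_SLOT = {"ppc01": 0x8, "ppc02": 0xA, "ppc03": 0x6}


def ppc_to_bnlku7(ppc_window: str, ppc_address: int) -> str:
    """Translate an address reported in a PPC window to the bnlku7 address."""
    if not ppc_address & 0x100:
        raise ValueError("not a cable segment address (expected form 0x1XX)")
    dip_switch = ppc_address & 0xFF
    return f"{GATEWAY_SLOT[ppc_window]:X}{dip_switch:02X}"


print(ppc_to_bnlku7("ppc01", 0x101))  # 801, the trigger ssp
```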
Once an SSP refuses to SSPINIT, the next step is to run the SSP diagnostic tools. There are 3 programs used to do this: sspar, snoopy, and sspcheck. sspar is used to scan crates and cable segments and download bits of code into SSP's, snoopy is used to debug ssp code, and sspcheck is a general purpose ssp tester. The testing sequence is as follows. 1) SSP goes bad. 2) Figure out which SSP has gone bad, or at least which cable segment is having problems. 3) Run sspar to see if the bad ssp is addressable. 4) Run sspcheck on the bad ssp if it is addressable.
Now that you know how to address an SSP (and if you don't, don't bother to read further), the next step is to run sspar and sspcheck in order to further diagnose the problem with the SSP. sspar will be the first program you run, in order to access the cable segment to which the problem SSP is attached, so a simple overview of sspar is needed. sspar is a program which does a lot, but basically I use it to address fastbus modules, load code into SSPs, and scan the contents of fastbus crates and the cable segment. One can also load and read back values in fastbus boards with loadable registers. The first thing I do when running sspar is scan the local fastbus crate. On bnlku7 this means scanning the 'master crate'. I want to see if the ssp's in this master crate (referred to as the master ssps) are addressable. Next I load a simple program into one of the SSPs, which then allows me to scan the cable segment that master ssp is hooked up to. Since I do this again and again, I have set up sspar on bnlku7 in such a way that the initialization of the master ssp's is done for you rather quickly. sspar, when executed, will run a startup script much like a .login or .cshrc file: it will execute all the commands in the file sspar_begin.com. So I have set up sspar_begin.com in ~online/ssp.dir to initialize all the ssps used in the DAQ system. Therefore you should start your ssp diagnostics by typing the following commands.
1) Log into bnlku7 as online. The password is up on the white board in the counting house.
2) type the following commands after logging on:
> cd ssp.dir
> sspar
(A lot of sspar output follows, click here to see it)
At this point you can enter sspar commands. For debugging the SSP's, there are 3 frequently used commands:
slot <slot number in hex>
scan
down <ssp program file name>
The slot command tells sspar to which slot you wish to direct the next IO commands. The scan command tells sspar to scan either the crate or the cable segment, depending on the slot number you specified. The down command will download a specified ssp program into the ssp specified by the slot command. In ~online/ssp.dir, there is one ssp program which contains the gateway software needed for bnlku7 to be able to scan a cable segment. It's called arbtst.ssp. A typical sequence of commands used to scan cable segment #1 from bnlku7 is:
Command: slot 8
Command: down arbtst
Command: slot 800
Command: scan
Scan from route: 000008h - From the CFI's crate segment
on through the SSP at geographic address: 8 ( 8h) to its cable segment
At: 1 ( 1h) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 2 ( 2h) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 3 ( 3h) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 4 ( 4h) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 7 ( 7h) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 8 ( 8h) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 9 ( 9h) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 10 ( Ah) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 11 ( Bh) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 12 ( Ch) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 13 ( Dh) is: SSP (SLAC Scanner Processor)................... ID: 0106
At:255 ( FFh) is: GAC card, cable seg............................. ID: 1002
What you see is a list of all the
SSP's found on cable segment #1. To scan cable segment #2, you use slot
a, and to scan cable segment #3, you use slot 6. The following is a quick
reference table on how to scan any one of the cable segments from bnlku7.
Cable segment           Command you issue from
you want to scan        bnlku7:~online/ssp.dir in sspar
Cable Segment #1        slot 8
Cable Segment #2        slot a
Cable Segment #3        slot 6
That is all there is to sspar. Basically you can scan cable segments with it to see if the ssp which is having problems can be found on the cable segment. If it cannot be found, then there is no use trying to run sspcheck, since sspcheck needs to be able to find the ssp on the cable segment before it can check it. Read further down for instructions on how to proceed if you cannot find the bad ssp during a scan command.
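If you have a long scan listing and want to check quickly whether a particular ssp showed up, something like the following sketch could do it. It is not part of sspar. The function name and sample text are hypothetical, and the regular expression assumes the scan lines look exactly like the output shown earlier on this page.

```python
import re

# Matches scan lines such as:
#   At: 10 ( Ah) is: SSP (SLAC Scanner Processor)........ ID: 0106
SCAN_LINE = re.compile(
    r"At:\s*(\d+)\s*\(\s*([0-9A-Fa-f]+)h\)\s*is:\s*(.*?)\.*\s*ID:\s*(\w+)"
)


def ssps_found(scan_output: str) -> set[int]:
    """Return the cable segment addresses at which the scan reported an SSP."""
    found = set()
    for line in scan_output.splitlines():
        match = SCAN_LINE.search(line)
        # Skip non-SSP entries such as the GAC card at address FFh.
        if match and match.group(3).startswith("SSP"):
            found.add(int(match.group(2), 16))
    return found


# Shortened sample in the format of the scan output above.
SAMPLE = """\
At: 1 ( 1h) is: SSP (SLAC Scanner Processor)................... ID: 0106
At: 10 ( Ah) is: SSP (SLAC Scanner Processor)................... ID: 0106
At:255 ( FFh) is: GAC card, cable seg............................. ID: 1002
"""

found = ssps_found(SAMPLE)
print(0x1 in found)  # True: the ssp at address 1 answered the scan
print(0x3 in found)  # False: address 3 is not in this shortened sample
```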
The next line of defense against the troublesome ssp is sspcheck. sspcheck will run a bunch of diagnostics on an ssp to see if it's working or not. The basic idea for the guy on shift is: if it fails sspcheck, then something has to be done with the ssp. It could be that sspinit fails, and you think there is something wrong with the SSP. Running sspcheck will confirm whether the ssp is at fault or whether something else is awry. To run sspcheck, you first need to make sure you can scan the ssp on the cable segment. Read the previous section on how to do this. If you do find the ssp listed when scanning the cable segment, then you are set to run sspcheck. To do so, log into bnlku7 as online (the password is on the white board in the counting house) and cd to ssp.dir. From there run sspar and exit. This will initialize all the ssps and do a scan of each cable segment for you. Then you run sspcheck by typing
> sspcheck
It will ask you for the address
of the ssp you wish to check. It will then ask you again if the ssp you
want to check is really the one you want to check; just hit return, and
then continue to hit return for the rest of the questions. It will then
set out to check the ssp and, at the end of the program, tell you if
it passed all its tests or not. Click
here to view the full output of sspcheck.
The most common problem with an
SSP has to do with its hybrids. There will be situations when the DAQ stops
working due to an SSP with a bad hybrid. These hybrids need to be found
and replaced. Click
here to read John Haggerty's write-up on finding and fixing bad hybrids.
That does it for the tools used in debugging ssps: sspar and sspcheck. What follows is a discussion of how to use these tools to fix common ssp problems late at night, when you are all alone in the counting house, with the AGS horn blaring at you, in effect screaming, "You are burning up precious beam!!!".
There is no set recipe for fixing an SSP problem in the DAQ system. There are some general guidelines I can provide from my several years of experience in fixing ssp problems. It will eventually come down to your sleuthing ability to pinpoint the problem, since the SSP system is notorious for giving out false clues to throw you off track.
It's the cable, stupid!
You run sspinit from DUI and you get an error message in the ppc01 window indicating a stuck bit when trying to load ssp code into the Trigger ssp. Your first conclusion is that the Trigger SSP is bad and needs replacing. WRONG! The most likely problem is that something is wrong with the cable segment cable. The best way to diagnose this is to log onto bnlku7 as online, cd to ssp.dir, run sspar, exit and then run sspcheck on 801, the trigger ssp on cable segment #1. You should then run sspcheck on 802, 803 and 80c. 802 is the FERA ssp, which sits before the trigger ssp on the cable segment, 803 is ADC1, which sits just after the trigger ssp, and 80c is the last ssp on the cable segment #1 chain. If all those ssps sspcheck ok, then maybe the problem is with the ssp. But before ripping out the SSP, you should uncable the cable segment from that ssp and recable it again, making sure the cables are plugged in well. Run sspcheck one more time, verify that you still get your errors, and only then try a new SSP.
It's the terminator, stupid!
You have been a good shift person and ran sspcheck on a bunch of ssp's on the problem cable segment, and found that several SSPs have the same stuck bit, indicating a cable problem and not an SSP problem. You then start swapping cables around to see if you can find the bad cable, but this seems futile. Try swapping the terminator cards. The cable segment, like all cables used to transmit a signal, needs to be properly terminated. Since we have 3 cable segments, there are 3 terminator cards in the system: one in 80C, one in A18 and one in 6 (the master ssp for cable segment #3).
Reterminate the cable segment, stupid!
You continue with your spate of bad luck: messing around with the cables (jiggling them, reseating them, swapping terminator cards around) does not work, and you still have problems sspchecking a bunch of ssps on the cable segment. The next thing to do is terminate the cable segment at a different spot. For example, cable segment #1 is really long and is terminated in crate 80c (crate CCDC). You should power down fastbus crate 80c, remove the terminator, go over to crate 808 (TDC2), power it off, and plug the terminator into that crate, in effect removing all the CCD crates from cable segment #1. Try running sspcheck again on the remaining ssp's. If you still have problems, then try terminating the cable segment in a different crate.
It's the master ssp, stupid!
If you still have problems after the above 3 bits of screwing around, then you can try replacing the ssp in the Master crate. Often a bad hybrid in the master ssp will show up as a cable segment cable problem.
If you replace more than one SSP in the same slot, then IT'S NOT THE SSP, stupid!
There is often a situation when the problem seems to come from only one ssp (i.e. every other ssp on the cable segment passes sspcheck and only that one ssp fails). You cry 'eureka' and replace the ssp. A subsequent sspcheck of the new ssp in the same crate fails. Do not try swapping in another SSP. The problem is most likely in the crate and not the SSP. One trick to try at this point is to swap the suspected bad SSP with a neighboring SSP on the same cable segment. The problem should move to the new crate, following the bad ssp. If not, then your problem is most likely with the crate. (A bent pin on the fastbus backplane? This has occurred in the past!)
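The rules of thumb above can be condensed into a rough triage sketch. This is only an illustration of the reasoning, not a tool that exists on bnlku7: the function name, its arguments and the wording of the suggestions are all made up here. The real decision still comes down to your own sleuthing.

```python
# Hypothetical triage helper summarizing the rules of thumb above.
def triage(results: dict[str, bool], already_replaced: bool = False) -> str:
    """Suggest a next step from sspcheck results (bnlku7 address -> passed?)."""
    failed = sorted(addr for addr, passed in results.items() if not passed)
    if not failed:
        return "no sspcheck failures: the ssps look fine, look elsewhere"
    if len(failed) > 1:
        # Several ssps failing together points at the shared cable segment.
        return ("several ssps fail: suspect the cables, then the terminator, "
                "then the termination point, then the master ssp")
    if already_replaced:
        # A second ssp failing in the same slot points at the crate.
        return (f"only {failed[0]} fails even after a swap: suspect the crate, "
                "try swapping with a neighboring ssp")
    return (f"only {failed[0]} fails: reseat its cables, rerun sspcheck, "
            "then try a new ssp")


print(triage({"801": False, "802": False, "803": True, "80c": True}))
print(triage({"801": False, "802": True, "803": True}, already_replaced=True))
```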
Enough of my ugly sarcasm. The general point I'm trying to make is that in most cases, the problem does not lie with the SSP but elsewhere. The most common problem has been with the cables and/or terminators. There have been very few hybrid problems ever since the cable segment was split. But on July 1st, after having the DAQ on for 6 weeks, we did see our first hybrid problem. Let me finish with three ssp problem cases we had this year.
SSP Problem Study Case #1) DAQ was running fine. The DAQ is halted for some reason other than DAQ problems (AGS goes down for some problem or other). It is decided to take advantage of the down time and fix some problem CCD channels. This means *uncabling the CCDs from cable segment #1*. This was done, the CCD channels were tweaked, the CCD's were cabled back into cable segment #1, and off we went to start taking data by doing an SSP init. The Trigger SSP (the first ssp in the list of SSP's to be loaded and initialized) fails to load. The guy on shift blames the trigger ssp and swaps it, and in doing so accidentally unplugs an ecl cable which is needed to cycle the DAQ. After replacing the SSP, he still has the same problems. After some more messing around, he finally fixes the problem by reseating the cables in the CCD ssps which were originally uncabled. Now that a successful ssp init is achieved, the DAQ does not start. It just hangs. I have to come in at 2am to plug the ecl cable back in. (This was my fault: the cable was unlabeled and thus I could not diagnose the problem from home.) But it is a good typical example of how one can be fooled by cable problems and in the process break a couple more things along the way.
SSP Problem Study Case #2) DAQ fails to start or SSP init. It was fine before. The error indicates that ppc02 could not find ssp 0x101 on the cable segment. The guy on shift did the right thing: he reterminated the cable segment and eventually found that a problem existed with A01. He replaced the SSP in A01 and continued to have sspcheck problems, although not as bad as before. (With the original SSP, the cable segment was messed up and a lot of the SSPs were not 'scanable' by an sspar scan command; now only ssp A01 was not sspcheckable.) At this point I was called on to help. The problem turned out to be a bad hybrid in the original SSP, and the spare ssp used to replace the problem SSP was not fully tested. This was my fault: I did not have any properly labeled, fixed SSP's for the guy on shift to use. The fix was to put the original bad SSP into the test crate and check the hybrid voltages, which pointed to 1 bad hybrid. It was replaced, the SSP was put back into its original crate, and data taking resumed. I then spent some time on the second SSP to make sure it was working properly.
SSP Problem Study Case #3)
AGS goes off. CCD testing starts which requires
*uncabling the CCD crates*.
CCD testing is done, and sspinit problems start. The person on shift ran
sspcheck and found a stuck bit in SSP 80C (or 80B, I'm not sure). The memory
board is then replaced and sspcheck works fine. My diagnosis is that
the memory board is fine, and what fixed the problem was uncabling and
recabling the cable segment to the problem SSP.