Oct 22 1999.

Running Too Long, a true story.


Note: if you read the timeline you will see this took a LONG time to fix.

If VisualWorks had been an open source product I could have fixed it the same week I found it, but no, a few years needed to pass... The bug was found in 1996 and fixed in 1998, or was that 1999? At least the vendor never lost track of the problem. The interim solution was to cycle the power every two weeks; perhaps this was needed anyway, since it was hosted under Windows NT, which isn't noted for its stability either.
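
The 25-day figure that keeps coming up in the messages below is easy to check by hand. Here is a minimal sketch of the arithmetic, written in Python rather than Smalltalk purely for convenience. The names MS_PER_DAY, ms_to_days and as_signed_32 are made up for this illustration and are not part of VisualWorks or its Object Engine. The idea that the counter involved behaves like a signed 32-bit millisecond value is only the working suspicion discussed in the thread, not something the vendor confirmed.

```python
# Illustrative only: none of these names come from VisualWorks or its Object Engine.
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in one day


def ms_to_days(milliseconds):
    """Convert a millisecond count to (fractional) days."""
    return milliseconds / MS_PER_DAY


def as_signed_32(value):
    """Interpret the low 32 bits of value as a signed two's-complement integer.

    Assumption: this models a hypothetical signed 32-bit millisecond counter.
    """
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


if __name__ == "__main__":
    print(f"2**31 ms = {ms_to_days(2**31):.2f} days")
    print(f"2**32 ms = {ms_to_days(2**32):.2f} days")

    # One millisecond past the largest positive value, such a counter
    # would wrap around to a large negative number.
    last_ok = 2**31 - 1
    print(as_signed_32(last_ok), "->", as_signed_32(last_ok + 1))
```

Running it gives 24.86 days for 2**31 milliseconds and 49.71 days for 2**32 milliseconds, which matches the numbers quoted in the thread. Under the assumption above, the last line shows the counter going from 2147483647 to -2147483648 at the wrap.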


From: Damon Lease
Date: 23 Oct 1996.

John,

Steve Miner is gone for a while on vacation, so this case is back in my
hands, right where it started. I've read over the case notes, and I've got
a few questions and suggestions for you.

First of all, Steve suggested that you upgrade to the VW2.5.1 OE. I concur
with that suggestion, and if you can do it, I think you should. I know that
we talked before about the fact that you cannot upgrade your image, but I
think an OE upgrade is still a good idea. Also, can you consider an upgrade
in a test environment (such as what I'm about to suggest)?

The other suggestion depends somewhat on computing resources that you have
available, but it could help pinpoint the cause of the problem some more.

Do you have any machines where you could run a long term test? What I was
thinking about doing was starting an image on a machine at a measured time
twice a day for three straight days. Each day, start one of the images with
the 2.5 OE and the other with the 2.5.1 OE. Monitor the progress of these
images, and if they crash, note when this happens. If the behavior is
consistent, meaning that (Crash time - Start time) = C, where C is any constant
value, and the results are the same for both OEs, we will have some good
evidence to work with. You already suspect that C ~= 25 days. And, as you
noted 2**31 milliseconds = 24.86 days, so you may be on to something here.
However, we need to confirm this a bit more if possible.

I will ask engineering people here if they are aware of anything that could be
doing this in a vanilla image. While I am doing that, it would be helpful if
you could think of anything that might be occurring in your application that
would track time this way.

Lastly, how was this deployed image built? Is it a headless image? Do you
have a visual.err file that you could provide to us? If not, can you add
the headless code to include the creation of the visual.err file that helps
us to debug the problem when a crash does occur?


Damon
Technical Support ParcPlace-Digitalk, Inc., Sunnyvale, CA
Phone: 800-727-2555 FAX: 408/481-9096
Email: support-vw@parcplace.com
______________________________________________________________________
From: "Mike Patrick" <m_patrick@bc.sympatico.ca>
To: "John M McIntosh" <johnmci@ibm.net>
Subject: VW Timer Crash
Date: Thu, Dec 10, 1998, 4:29 PM


Hello John:

A long time ago (months) I put a posting in comp.lang.smalltalk about a problem I was having with VisualWorks images crashing after they had been running for a long time. (I am finding that my images crash with certain predictability after they have been running for 24 days). Someone replied describing a problem they had found with Timers having this problem. I have lost their original response and I was wondering if it may have been you, as I seem to recall discussing this with you at OOPSLA. If it was you then could you please describe that problem again. If not, then have a great Christmas.


Thank you,

Mike Patrick

______________________________________________________________________

From: "Mike Patrick" <m_patrick@bc.sympatico.ca>
To: "John M McIntosh" <johnmci@ibm.net>
Subject: VW Timer Crash

Hi John:

Yes, this seems to be like our problem. Can you tell me what version of VW you discovered this in and if ObjectShare provided a fix for it? We are using VW 2.5.1 and are pretty much stuck with it because upgrading would trigger a round of (expensive) regression testing. Do you know if there is a way of programming around the problem at the application layer?

Thanks for your help John.

Mike

-----Original Message-----
From: John M McIntosh <johnmci@ibm.net>
To: Mike Patrick <m_patrick@bc.sympatico.ca>
Date: December 10, 1998 7:04 PM
Subject: Re: VW Timer Crash


______________________________________________________________________
>
>Return-Path: <sminer@central.parcplace.com>
>From: Stephen Miner <sminer@parcplace.com>
>To: johnmci@lsil.com
>Subject: Re: Case 31855; crash after 25 days
>Reply-To: support-vw@parcplace.com
>X-Casenum: 31855
>Content-Type: text
>
>
>>Each crash appears to happen after the image runs for 25 days.
>>Is there a counter in the image related to time that overflows and
>>causes the crash?
>
>Our engineering department has discovered a bug that may explain the
>crashes that you've seen. The solution will require a new revision to
>the Object Engine. We don't have a schedule for the fix yet.
>
>The newly discovered bug is exposed (under certain conditions) by the
>fix for a problem concerning the wrap around for a counter associated
>with a delay. The delay fix was "correct" but it set up a situation
>that allowed a long standing bug to surface. (It was an accident that
>this bug was prevented from occurring previously for unrelated
>reasons.) I'm sorry I can't be more precise with a description, but
>the problem is still under investigation.
>
>I just wanted you to know that we're still working on the problem and
>that I hope a fix will be available in a future release.
>
>--
> Steve Miner phone: (800) 727-2555
> ParcPlace-Digitalk ObjectSupport fax: (408) 481-9096
> <http://www.parcplace.com/support> email: support-vw@parcplace.com
>
>

>Case 31855: Server with NT talking to Sybase crashes after working fine for a month.
>Case 31855: AR-26519 BugID 222
>
> You already suspect that C ~= 25 days. And, as you
>noted 2**31 milliseconds = 24.86 days, so you may be on to something here.
>However, we need to confirm this a bit more if possible.
>
>>Each crash appears to happen after the image runs for 25 days.
>>Is there a counter in the image related to time that overflows and
>>causes the crash?
>
>>Our engineering department has discovered a bug that may explain the
>>crashes that you've seen. The solution will require a new revision to
>>the Object Engine. We don't have a schedule for the fix yet.
>>
>>The newly discovered bug is exposed (under certain conditions) by the
>>fix for a problem concerning the wrap around for a counter associated
>>with a delay. The delay fix was "correct" but it set up a situation
>>that allowed a long standing bug to surface. (It was an accident that
>>this bug was prevented from occurring previously for unrelated
>>reasons.) I'm sorry I can't be more precise with a description, but
>>the problem is still under investigation.
>>
>>I just wanted you to know that we're still working on the problem and
>>that I hope a fix will be available in a future release.

>Daryl, I reviewed my notes about the server problem.
>2^32 milliseconds is 49.7 days, but the problem occurs at 2^31
>milliseconds, which is 24.9 days. The case number is 31855. Resolving
>the problem has been slow since it only happens after 25 days of
>non-stop running, which makes debugging difficult. Changing the clock
>on the machine doesn't affect or create the problem.
>
>
>Also
>Allen Wirfs-Brock
>Chief Scientist
>allen@parcplace.com
>503-691-0800x235

----------