/bugzilla3/ Bug 1098 – [version 3.1.0_15042-51] clocksource/0: Time went backwards
Bug 1098 - [version 3.1.0_15042-51] clocksource/0: Time went backwards
: [version 3.1.0_15042-51] clocksource/0: Time went backwards
Status: NEW
Product: Xen
Unspecified
: unspecified
: x86-64 Linux-2.6
: P2 normal
Assigned To: Xen Bug List
:
:
:
  Show dependency treegraph
 
Reported: 2007-10-29 16:03 CDT by Ian Marlier
Modified: 2009-09-10 03:03 CDT (History)
3 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Ian Marlier 2007-10-29 16:03:16 CDT
I've been configuring a new machine, and have encountered a bug in Xen.

The hardware/distro/kernel info is:
opensuse 10.3
Linux offxen2 2.6.22.5-31-xen #1 SMP 2007/09/21 22:29:00 UTC x86_64 x86_64
x86_64 GNU/Linux
2 x Dual-Core AMD Opteron(tm) Processor 2212 HE
8 x 2GB PC2-5300 RAM
Xen 3.1.0_15042-51

I currently have three guest domains created:
- 1 opensuse 10.3, paravirtualized
- 1 Windows XP, HVM
- 1 Windows 2003, HVM

I can start all three domains, etc.

The trouble that I'm having appears as soon as I begin to stress-test the
configuration.  Running standard burn-in testing tools (BurnInTest, for
example), I quickly manage to lock the machine up.

Testing this morning, for example, I ran BurnInTest in all four domains --
dom0, and all three domU's.  It was configured to run with only CPU, Memory,
and Disk tests.  I'll grant that I'm putting far more than standard load on the
hardware, but, well...

Within 3 minutes of the tests beginning, the host machine became entirely
unresponsive at the console, or to the network.

After reboot, the following messages appear in the messages file on dom0: 
Oct 26 11:16:24 offxen2 kernel: clocksource/0: Time went backwards:
delta=-87834561 shadow=1219075978646569 offset=40156133
Oct 26 11:16:24 offxen2 kernel: clocksource/3: Time went backwards:
delta=-60020543 shadow=1219076192448184 offset=116224924
Oct 26 11:16:24 offxen2 kernel: clocksource/3: Time went backwards:
delta=-30041758 shadow=1219076192448184 offset=242467112
Oct 26 11:16:28 offxen2 kernel: clocksource/2: Time went backwards:
delta=-29692236 shadow=1219079978661601 offset=564764018
Oct 26 11:16:34 offxen2 kernel: clocksource/0: Time went backwards:
delta=-21729405 shadow=1219085978683108 offset=317335629
Oct 26 11:16:37 offxen2 kernel: clocksource/3: Time went backwards:
delta=-15880975 shadow=1219087978689164 offset=670825178
Oct 26 11:16:37 offxen2 kernel: clocksource/1: Time went backwards:
delta=-121188552 shadow=1219089233041322 offset=181068207
Oct 26 11:16:39 offxen2 kernel: clocksource/2: Time went backwards:
delta=-130702951 shadow=1219090051851342 offset=632907468
Oct 26 11:16:50 offxen2 kernel: clocksource/0: Time went backwards:
delta=-150041357 shadow=1219101192548457 offset=648236947
Oct 26 11:16:52 offxen2 kernel: clocksource/1: Time went backwards:
delta=-43577936 shadow=1219103192555550 offset=917191565
Oct 26 11:17:00 offxen2 kernel: clocksource/2: Time went backwards:
delta=-30133282 shadow=1219112051923316 offset=182801944

Similarly, these messages appear in the messages file in the opensuse
domU:
Oct 26 11:14:42 mx1 kernel: clocksource/0: Time went backwards:
delta=-30524524 shadow=1218974192040103 offset=299209428
Oct 26 11:14:42 mx1 kernel: klogd 1.4.1, ---------- state change
---------- 
Oct 26 11:14:50 mx1 kernel: clocksource/0: Time went backwards:
delta=-35339594 shadow=1218980978262608 offset=653280945
Oct 26 11:14:50 mx1 kernel: clocksource/1: Time went backwards:
delta=-32626156 shadow=1218980978262608 offset=685231574
Oct 26 11:15:06 mx1 kernel: clocksource/1: Time went backwards:
delta=-33094544 shadow=1218997051525512 offset=883292081
Oct 26 11:15:07 mx1 kernel: clocksource/1: Time went backwards:
delta=-28049569 shadow=1218998978340581 offset=27113639
Oct 26 11:15:07 mx1 kernel: clocksource/0: Time went backwards:
delta=-29452075 shadow=1218999192142335 offset=173497295
Oct 26 11:15:10 mx1 kernel: clocksource/0: Time went backwards:
delta=-24180432 shadow=1219001232726262 offset=495644020
Oct 26 11:15:12 mx1 kernel: clocksource/1: Time went backwards:
delta=-29407880 shadow=1219004232742124 offset=267503786
Oct 26 11:15:14 mx1 kernel: clocksource/1: Time went backwards:
delta=-72293790 shadow=1219005978366429 offset=618198368
Oct 26 11:15:19 mx1 kernel: clocksource/1: Time went backwards:
delta=-30010141 shadow=1219010051576699 offset=685063187
Oct 26 11:15:25 mx1 kernel: clocksource/0: Time went backwards:
delta=-34145209 shadow=1219016192222797 offset=464415813
Oct 26 11:15:26 mx1 kernel: clocksource/1: Time went backwards:
delta=-11964123 shadow=1219017192226704 offset=564340684
Oct 26 11:15:26 mx1 kernel: clocksource/0: Time went backwards:
delta=-11068356 shadow=1219017051577257 offset=721244578
Oct 26 11:15:30 mx1 kernel: clocksource/1: Time went backwards:
delta=-11507778 shadow=1219022192247457 offset=191448477
Oct 26 11:15:31 mx1 kernel: clocksource/0: Time went backwards:
delta=-15586265 shadow=1219022978434436 offset=66316116
Oct 26 11:15:35 mx1 kernel: clocksource/0: Time went backwards:
delta=-55640150 shadow=1219027051616803 offset=76664416
Oct 26 11:15:40 mx1 kernel: clocksource/1: Time went backwards:
delta=-27264629 shadow=1219031192284307 offset=827170075
Oct 26 11:15:48 mx1 kernel: clocksource/0: Time went backwards:
delta=-13974148 shadow=1219039232839657 offset=399204868
Oct 26 11:15:57 mx1 kernel: clocksource/0: Time went backwards:
delta=-32130780 shadow=1219047978530514 offset=790642402
Oct 26 11:15:57 mx1 kernel: clocksource/0: Time went backwards:
delta=-30011595 shadow=1219048051699939 offset=989293882
Oct 26 11:16:00 mx1 kernel: clocksource/0: Time went backwards:
delta=-16655720 shadow=1219051192343976 offset=787632598
Oct 26 11:16:03 mx1 kernel: clocksource/0: Time went backwards:
delta=-11904065 shadow=1219054192356289 offset=971736210
Oct 26 11:16:06 mx1 kernel: clocksource/1: Time went backwards:
delta=-23181648 shadow=1219057192369484 offset=459549949
Oct 26 11:16:07 mx1 kernel: clocksource/1: Time went backwards:
delta=-30345878 shadow=1219058978577439 offset=435960608
Oct 26 11:16:10 mx1 kernel: clocksource/0: Time went backwards:
delta=-31035353 shadow=1219061051756571 offset=795467413
Oct 26 11:16:11 mx1 kernel: clocksource/0: Time went backwards:
delta=-20089266 shadow=1219062978593081 offset=101685066
Oct 26 11:16:16 mx1 kernel: printk: 50269 messages suppressed.
Oct 26 11:16:16 mx1 kernel: clocksource/1: Time went backwards:
delta=-24188173 shadow=1219067232963516 offset=765471054
Oct 26 11:16:21 mx1 kernel: printk: 28000 messages suppressed.
Oct 26 11:16:21 mx1 kernel: clocksource/1: Time went backwards:
delta=-30051890 shadow=1219073051795398 offset=556484104
Oct 26 11:16:26 mx1 kernel: clocksource/0: Time went backwards:
delta=-60024893 shadow=1219077192453422 offset=815345740
Oct 26 11:16:31 mx1 kernel: printk: 1 messages suppressed.
Oct 26 11:16:31 mx1 kernel: clocksource/0: Time went backwards:
delta=-805691001 shadow=1219082978673021 offset=152356711
Oct 26 11:16:37 mx1 kernel: printk: 2 messages suppressed.
Oct 26 11:16:37 mx1 kernel: clocksource/1: Time went backwards:
delta=-10843062 shadow=1219088233054364 offset=442409918
Oct 26 11:16:43 mx1 kernel: clocksource/1: Time went backwards:
delta=-40515520 shadow=1219094233037687 offset=442490158
Oct 26 11:16:55 mx1 kernel: printk: 10106 messages suppressed.
Oct 26 11:16:55 mx1 kernel: clocksource/1: Time went backwards:
delta=-50129861 shadow=1219106192568569 offset=847989665

For what it's worth, we saw this same issue several weeks ago, when we
first configured the machine, with only light load on the box.  (A
single opensuse 10.3 guest domain, being tested as a mail relay,
receiving and doing spam analysis on about 10-20 messages per minute.)
So, while the load I was putting on is quite heavy, I do know that the
same problem can occur in a non-stress environment.  It just seems to
happen much more quickly when being stressed.
Comment 1 Ian Marlier 2007-10-29 16:09:33 CDT
There is a confirmed workaround for this issue.

By changing the clocksource of Dom0 from "xen" to "jiffies", the issue appears
to be eliminated.  After doing so, I was able to run burnin testing in dom0 and
three domU's for 72 hours without any errors.

This change is made by doing
echo "jiffies" >
/sys/devices/system/clocksource/clocksource0/current_clocksource

Note that although the error messages appeared from all of the clocksources on
the machine (clocksource/0, clocksource/1, etc), changing only the
current_clocksource for clocksource0 was enough to fix this.  The others were
unaltered.
Comment 2 Ian Marlier 2007-10-29 16:12:24 CDT
Additionally, this bug is similar to bug 195:
http://bugzilla.xensource.com0

However, that bug was reported against the 2.0 codebase, and I cannot tell
whether it is actually the same issue.
Comment 3 Atsushi SAKAI 2007-10-30 23:13:14 CDT
This is my guess.

You should test with vcpu pin. It would also solve this problem.
If my guess is correct, You shoud fix xen_clocksource_read() increment
monotonically.
currently vcpu migration causes time went backwards.
Comment 4 Ian Marlier 2007-10-31 13:52:27 CDT
> This is my guess.
> 
> You should test with vcpu pin. It would also solve this problem.
> If my guess is correct, You shoud fix xen_clocksource_read() increment
> monotonically.
> currently vcpu migration causes time went backwards.
> 

I tested the system, following this advice, and there was no change in
behavior.

I pinned the CPU's this way:
offxen2:~ # xm vcpu-list
Name                              ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                           0     0     0   -b-    1146.3 0
Domain-0                           0     1     1   r--      93.8 1
Domain-0                           0     2     2   -b-     234.5 2
Domain-0                           0     3     3   -b-    1022.0 3
SUDCX                              2     0     0   -b-      26.8 0
SUDCX                              2     1     1   -b-      27.3 1
SUDCX                              2     2     2   -b-      21.3 2
SUDCX                              2     3     3   -b-      15.9 3
alpharobot                         1     0     0   -b-      58.3 0
alpharobot                         1     1     1   -b-      55.2 1
mx1                                3     0     2   -b-      10.9 2
mx1                                3     1     3   -b-       7.5 3

And left the clocksource set to "xen" -- I rebooted the machine so it was in a
fully default state.
offxen2:~ # cat
/sys/devices/system/clocksource/clocksource0/current_clocksource 
xen 

The VMs started up without trouble.

As soon as I began burn-in testing in one of the domains, occasional messages
started appearing in the logs about time going backward.

From the time that I got the burn-in tests going on all four of the domains,
the messages were appearing 100 or more times per minute.  After approximately
9 minutes of burnin testing in all four domains, the machine became
unresponsive at the console and to network, and had to be hard reset.

So, it appears that vcpu-pin does not have any effect on this issue.
Comment 5 Atsushi SAKAI 2007-11-01 04:50:27 CDT
Please let me clearify.

> As soon as I began burn-in testing in one of the domains, occasional messages
> started appearing in the logs about time going backward.

Before above testing, are all Domain vcpus pinned?
Comment 6 Ian Marlier 2007-11-01 05:24:05 CDT
(In reply to comment #5)
> Please let me clearify.
> 
> > As soon as I began burn-in testing in one of the domains, occasional messages
> > started appearing in the logs about time going backward.
> 
> Before above testing, are all Domain vcpus pinned?
> 

Yes, prior to beginning any testing, I pinned all CPUs, in all domains
(including dom0), as per the output of "xm vcpu-list" posted previously.
Comment 7 Atsushi SAKAI 2007-11-01 19:52:58 CDT
The error message is opensuse 2.6.22.5-31 only (xen_clocksource_read()).
Not exist in Linux 2.6.23 xen_clocksource_read().
Anyway, I look around the code.
Comment 8 Atsushi SAKAI 2007-11-01 20:34:18 CDT
You should replace the function from opensuse kernel to linux 2.6.23
opensuse code does not consider sync to hypervisor.

====opensuse 2.6.22 xen3-patch-2.6.18 ===
+static cycle_t xen_clocksource_read(void)
+{
+    int cpu = get_cpu();
+    struct shadow_time_info *shadow = &per_cpu(shadow_time, cpu);
+    cycle_t ret;
+
+    get_time_values_from_xen(cpu);
+
+    ret = shadow->system_timestamp + get_nsec_offset(shadow);
+
+    put_cpu();
+
+#ifdef CONFIG_SMP
+    for (;;) {
+        static cycle_t last_ret;
+#ifndef CONFIG_64BIT
+        cycle_t last = cmpxchg64(&last_ret, 0, 0);
+#else
+        cycle_t last = last_ret;
+#define cmpxchg64 cmpxchg
+#endif
+
+        if ((s64)(ret - last) < 0) {
+            if (last - ret > permitted_clock_jitter
+                && printk_ratelimit())
+                printk(KERN_WARNING "clocksource/%d: "
+                       "Time went backwards: "
+                       "delta=%Ld shadow=%Lu offset=%Lu\n",
+                       cpu, ret - last,
+                       shadow->system_timestamp,
+                       get_nsec_offset(shadow));
+            ret = last;
+        }
+        if (cmpxchg64(&last_ret, last, ret) == last)
+            break;
+    }
+#endif
+
+    return ret;
+}

==> to Linux 2.6.23 xen_clocksource_read.

static cycle_t xen_clocksource_read(void)
{
        struct shadow_time_info *shadow = &get_cpu_var(shadow_time);
        cycle_t ret;
        unsigned version;

        do {
                version = get_time_values_from_xen();
                barrier();
                ret = shadow->system_timestamp + get_nsec_offset(shadow);
                barrier();
        } while (version != __get_cpu_var(xen_vcpu)->time.version);

        put_cpu_var(shadow_time);

        return ret;
}
Comment 9 Ian Marlier 2007-11-01 20:39:09 CDT
I've actually gotten this error message on several versions of the 2.6.22
kernel (all of them opensuse packaged, from their update sources).  Both the
original install kernel (the version number of which I don't remember off the
top of my head), and the most recent update kernel (2.6.22-5.31-xen) were
subject to this bug.

Can you confirm that the bug is fixed in the 2.6.23 tree?  Or only that the
error message is gone (which leaves open the possibility that the bug may still
exist)?  Please excuse the scepticism, but I have run into cases in the past
where error messages have been eliminated, while the bugs that triggered them
remained...
Comment 10 Atsushi SAKAI 2007-11-01 21:17:45 CDT
I am not run 2.6.23 just code base survey.
You say problem still keeps on 2.6.23.
So I plan to fix it, but for me, the bug is not urgent.

Comment 11 Ian Marlier 2007-11-02 05:10:00 CDT
(In reply to comment #10)
> I am not run 2.6.23 just code base survey.
> You say problem still keeps on 2.6.23.
> So I plan to fix it, but for me, the bug is not urgent.
> 
> 
> 

No, sorry, I think I confused things.  I have _not_ tried 2.6.23, just various
kernels in the 2.6.22 series.

I'll see if I can find a 2.6.23 kernel to work with today.
Comment 12 Ian Marlier 2007-11-07 11:27:50 CST
I have been able to confirm that the same problem exists in the 2.6.23 kernel
(or, at the very least, in the version of the 2.6.23 kernel that's currently in
the opensuse factory tree).

Kernel and xen version information:
offxen2:~ # uname -a
Linux offxen2 2.6.23.1-12-xen #1 SMP 2007/11/01 23:00:04 UTC x86_64 x86_64
x86_64 GNU/Linux
offxen2:~ # xm info
host                   : offxen2
release                : 2.6.23.1-12-xen
version                : #1 SMP 2007/11/01 23:00:04 UTC
machine                : x86_64
nr_cpus                : 4
nr_nodes               : 1
sockets_per_node       : 2
cores_per_socket       : 2
threads_per_core       : 1
cpu_mhz                : 2010
hw_caps                :
178bfbff:ebd3fbff:00000000:00000010:00002001:00000000:0000001f
total_memory           : 16383
free_memory            : 15638
max_free_memory        : 15954
max_para_memory        : 15950
max_hvm_memory         : 15904
xen_major              : 3
xen_minor              : 1
xen_extra              : .0_15042-67
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : 15042
cc_compiler            : gcc version 4.2.3 20071030 (prerelease) (SUSE Linux)
cc_compile_by          : abuild
cc_compile_domain      : suse.de
cc_compile_date        : Sat Nov  3 23:25:20 UTC 2007
xend_config_format     : 4
offxen2:~ # 


Once again, I started three domains (one opensuse 10.3, one Windows Server
2003, one Windows XP).  In each of those three domains, I started a run of
BurnInTest, testing the CPU, memory, and disk access.

These messages appeared in /var/log/messages:
Nov  7 12:14:32 offxen2 kernel: [ 4106.857173] clocksource/0: Time went
backwards: delta=-30792473 shadow=4106017137892 offset=840032129
Nov  7 12:14:32 offxen2 kernel: [ 4107.514152] clocksource/2: Time went
backwards: delta=-30007277 shadow=4107017063292 offset=497083784
Nov  7 12:14:55 offxen2 kernel: [ 4130.749697] clocksource/3: Time went
backwards: delta=-90267529 shadow=4130007803982 offset=741888839
Nov  7 12:14:58 offxen2 kernel: [ 4133.152023] clocksource/2: Time went
backwards: delta=-28481309 shadow=4133007815495 offset=144202833

Immediately thereafter, the machine locked up completely.  All network
connectivity was lost, as was the ability to operate the machine at the
console.  As far as I can tell, no messages were output to the console, but it
should be noted that I was not physically at the console at the time.
Comment 13 Antono Vasiljev 2008-11-23 21:26:28 CST
Confirm this bug on debian lenny:

Linux lenny 2.6.26-1-xen-amd64 #1 SMP Sat Nov 8 21:20:04 UTC 2008 x86_64
GNU/Linux


echo "jiffies" >
/sys/devices/system/clocksource/clocksource0/current_clocksource

Fixed the problem.