We are having a problem, and hopefully someone in this
newsgroup with experience of WinCE internals can help.

Platform: National Semiconductor's SC1200 Geode (actually
an SC1200UCL-266, now owned by AMD), 266 MHz, 64MB SDRAM,
NTSC video out, no hard drives, USB network.

Executive summary: Browser performance on Windows CE .NET
4.2 is 50% of that on Windows ME. While there are smaller
contributing factors (temp files and the CE filesystem
compression, slightly slower USB network drivers), the
largest factor appears to be the division of the CPU
between running the browser and spending time in various
threads of device.exe.

This characteristic will impact any application that
interacts with the screen.

According to the folks who actually build the image, the
platform is set up in release mode, not in debug mode.

The root question is "How can this be fixed or tweaked?"

Details (of which I've eliminated a lot in an effort to
cut to the chase):

One critical question: "Is the CPU running as fast in
both environments?"

To answer this, a Sieve of Eratosthenes benchmark program
was found on the web, compiled, and put into the CE
build. All the sieve tests calculated primes up to 28
million.

On Windows ME, it ran in 16 seconds.
On Windows CE, it ran in 20 seconds.

If you pipe the output to a text file:

On Windows ME, it ran in 16 seconds.
On Windows CE, it ran in 9 seconds.

If you telnet into the WinCE system and run the Sieve to
the telnet session's output, it runs in 9 seconds.

The sieve prints a message every time it finds 2000
primes, totaling about 20 KB of text. Almost all of it is
on one line that overwrites itself, so the screen does
not scroll during the calculations.

When the output saved to the text file is sent to the
screen with the cmd.exe prompt's type command, it takes
less than 2 seconds to display the entire file.

The sum of the parts (9 to crunch, 2 to display) is
significantly less than the whole (20 to display and
crunch together), even though the split case does extra
work that the combined case does not: opening the file,
writing the data, compressing it, closing the file,
reopening it, then reading and expanding it.

There is something about WinCE and how it interacts with
the screen that makes a program that is both crunching
data and displaying information iteratively cost nearly
twice as much time as a program that first does all the
crunching and then all the displaying in one burst.
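
A small harness along these lines can reproduce the comparison on any system. This is only a sketch under stated assumptions: the workload is a stand-in arithmetic loop rather than the real sieve, and crunch_and_report, time_interleaved and time_batched are hypothetical names invented here.

```python
import io
import sys
import time


def crunch_and_report(steps, out, report_every=2000):
    """Stand-in workload: arithmetic plus a progress line every so often."""
    total = 0
    for i in range(1, steps + 1):
        total += (i * i) % 7
        if i % report_every == 0:
            out.write("\rstep %d" % i)
            out.flush()  # force each message out, as an unbuffered console would
    out.write("\n")
    return total


def time_interleaved(steps, screen):
    """Compute and display in the same loop (the slow case in this post)."""
    start = time.perf_counter()
    total = crunch_and_report(steps, screen)
    screen.flush()
    return total, time.perf_counter() - start


def time_batched(steps, screen):
    """Compute first into memory, then display everything in one burst."""
    start = time.perf_counter()
    captured = io.StringIO()
    total = crunch_and_report(steps, captured)
    screen.write(captured.getvalue())
    screen.flush()
    return total, time.perf_counter() - start


if __name__ == "__main__":
    _, t_inter = time_interleaved(2_000_000, sys.stdout)
    _, t_batch = time_batched(2_000_000, sys.stdout)
    sys.stderr.write("interleaved %.2fs, batched %.2fs\n" % (t_inter, t_batch))
```

Both paths emit exactly the same bytes, so any difference in elapsed time comes from how the output is delivered rather than from what is delivered.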

The iterative nature of a browser (get the HTML, display
some text, spawn two threads getting two graphics, #1
arrives so render it while #2 is still arriving, #1 is
done, start downloading #3, #2 is now here so render
it, ...) follows the same compute-display-compute-display
pattern the sieve is going through and pays the same
penalty for this behavior.

To confirm this and try to isolate the cause, all three
cases were run (sieve to a file, fast; sieve to the
screen, slow; browsing, slow) and thread activity was
captured with the Windows CE Remote Kernel Tracker.

With sieve printing to the screen, the CPU alternates
between sieve.exe and device.exe, looking like a square
wave with 70% of the time in device.exe and 30% of the
time in sieve.exe. I have the actual screen captures and
data files if anyone is interested.

When the output is piped to a file, 95% of the time is
spent in sieve.exe, with what look like very short
"clicks" (as when viewing audio on an oscilloscope) of
CPU time going to device.exe.

Curiously, a modified version of sieve that doesn't print
anything until the final number of primes is found looks
the same in the tracker as when the output was piped to a
file.

It almost looks as if, when a process wants to do
something to the screen, it calls some CE driver that
formulates the screen activity into chunks of commands,
which are then queued up into another process's input
queue. When that other process wakes up, it takes the
messages off its input queue and actually renders the
screen data. That message passing overhead and scheduling
between the two processes seems absent when the whole
screen activity is sent out all at once, as when sending
the saved output text file to the screen.
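
If that guess is right, one application-side workaround is to coalesce many small writes into fewer large ones before they reach the console. The sketch below shows the idea only; CoalescingWriter is a hypothetical helper, not a WinCE API, and whether it helps on this platform is untested.

```python
import io
import sys


class CoalescingWriter:
    """Collect small writes and pass them to `sink` in larger chunks."""

    def __init__(self, sink, max_pending=4096):
        self.sink = sink
        self.max_pending = max_pending  # flush once this many characters are queued
        self.pending = []
        self.pending_len = 0
        self.sink_writes = 0  # how many times the sink was actually called

    def write(self, text):
        self.pending.append(text)
        self.pending_len += len(text)
        if self.pending_len >= self.max_pending:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.write("".join(self.pending))
            self.sink_writes += 1
            self.pending = []
            self.pending_len = 0


if __name__ == "__main__":
    writer = CoalescingWriter(sys.stdout, max_pending=256)
    for i in range(1, 101):
        writer.write("\rmessage %d" % i)
    writer.write("\n")
    writer.flush()
    sys.stderr.write("sink called %d times\n" % writer.sink_writes)
```

The trade-off is that progress on screen updates less often, so max_pending would need tuning against how responsive the display has to be.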

Device.exe has > 50 threads -- if someone can explain how
to translate thread ID into something more meaningful
(like a .DLL name, or driver name, or anything) I'd love
to answer exactly where lots of the CPU time is spent and
have further clues as to where to look. But it bounces
all around ~8 of device.exe's threads with lots of "Wait
for multiple objects" synchronization icons according to
the kernel tracker.

WinCE's Internet Explorer exhibits similar though much
more complicated patterns, as it is getting data off the
network and has many internal threads (12) bouncing all
around. Bottom line is IE looks like it is suffering from
lack of CPU due to device.exe, just like sieve is.

Browser performance was measured by repeatedly
downloading the same data from the same server over time.
This confirms that CE runs at 50% of the speed of Windows
ME on the same platform.

I really expected WinME on a hard drive to be slower, not
faster than WinCE running entirely out of RAM.

If anyone has any ideas on how to make things work
better, any assistance is greatly appreciated.

Thanks in advance!

David Soussan

My Email is the first two letters of my first name, the
first letter of my last name, at yahoo dot com.


-------

Other interesting performance characteristics found
during this investigation:

The impact of WinCE's filesystem compression on network
throughput was measured by downloading / uploading both a
very large .zip file [expensive to run compression on]
and a file filled with 00 bytes [very cheap to run
compression on].

Pushing data into WinCE, we measured 353 KB/s for a zero
filled file and 211 KB/s for the big .ZIP file. The cost
of compression cut the network throughput to about 60% of
the zero filled rate.
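
For reference, the arithmetic behind those two figures can be sketched as below. It assumes the whole difference between the two rates is compression cost, which the measurements suggest but do not prove; the function names are made up for this example.

```python
def throughput_ratio(slow_kb_s, fast_kb_s):
    """Fraction of the fast rate that the slow transfer achieved."""
    return slow_kb_s / fast_kb_s


def extra_ms_per_kb(slow_kb_s, fast_kb_s):
    """Additional milliseconds spent per KB on the slow transfer."""
    return 1000.0 / slow_kb_s - 1000.0 / fast_kb_s


if __name__ == "__main__":
    # 211 KB/s (.zip file) versus 353 KB/s (zero filled file), as measured above.
    print("ratio: %.3f" % throughput_ratio(211, 353))
    print("extra ms per KB: %.2f" % extra_ms_per_kb(211, 353))
```

Dividing 211 by 353 gives a little under 60%, and the second function turns the same two numbers into a per-KB time cost, which is easier to compare against CPU time seen in the kernel tracker.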

The reverse was not true; for pulling data out of WinCE
we measured ~253 KB/s no matter which file was pulled.

-------

For net browser throughput, it makes no difference to
either ME or CE whether the graphics are easily
compressed or are actual graphics that aren't as easily
compressed. It also makes no difference whether the image
is actually fully rendered to the screen -- shrinking the
window so the image doesn't have to be displayed did not
improve the page refresh rate by a significant amount.

-------

Re: Windows CE browser performance problems (very detailed) by Steve

Steve
Tue Feb 10 17:00:00 CST 2004

Obviously something just ain't quite right there.... ;-) You do however need
to be comparing apples to apples. Using the command prompt to dump text out
to the screen involves a few extra layers. The application calls printf(),
which in turn calls WriteFile, which transitions through a kernel trap into
device.exe and from there into the console driver. The console driver then
uses GDI calls to write text on the screen etc., each of which incurs
another kernel trap that takes it into GWES.EXE, where it is processed and
turned into a series of calls to the display driver. The console driver is
not involved in the web browser case, so you "shouldn't" be seeing
significant activity there when using the browser.


--
Steve Maillet (eMVP)
Entelechy Consulting
smaillet_AT_EntelechyConsulting_DOT_com



Re: Windows CE browser performance problems (very detailed) by anonymous

anonymous
Tue Feb 10 17:51:07 CST 2004

David,

"I really expected WinME on a hard drive to be slower, not faster than WinCE running entirely out of RAM."

This expectation is not reasonable; in fact the reverse can be true. While WinME does boot from the hard drive, WinME can also swap everything it doesn't need at the moment into the swap file and use all of RAM for the currently running application. On the other hand, RAM is being tied up under CE with the image, thus giving CE much less RAM to work with. If your image is 16 MB, that is a reduction of RAM by 25 percent, which is huge.

Also note that many things are compressed and need to be decompressed on the fly. One of them is the system fonts; if you store them uncompressed you will see an improvement, but the image size will also increase.

You might want to try popping 128 MB of RAM into a system, reconfiguring the platform for 128 MB, and not having the system fonts compressed.

Robert


Re: Windows CE browser performance problems (very detailed) by Jason

Jason
Tue Feb 17 10:49:03 CST 2004

One of the more interesting performance differences between Windows ME and
Windows CE arises from how these OSs handle system calls. Windows CE uses a
model based on exceptions to make a call into another process space such as
GWES or NDIS. For example, if the application "IE" wants to draw something
on the screen, one function it can call is BitBlt(). This function is
actually implemented in GWES, coredll.dll and the display driver. In order
to execute this function, the application will cause an exception to occur
(in the code in coredll.dll). The CE kernel takes over, passes the call
to GWES, executes the function, and then returns the thread to the
calling process. This is done for almost all interprocess calls. (There are
a very few exceptions in all-kernel mode, but not many GWES operations, if
any.) This constant exception handling is more expensive than Windows ME's
ability to make cheaper GWES calls into the kernel. On an x86, Microsoft has
shown that IE on Windows CE 4.1 is about 20-40% slower than IE on XP. See
"Performance Test Methodologies for Windows CE .NET" on MSDN for more info.
On identical hardware, this can be attributable to many factors, including
networking driver performance, display driver performance, and compiler and
kernel performance. As explained in the above article, IBench is a good tool
for standardizing IE performance measurements. Can you run IBench in both
CE and ME and post these numbers? It would also help to get IBench numbers
for XP on this hardware, since CE 4.2 is closer to XP than ME for networking.
The first step in finding the speed difference would be to get your system
set up so that you are seeing the same differences that are described in
this article. This will then help to rule out problems with the setup of
your image.
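
To put a rough number on that per-call cost from the sieve timings in the original post (20 seconds to the screen, 9 seconds to a file), a back-of-envelope model is total = compute + calls x overhead. The sketch below is only that model; the call count is an assumption, since the thread does not say how many progress messages the sieve printed, so the 1000 used here is a placeholder.

```python
def per_call_overhead(total_with_output_s, compute_only_s, calls):
    """Average extra seconds per output call, given the two measured times."""
    if calls <= 0:
        raise ValueError("calls must be positive")
    return (total_with_output_s - compute_only_s) / calls


def predicted_total(compute_only_s, calls, overhead_s):
    """Predicted run time if each output call costs `overhead_s` seconds."""
    return compute_only_s + calls * overhead_s


if __name__ == "__main__":
    assumed_calls = 1000  # placeholder, not a measured value
    overhead = per_call_overhead(20.0, 9.0, assumed_calls)
    print("about %.1f ms per call" % (overhead * 1000.0))
    # Same model, but with output batched into 10 calls instead:
    print("batched estimate: %.2f s" % predicted_total(9.0, 10, overhead))
```

With a real message count plugged in, the result could be compared against IBench numbers to see whether cross-process call overhead alone accounts for the gap.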

Jason Browne
BSquare Corporation
Microsoft eMVP


Re: Windows CE browser performance problems (very detailed) by anonymous

anonymous
Thu Feb 19 09:33:30 CST 2004

Jason;
Excellent suggestion!

We've got a support call into MS now on this issue. I'll
dig up that article / tool and see what I can shake out
of it.

I appreciate the pointer!

David Soussan