Hello,

I am new to MS Word programming. I am currently planning a project
that aims to:

1. Read every word in a Word document, parse it, and analyze it using
multiple data mining algorithms (they are very CPU-intensive algorithms!)

2. Bold and highlight the analyzed words in the same document

I really have no idea where to start. My main concern is choosing
an efficient method to implement the system.

After some searching on Google, I found these suggestions:

1. Pure VBA implementation
2. C++/COM + VBA

Some people say C++/COM + VBA is even slower than a pure VBA
implementation. Is that true? I would like to hear more suggestions on
high-performance programming in WinWord.

Thanks

Nick

High Performance Programming in MS Word by DA

DA
Thu Jul 08 21:15:47 CDT 2004

Hi Nick,

Excuse my ignorance here, but I'm unclear as to what
you're trying to do. When you say you're parsing
each word, where are you going with it? Are these
algorithms you talk about outside of the VBA environment?
If that's the case, your performance issues are unlikely
to be in Word.

Also, how intensive can you get with analyzing a word?
Have you tried anything, suffered any performance
issues? Perhaps if you can add a few more details to
your original post we may be able to help you a bit
better.

Regards,
Dennis


Re: High Performance Programming in MS Word by Word Heretic

Word Heretic
Thu Jul 08 23:14:43 CDT 2004

G'day Nick <nick@heha.net.tw>,

<chuckles> You too huh. It's an interesting area. There are two main
methods for you to consider here.

Method 1 - Formatting is NOT important to your parse.

From VBA
Save the bloody file as text
Use DocStats as a rough guide to your word count.
Call your C# to go sicko speeds.

From C#
Serialize the words the way MS Word delimits them (any non-alpha
character following an alpha character starts a new word) into a bloody
huge array, which you can pre-size using the DocStats result as a parameter.

Keep a 'done' list of serialised words worthy of marking. Re-enter the
Word document, obtain Document.Content.Words(offset) and mark
accordingly.
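
As a rough sketch of the Method 1 bookkeeping (in Python purely for illustration; the same logic would live in whatever compiled component you call): split the saved text into word-like units, then keep a 'done' list of 1-based offsets for the words worth marking. The splitting rule below only approximates how Word's Words collection segments text, and `is_feature_word` is a hypothetical stand-in for the real analysis, so check the offsets against the actual document before relying on them.

```python
import re

# Approximation (an assumption, not Word's exact rule): a unit is a run of
# word characters or a run of punctuation, each with its trailing whitespace.
_UNIT = re.compile(r"\w+\s*|[^\w\s]+\s*")


def segment(text):
    """Split text into word-like units, in document order."""
    return _UNIT.findall(text)


def is_feature_word(word):
    """Hypothetical placeholder for the CPU-intensive analysis."""
    return len(word) > 5


def done_list(text):
    """Return 1-based offsets of the units worth marking."""
    return [
        offset
        for offset, unit in enumerate(segment(text), start=1)
        if is_feature_word(unit.strip())
    ]
```

Each offset in the returned list would then be used as the index into Document.Content.Words(offset) on the Word side.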



Method 2 - Formatting is important

For extreme speed, I would probably use a variant of Method 1 that
uses a HTML output to parse.

OTHERWISE

Any C would only be making Word calls anyway, as who wants to rebuild
an RTF processor? YUCK! Avoid it and stick with VBA: as you won't be
needing interface wrappers for all your calls, it will probably
actually run a bit faster for you from VBA.

First up, all the collections are dynamic, so you really want to avoid
doing things like .Paragraphs(k): when k gets to 100, Word has to quickly
serialise the first 100 paras in the defined range to get your answer.

If you move your range start ahead a para at a time and use para 1, it's
much quicker, and it automatically delivers doc end when myRange.start is
at myRange.end.
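
To see why indexing by position hurts, here is a toy cost model (Python, illustration only). It assumes, as described above, that fetching item k of a dynamic collection costs k steps, whereas taking the first item of a range that keeps moving forward costs one step. The numbers are not measurements of Word itself.

```python
def indexed_cost(n):
    """Total steps to visit items 1..n when item k costs k steps to reach."""
    return sum(k for k in range(1, n + 1))


def moving_range_cost(n):
    """Total steps when each visit takes the first item of an advancing range."""
    return sum(1 for _ in range(n))
```

Under this model, 100 paragraphs cost 5050 steps by index against 100 with a moving range, and the gap grows quadratically with document length.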

You will need to know about Range objects, and then start looking at
.Paragraphs(1).Range.Words(n).Text.

There are obviously some tricks to getting this running really quickly in
VBA; I outline numerous performance enhancements in my Word VBA for
Beginner's book, available from my website for a small fee.


Steve Hudson - Word Heretic
Want a hyperlinked index? S/W R&D? See WordHeretic.com

steve from wordheretic.com (Email replies require payment)




Re: High Performance Programming in MS Word by Nick

Nick
Fri Jul 09 07:56:01 CDT 2004

Hi Dennis,

Thanks for your reply.

>When you say you're parsing each word.. where are you going with it?
>Are these algorithms you talk about outside of the VBA environment?

The algorithm finds "feature words" in the original document
via some data mining algorithms. For example, given an article, the
algorithm is applied to the document and some keywords are highlighted.
To keep it simple, you can just think of them as very CPU-intensive
algorithms that analyze each word in a text file.

Currently, I have a pure C/C++ implementation of the algorithm (it needs
20 sec to parse this post, so imagine how intensive it is!), but I can
rewrite it in VBA or as a COM object using VC++.

What I want to know is the pros and cons of each approach (pure VBA vs.
COM object + VBA, or some other approach I don't know of), especially in
terms of speed.


Regards,
Nick



Re: High Performance Programming in MS Word by Nick

Nick
Fri Jul 09 08:01:54 CDT 2004

Hi,

First, thanks for your reply.

I think formatting is NOT important for me, so I choose Method 1:

>From VBA
> Save the bloody file as text
> Use DocStats as a rough guide to your word count.
> Call your C# to go sicko speeds.
>
> From C#
> Serialize word structures as per MS Word (any non-alpha post alpha is
> a new word start) into a bloody huge array which you can predetermine
> using the docstats result as a parm.
>
> Keep a 'done' list of serialised words worthy of marking. Re-enter the
> Word document, obtain Document.Content.Words(offset) and mark
> accordingly.

1. Why C#? Wouldn't VC be much faster?

2. How do I call the C# from VBA? Do I write the C# as a component? Sorry,
I am new to .NET. For VC, for example, should I use ATL instead?


Regards,
Nick



Re: High Performance Programming in MS Word by Jonathan

Jonathan
Fri Jul 09 06:37:36 CDT 2004


"Nick" <nick@heha.net.tw> wrote in message
news:eDaMiROZEHA.3664@TK2MSFTNGP12.phx.gbl...
> Hello,
>
> I am new to MS Word programming, currently, I am planning to do a
> project in which aims to
>
> 1. Read every words in a word document and parse it and analyze it using
> multiple data mining algorithms (they are very CPU intensive algorithm!)

Like Steve said, the last thing you need is the formatting of the document
to slow you down. Read the Range.Text property of the document into a
string, and process that. Use whatever language you decide is best for
string handling. The StringBuilder class in VB.NET is good if you have
concatenation to do, or there is an equivalent VB class module produced by
Karl Peterson which also works very fast in VBA. Take a look at
www.mvps.org/vb/

>
> 2. Bold and highlight the analyzed words in the same document

If you know where in the original string your analysed word is, you can
probably get to the same character position in the original document. This
might get a bit hairy if the doc has tables & frames in it; you'll have to
experiment. At worst, you can count how many times a particular word
occurs in the string and use the Find object to get to the equivalent
positions in the document
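
A minimal sketch of the first idea, in Python for illustration: record the character span of every whole-word occurrence in the plain string. The function name `word_spans` is made up for this example, and whether the spans line up with positions in the document (especially with tables and frames) is exactly the part you would have to test.

```python
import re


def word_spans(text, word):
    """Return (start, end) character offsets of each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [match.span() for match in pattern.finditer(text)]
```

The length of the returned list also gives the occurrence count you would need for the Find fallback.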

>
> I have really no idea where to start with, the main concern is to choose
> an efficient method to implement the system.
>
> After some searching in google, there are some suggestions:
>
> 1. Pure VBA implementation
> 2. C++/COM + VBA
>
> Some people said C++/COM + VBA is even slower than pure VBA
> implementation. Is it true? I would like to hear more suggestions on
> high performance programming in Win Word.

High performance programming is possible in Word VBA; you just have to
choose your tools right and get your algorithm right. I would concentrate
first on the algorithm and get that as well-designed as possible, and then
choose your language/implementation. If you choose to use Word VBA, then
there are all kinds of speedup tricks that can be used, but the first item
of business would be to identify the bottlenecks (eg inner nested loops) and
see where the time is being taken.
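
One simple way to find where the time goes, sketched in Python for illustration: time each stage separately before optimizing anything. The helper `timed` and the stage names are hypothetical; in VBA the same idea could be done by reading the Timer function before and after each stage.

```python
import time

timings = {}


def timed(stage, func, *args):
    """Run func(*args), add its elapsed seconds to timings[stage], return its result."""
    start = time.perf_counter()
    result = func(*args)
    timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start
    return result


# Two stand-in stages: splitting the text, then counting each word.
words = timed("split", str.split, "some sample text to analyze")
counts = timed("count", lambda ws: {w: ws.count(w) for w in ws}, words)

# The stage with the largest accumulated time is the first place to look.
slowest = max(timings, key=timings.get)
```

Whichever stage comes out slowest is the one worth rewriting or moving into compiled code first.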

Remember that it is no good getting the program to be fast if it produces
the wrong answer!


--
Regards
Jonathan West - Word MVP
www.intelligentdocuments.co.uk
Please reply to the newsgroup


Re: High Performance Programming in MS Word by Word Heretic

Word Heretic
Sat Jul 10 08:06:42 CDT 2004

G'day Nick <nick@heha.net.tw>,

I was purely talking about a compiled DLL vs. interpreted host-based
scripting. Whatever language you choose, it makes little difference to
the end result :-)

Steve Hudson - Word Heretic
Want a hyperlinked index? S/W R&D? See WordHeretic.com

steve from wordheretic.com (Email replies require payment)

