AVS/Express Optimization Tips & Tricks

Compiled by Mario Valle – AVS Italy (now at CSCS)
May 6th, 2003 – Version 1.7.2

Introduction

This is a compilation of AVS/Express speedup tricks collected from various sources. The list does not cover user code optimization, only AVS/Express related suggestions. There is no particular order. Remember only that they are not absolute guidelines: the speedup depends on the data set and the machine configuration.

Remember also that premature optimization can have an adverse effect on the application structure and development time. You must balance development time with performance. Obviously, using low-level OpenGL calls creates less overhead than AVS/Express, but the development time is guaranteed to be longer.

Thanks to all who have contributed to this document!

Document Map

Application Structure and V Code
Field Data
Simplify
User Interface Suggestions
Rendering
Saving Memory
Specific Applications
Special Techniques
Application Build
Development Environment Speedup
How to Measure Performance
Points to be investigated (and maybe added)

Application Structure and V Code

Remember that the academic elegance of a solution is sometimes interesting only in an academic context. The real world follows its own (divergent) path.

For example, a really elegant subdivision into modules, each processing a different data type, can create performance problems if you want to read and initialize data from a file (synchronization problems and so on). It is better to create a single module that reads all the data from the file.

Passing groups into a module can create performance problems. It can have a significant effect on performance if you have a complex group. You can get around this by not adding +req, +read or +write onto the group. Strangely enough this seems to happen only on Unix platforms and not on PC.

Use the instancer (<instanced=0> + instancer module) to speed up application startup. Remember that it is better to instance only the first time (active = 2) and then set visibility on/off on the instanced object.

Remove status update (set Scheduler.status_check = 0;). Possibly re-enable it in modules that perform long computations to update the status bar and/or to accept interrupts.

If a computation result is used more than once use the cache() V function around it.

This function instructs AVS/Express to cache the result of the enclosed expression, so that the expression is recalculated only if the objects referenced in the expression have changed. When an object is connected to an expression, AVS/Express by default recomputes the expression whenever the object’s value is requested.

Remember that cache only works on "prims" (float, int, etc.). You can use it on arrays of prims, but *not* on groups (i.e. fields).

Caching involves a trade-off between the time required to recalculate an expression and the space required to cache the result.
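The caching behaviour described above can be pictured with a small C sketch. This is only an illustration of the idea (recompute only when a referenced object has changed); it is not the Object Manager implementation, and the names Source, CachedExpr, source_set, cached_init, cached_get and expensive are all hypothetical.

```c
/* Hypothetical illustration only: none of this is the AVS/Express OM API. */

typedef struct {
    double input;          /* the object the expression refers to      */
    unsigned long seq;     /* bumped every time the object is modified */
} Source;

typedef struct {
    const Source *src;
    unsigned long seen_seq;    /* value of src->seq at the last evaluation */
    int valid;                 /* 0 until the first evaluation             */
    double value;              /* the cached result (the space cost)       */
    unsigned long evaluations; /* how many times we really recomputed      */
} CachedExpr;

/* Stand-in for an expression that is expensive to evaluate. */
static double expensive(double x) { return x * x + 1.0; }

void source_set(Source *s, double v) { s->input = v; s->seq++; }

void cached_init(CachedExpr *c, const Source *s)
{
    c->src = s;
    c->seen_seq = 0;
    c->valid = 0;
    c->value = 0.0;
    c->evaluations = 0;
}

/* Return the expression value, recomputing only if the source changed. */
double cached_get(CachedExpr *c)
{
    if (!c->valid || c->seen_seq != c->src->seq) {
        c->value = expensive(c->src->input);
        c->seen_seq = c->src->seq;
        c->valid = 1;
        c->evaluations++;
    }
    return c->value;
}
```

Requesting the value twice without touching the source costs one evaluation; without the cache it would cost two, which is the default behaviour described above.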

Reduce data copy between module input/output objects.

Examples of modules that ALWAYS make a new copy each time:

downsize
orthoslice
crop
interp_data

Examples of modules that reference input data:

surf_plot
extract_component
scale

Hint: write your own special case modules that avoid the copy if the output size has not changed, or if any part of the input data can be referenced. Use the merge() V function to recreate the output data.

Use #define and $include in V files instead of subobjects with a predefined and unmodifiable value (like float PI = 3.1415;).

Check for unneeded method firing. Check +notify usage on method parameters. Use verbose mode to investigate misfiring. Check usage of return status from methods (zero to avoid firing downstream modules).

Use: input => switch(enable_module, input_data); and visible => enable_module; in the output DataObject to prevent unneeded firing of modules with invisible output.

A better method is to duplicate the module definition adding an int defined as:

int+req+notify locked; /* if undefined lock the computation of the module */

If this parameter is undefined, the module does not fire.

Beware that in a multi-viewer application you can have strange interactions between different UItwoPoints / GDtrack_edit modules that manifest themselves as inexplicable application freezes.

In general creating and destroying arrays of arrays of arrays is an EXTREMELY slow thing to do in AVS/Express. I don’t know why, it just is. I had a project where I was to read in the customer proprietary data structure that consisted of arrays of structures containing arrays of structures containing… Just setting the initial size of the various arrays (which does the actual allocation in Express) took 5 minutes (no they were not particularly complex - I just had a lot: approx. 500).

Here’s my nominee for worst performance killer…

Doing a lot of array manipulation in V. If you have V code that contains chains of concat_array, interleave_array, etc., performance can be dreadful. The OM is written to optimize for storage space at all cost. Every time you access an array element, the OM follows the chain of references back to the original array. This can be a tremendous overhead if you are repetitively accessing the array elements. Either write a module to do the array manipulation or use the cache() V function.

Try to block View updates until all the data is available by setting View.View.mode => to a logical expression, so that the View update is automatic when all push/pop have been executed (i.e. in the idle state) and manual while work is still in progress.

Modules such as parse_v are very useful; they are, however, rather trigger happy. For example parse_v will run as soon as its trigger input changes, regardless of old and new values. Since AVS/Express update events are hard to control, one may often get the situation of a trigger input changing from 0 to 0. Often one wants a value to trigger a module rather than a change itself. Most of my own modules that require a trigger contain the line:

if(!trigger) return 0;

preventing them from running if the trigger input transitions to zero. The Copy_if module solves this problem and more (this link has disappeared: http://homepages.ihug.com.au/~klester/information/AVS/copy_if.htm).
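As a rough sketch of where such a guard sits, here is a stand-alone C stand-in for a module update method. A real AVS/Express method receives OM object ids and reads its parameters through the OM API; the ModuleState struct and module_update function below are hypothetical and only model the control flow.

```c
/* Hypothetical stand-in for a module update method (not the real OM signature). */
typedef struct {
    int trigger;   /* the trigger input as last read        */
    int runs;      /* how many times the real work was done */
} ModuleState;

int module_update(ModuleState *m)
{
    /* Guard on the value, not on the change: a 0 -> 0 "change" does nothing.
       Returning 0 also avoids firing downstream modules. */
    if (!m->trigger)
        return 0;

    m->runs++;     /* the module's real work would go here */
    return 1;
}
```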

Move from V code to C code.

With large arrays of grids, AVS/Express’ V scripting language does not always provide enough performance. Creating networks of V objects, many of them with automatic dimensioning and indexing of arrays, is very convenient for rapid prototyping. However, the overhead of object instantiation can be too much. When this is the case, one can collapse all implementations of parallel data structures into single V objects. Their executing method loops over all data in a compiled subroutine. The software development is a lot more demanding, but we have already seen improvements of orders of magnitude in execution time.

In the application I looked at, a single module method was changed to no longer return an error code (i.e. 0) in a common situation. That single change stopped the application from going into a near-infinite loop of dependency checking.

Consult the second paragraph under "Target Function Execution" in the user guide. Express’ dependency checking is significantly more efficient if the network contains mostly valid data. Once a node (meaning a notifier object, which maps nearly 1-to-1 to module methods) in the dependency graph is established as valid, Express no longer needs to visit the sub-dependents of that node. However, the equivalent is not true for invalid nodes. When an invalid node is visited, Express will check all of the node’s sub-dependents, even if the node was visited before.

Prefer binary data files. They create portability problems etc. but are much more compact than ASCII files and so the read phase is quicker.

Field Data

Reduce the number of cell sets used. The cell set list traversal for rendering is expensive. You can use the IAC CatCellSets project (it is used by the DXF Reader) to combine cell sets with the same cell type. Unfortunately it does not copy cell data. Use MergeCellSets instead (available from Mario Valle).

If possible, force the min and max values in the field when they stay constant. Also set min_vec and max_vec to predefined values.

Investigate if a big unstructured field can be replaced by an array of fields with some of them uniform or structured.

Use symmetries in the data to reduce computation. Use periodic extension (cloning and translating an object using references) to have better visualization without duplicating computations.

Try multiresolution grids like the IAC HiVis project.

DVcombine_sets_ARR outputs a single field that contains the entire cell set data from all of the fields in its input field array. The single output field created by DVcombine_sets_ARR can be rendered using a single GD.DataObject. This speeds up rendering, but more importantly, allows you to construct networks that process multi-block data without creating arrays of GD.DataObject objects.

Simplify

DataObjects (and GroupObjects) are expensive, use them sparingly.

The typical reasons for eliminating DataObject are:

To reduce object count in the OM. This will improve the load time of an app.
To keep methods inside the DataObject from running when the data changes. Typically, this is the datamap’s update method and GDminmax_update method. These can take time if there is a large data set.

Otherwise create lightweight DataObject leaving out for example:

DataMap & MinMax
Texture
AltObj & instancer

Yes, there are DataObjectLight and DataObjectNoTexture. But sometimes you can be more aggressive in optimization, or you should use a blend of DataObject and DataObjectLight (ever seen a completely blue object moving to a DataObjectLight?).

Remove Datamap, Props, Modes etc. from any GroupObject. They are used only in special situations (e.g. to set the same Datamap on a bunch of fields).

Beware of heavy modules like Axis3D.

Use $count_objs (or better $dcount_objs, which also counts descendants) to focus optimization efforts on areas containing huge numbers of objects. Remove all unused objects. As a rough rule of thumb, each OM object takes up 100 bytes.

Create "light" versions of standard objects (like Cross2D, point_mesh) by removing error checking and output DataObjects.

Use DV modules. Remember that some DV modules are really macros (e.g. DVdownsize) that select the real DV module based on the input field type. If this is known in advance use only the required DVdownsize_* getting rid of the DVmatch_fld computation.

Use bare bones viewers (built from UIrender_view+Mscene) instead of the standard Uviewer. Don’t include unneeded editors in the viewer.

Use the new Fast_ARR modules instead of the old _ARR ones.

Arrays of objects are really handy to speed up development. Be careful to use them only for small array dimensions, otherwise the object count explosion will make them unusable. This is what happened with the old ARR modules.

Use TileRenderer instead of Orthoslice if possible.

When to use Orthoslice?

irregular mesh
not slicing first axes
not speed critical

When to use TileRenderer?

2D slice display from 3D uniform volume
speed critical slicing
need to avoid copy operation, memory allocation
more than one slice at a time

Sometimes you can use Glyph or GeoGlyph instead of multiple DataObjects.

Investigate IAC counterpart of standard modules like:

FastGlyph (is at least 4 times faster than the standard glyph macro)
FastAdvector
CleanExtEdges (remove duplicated edges)

The surf_optimize module is used to create a coarse mesh in the regions where the surface of elevation changes gradually and a fine mesh in the regions where the surface changes rapidly. Increasing the tolerance results in a less accurate mesh containing fewer triangles and thus leads to a corresponding improvement in rendering performance. The surf_optimize module does not produce the elevation surface itself, it just creates a flat triangular mesh that can be extruded into a surface of elevation produced using the surf_plot module (see surf_plot). The algorithm used in this module is described in: Automatic Generation of Triangular Irregular Networks using Greedy Cuts by Claudio T. Silva, Joseph S. B. Mitchell and Arie Kaufman, Visualization 95 Proceedings.

If a macro is not reusable don’t bother with links on input and output ports. Well, this is a really tiny change with vanishing returns.

User Interface Suggestions

Pixmaps on buttons are expensive.

Delete all unused UI items in Viewers etc. Avoid intermediate UI panels if not strictly needed.

Don’t use the trick: UIpanel UImod_panel; to transform modules with UI (like ClickSketch, Axis3D, TextString) into DV-like modules. Rebuild them from scratch.

You can use the morph object to create a UI object that switches configuration based on the setting of the morph_type parameter. This could have been implemented by switching the visibility of the widgets as well. The advantage of using the morph object is that only the widgets that are in use are instanced at any given time. For larger examples, this can improve the performance of the system as it takes some time to instance UI objects even if they are not visible.

One way to reduce your UI object count without losing any functionality in your user interface is to reuse user interface components.

Rendering

The first thing to check is your display setup. On Unix you can go into the scene and set the visual to an 8-bit pseudo color visual even if the default visual is 24 or whatever.

Remember that the default visual used by the viewers is best. Depending on the type of accelerator board used, you may get a 24-bit visual. In any case, you can explicitly request an 8-bit visual by setting the vclass subobject in the virtual palette to 3.

The dithering and cube subobjects work only if the visual is 8 bits, that is, PseudoColor.

Valid values for vclass are:

1 = default visual
2 = "best" visual (left up to renderer to choose)
3 = pseudo color
4 = direct color
5 = true color
6 = use X11 standard colormap (valid only on Unix)

On PC’s you get only one visual, which is configured in the Control Panel -> Display -> Settings page. Make sure this is set to 256 colors. This is less desirable for 3D displays, but works fine for 2D, and you only have to push around 8 bits per pixel.

The next thing to do is set the data objects dither technique to ramp. This will avoid having to dither every pixel. This suggestion is useful only for image data.

The dith_tech subobject of every GDobject is set to cube (0) by default. To get the datamap converted to a ramp, you need to set this subobject to ramp (1) for the image you are trying to display.

Remember that the cube size is satisfied first. So if cube size is large there may not be many colors left over for any ramps. To increase the number of colors for ramps, decrease the cube size subobject in the virtual palette. Valid values are from 2 to 6. This ought to correct the dither problem and use smooth colors.
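To get a feeling for the numbers, here is a small C sketch. It assumes that a color cube of size n occupies n*n*n entries of a 256-entry PseudoColor colormap; that formula is an assumption (it is the usual meaning of a color cube, but it is not stated above), and the function name ramp_colors_left is hypothetical. In practice the window system and other applications may reserve entries too, so treat the result as an upper bound.

```c
/* Assumption: a color cube of size n uses n*n*n colormap entries. */
int ramp_colors_left(int cube_size, int colormap_size)
{
    int used, left;

    if (cube_size < 2 || cube_size > 6)
        return -1;                      /* outside the valid range 2..6 */

    used = cube_size * cube_size * cube_size;
    left = colormap_size - used;
    return left > 0 ? left : 0;         /* entries that remain for ramps */
}
```

Under this assumption a cube size of 6 leaves at most 40 of 256 entries for ramps, while a cube size of 4 leaves 192.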

At this point, with a pseudo color visual, a small cube size and an object displayed with a ramp, by selecting the image and using the datamap editor, you can dynamically change the colors in the objects’ ramp in the HW color table. Be sure to set the Immediate toggle (i.e. no cache).

If image rendering speed is critical set interp_type = 0 (Point) instead of Bilinear. Remember also that with OpenGL rendering only Point and Bilinear interpolation types are available.

Once the object is built, and assuming the cache is enabled (the default), any subsequent transformation just causes the GD to re-execute the display list.

Let’s tackle what causes us to recalculate the display list first. Here are some things that cause the display list to be recalculated:

the input data changes
the rendering modes change

Those are the easy ones that apply to all renderers. In some renderers, some of the properties may cause the display list to be recalculated. For example, in the OpenGL renderer, changing object opacity causes display list recalculation since the OpenGL API is brain dead.

OpenGL implementations typically have a large overhead for display list creation/deletion. If you have a geometry that will be relatively unchanging, use display list mode, as the performance will be higher, at the added cost of an extra copy (i.e. the display list). However, if the geometry is constantly changing, perhaps doing some form of animation, it can actually be faster to use immediate mode (i.e. not cached).

Also remember that connecting the same DataObject to a 3D and a 2D viewer can interfere with the cache functioning on the 2D viewer.

When possible enable backface culling (e.g. for closed objects). Setting cull mode (under geometry attributes) to cull back increases rendering speed (the teapot rotation test frame rate improves by 24%).

Set the correct dimension for the object cache (the default is usually insufficient). Remember to set the value also on the alternate object if it is used and it is complex. For flexibility set the value using an environment variable (cache_size => getenv("APPL_CACHE_SIZE")) that can be set in the avsenv file.

Remember that the cache_size is a maximum value. The memory is allocated only when needed. So it is harmless to set it to a big value.

Turn off AutoNormalize if it is not needed.

Turn off Normals generation if surface rendering is Flat Shading or No Lighting. If you need or accept a faceted look experiment with Normals=None+FlatShading vs. Normals=None+Gouraud. Sometimes the second alternative looks better.

Sometimes the gained performance without Normals generation counterbalances the quality loss.

Turn off Normals generation for points (as pixel) and for simple glyph like Diamond3D.

Use Alternate Object (maybe with a simplified version of the object) to speed up the interactive manipulation of the object.

Set "Force 2D" especially for alternate object wireframe or bounding box. It is enabled setting Obj.space to "Force 2D" (instead of "Match Camera"). Beware that on PC this works only with software renderer.

Use the correct Field Conversion type. Remember that this object (surf_conv) controls how a surface is converted to a triangle strip. This feature is supported when the cell type is triangle, quad, or polyhedron. The modes are:

Simple (0)
Optimal (1)
None (2)

Simple:

In simple surface conversion, a triangle strip is constructed by adding zero-area triangles. In a typical case, 10,000 triangles are converted into a triangle strip that has 60,000 triangles in the strip.

Optimal:

The optimal case rearranges the list of cells into a triangle strip that is a more efficient representation of the surface. In a typical case, 10,000 triangles turn into a triangle strip that is 20,000 long. Note that the optimization provided by the optimal conversion depends on the existence of shared nodes in the input field.

While the triangle strip can be considerably shorter when using the optimal surface conversion, this manner of surface conversion takes extra time and memory. The amount of temporary storage used by the optimal conversion can roughly be calculated by multiplying the number of triangles by 240. The result is the number of bytes. For example, if you have 10,000 triangles, the temporary storage requirements are 2,400,000 bytes or about 2.3 Mbytes.

Optimal surface conversion is faster with smaller chunk sizes. It is worth experimenting with these controls to determine the configuration that provides the best speed/memory tradeoff for the data sets you are rendering.

None:

This setting supports the direct rendering of triangles and quad surfaces (available with OpenGL 1.1) without converting them to triangle strips.

This mode reduces memory usage and can improve performance (e.g. when the data is changing).
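The rule of thumb quoted for the optimal conversion (about 240 bytes of temporary storage per triangle) is easy to wrap in a helper when deciding whether to enable chunking. The sketch below only encodes that figure; the function names are hypothetical.

```c
/* Rule-of-thumb figure quoted in the text for optimal surface conversion. */
#define BYTES_PER_TRIANGLE 240UL

/* Estimated temporary storage, in bytes, for converting `triangles` cells at once. */
unsigned long optimal_conv_temp_bytes(unsigned long triangles)
{
    return triangles * BYTES_PER_TRIANGLE;
}

/* Convenience conversion to Mbytes (1 Mbyte = 1024 * 1024 bytes). */
double bytes_to_mbytes(unsigned long bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

For 10,000 triangles this gives 2,400,000 bytes, or roughly 2.3 Mbytes, matching the example in the Optimal description.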

You can control the amount of memory used by the first two conversion modes. The parameters controlling this are:

chunk:

This parameter controls whether chunking is enabled. 0 means disable; 1 means enable. Chunking allows you to control the temporary storage use during surface conversion. If chunking is disabled, the whole surface is converted at once. If chunking is enabled, the surface is converted in chunks as specified by the surface and line chunk sizes. The size is the number of primitives to convert at a time.

surf_chunk:

This parameter controls the number of cells to process at once when chunking is enabled and a surface rendering mode is selected.

line_chunk:

This parameter controls the number of cells to process at once when chunking is enabled and a line rendering mode is selected.

surf_subdiv:

This parameter controls whether quad cells are subdivided and to what extent. Valid values are in the range 1 to 4. A value of 1 means that no subdivision will be done. Values of 2 through 4 mean that each quad cell will be divided up to the number specified if the node data values at the vertices of the quad cell vary too much. For a value of 2, the values may vary up to 50% of the total range of the data. For a value of 3, the values may vary up to 33% of the total range of the data. For a value of 4, the values may vary up to 25% of the total range of the data.
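The percentages given for surf_subdiv (50%, 33%, 25%) follow a 1/n pattern. The C sketch below encodes that reading; it is an interpretation of the description above, not the renderer's actual code, and the function names are hypothetical.

```c
/* Allowed variation as a fraction of the total data range:
   1/2, 1/3, 1/4 for surf_subdiv = 2, 3, 4.
   Returns 0.0 for surf_subdiv = 1 (no subdivision) or an invalid value. */
double subdiv_tolerance(int surf_subdiv)
{
    if (surf_subdiv < 2 || surf_subdiv > 4)
        return 0.0;
    return 1.0 / (double)surf_subdiv;
}

/* Nonzero if the node data at the quad's vertices varies by more than the
   tolerance, i.e. if the quad would be a candidate for subdivision. */
int quad_needs_subdivision(double cell_min, double cell_max,
                           double data_min, double data_max,
                           int surf_subdiv)
{
    double tol = subdiv_tolerance(surf_subdiv);
    double range = data_max - data_min;

    if (tol == 0.0 || range <= 0.0)
        return 0;
    return (cell_max - cell_min) > tol * range;
}
```

With data ranging from 0 to 100, a quad whose vertex values span 40 units is left alone at surf_subdiv = 2 (limit 50) but subdivided at surf_subdiv = 3 or 4 (limits of about 33 and 25).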

Use accelerate mode if you have a small object that moves over a large one. To activate it, mark the small object as dynamic. The big one is marked static by default. Then set accel=1 on the DefaultView.

Test the speed gained. On some cards (notably on PC) accelerate mode is slower than normal mode (the example Accel2Dglyph is about 2% slower).

Set pickable = 1 only on objects you really need to pick. Investigate the setting that restricts the picking to only the required cell set types.

Use draw_mode XOR for lines that move over a background object. This avoids re-rendering the background object each time the line is moved.

Texture: on low-end PCs I have observed rendering speed double using tile = Wrap (instead of Clamp), Blending = Replace and Type = Single level.

Type = Mip-map severely hurts performance.

Always set Blending = Replace.

For better OpenGL rendering performance don’t use DepthCueing.

Use a single Directional light. No bidirectional, point or spot lights. (Teapot test: -5% with bidirectional light).

Use antialiased lines only if they are supported by hardware.

If you have complex transparent objects and you can trade visual quality for performance, try to disable two-pass transparency (setting the environment variable XP_2PT_DISABLE) and check if this improves performance.

DVthresh_null removes cells where the selected component’s node data is equal to the null data value; if the null_data flag is not set on the input field no processing is done. You would use DVthresh_null, for example, to improve the performance of the renderer when rendering fields with NULL data. While the renderer will automatically delete NULL data cells, it must do so every time the object is transformed. With DVthresh_null as a data "preprocessor," this overhead is removed.

Use Viewer3D for 2D objects only when needed. Remember that some techniques now work also on a 2D viewer (i.e. transparency and NULL data).

Set the 2D viewer to OpenGL (remember that the default is the software renderer).

Use smallest acceptable 3D renderer window, especially with software renderer.

Enable the viewer Frame Buffer Output only if needed. Its update kills performance.

Try to use stroke text in TextGlyph. This should speed up your object manipulations.

Saving Memory

float test[1000] => 12.34;

This will at evaluation time return the value 12.34 as the array value for each cell. AVS/Express does NOT allocate the full array size thereby saving memory. But beware because sometimes it does not pass notifications.

Use links instead of &group_ref as macro input. It reduces memory usage.

In module input parameters remember to use reference mode (&) on structures and arrays.

To release the memory in situations similar to the following:

group Test {
     int NPP;
     float Coords[NPP][2];
};

Try NOT setting NPP to zero. Instead do:

Coords.set_array(OM_TYPE_FLOAT, NULL, 0, OM_SET_ARRAY_FREE); or you could always use OMparse_buf with "Test.Coords => ;"

When you dimension an array based on a variable, the Object Manager automatically redimensions the array based on the value of the variable.

However, when you change the value of the variable, *nothing* happens immediately to the array. The array is not redimensioned until you reference it. If the variable changes value ten times before you reference the array, the Object Manager only has to redimension/reallocate the array once.
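The deferred redimensioning described above can be modelled in a few lines of C. This is an illustration of the principle only, not how the Object Manager is implemented; LazyArray, lazy_set_dim and lazy_get are hypothetical names.

```c
#include <stdlib.h>

/* Hypothetical model of an array dimensioned by a variable (like Coords[NPP]). */
typedef struct {
    size_t wanted;      /* current value of the dimension variable */
    size_t allocated;   /* size of the buffer actually allocated   */
    float *data;
    unsigned reallocs;  /* number of real (re)allocations so far   */
} LazyArray;

/* Changing the dimension only records the new value: nothing is allocated. */
void lazy_set_dim(LazyArray *a, size_t n) { a->wanted = n; }

/* Referencing the array is what triggers the (single) reallocation. */
float *lazy_get(LazyArray *a)
{
    if (a->wanted != a->allocated) {
        if (a->wanted == 0) {
            free(a->data);
            a->data = NULL;
        } else {
            float *p = realloc(a->data, a->wanted * sizeof *p);
            if (p == NULL)
                return NULL;        /* keep the old buffer on failure */
            a->data = p;
        }
        a->allocated = a->wanted;
        a->reallocs++;
    }
    return a->data;
}
```

Calling lazy_set_dim() ten times and then lazy_get() once performs a single allocation, which is the behaviour the text describes for a dimension variable that changes ten times before the array is referenced.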

Remember that scalar floats are stored as doubles, whereas arrays of floats are stored as single-precision floats.

If in your code you have something like: system("rm -rf file.dat"); watch out! Unix will do a fork and an exec to spawn off a new shell. For a brief moment, after the fork, but before the exec, you will have TWO copies of your Express app in your virtual address space.

To avoid a crash due to memory exhaustion, the safest way to go is to use the low-level system calls; for example, use "unlink" to remove files. Another way to go is to put this kind of stuff in the "user" process and make sure the "user" process gets spawned off early, before the Express app is bloated with the dataset you are working with.

On some Unix platforms there is a vfork() call that avoids memory duplication.
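A minimal sketch of the unlink() alternative to system("rm ...") follows. The wrapper name remove_file_quietly is hypothetical; unlink() itself is the standard POSIX call. Note that unlink() removes a single file only, so it replaces the -f part of rm -rf but not the recursive part.

```c
#include <errno.h>
#include <unistd.h>

/* Remove one file without spawning a shell (no fork, no exec).
   Returns 0 on success or if the file did not exist (like rm -f),
   and -1 on any other error, with errno set by unlink(). */
int remove_file_quietly(const char *path)
{
    if (unlink(path) == 0)
        return 0;
    return (errno == ENOENT) ? 0 : -1;
}
```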

Specific Applications

Movie play

Set the GDview’s buffer flag to 0 (i.e. set single buffer mode). When used with GDview clear = 0, the view window is not cleared between frames and any object is rendered directly to the window.

I just checked on my machine and appear to be getting 20 to 30 frames per second. Obviously this is with nearest neighbor interpolation (Point). Bilinear interpolation is around 1 fps.

For best performance when slicing through a volume (cine-ing), make sure the images are contiguous in memory (within each image).

Volume rendering

Use BTF if the card supports it. The only place I believe that we are at all suboptimal with the BTF approach is when the volume does not fit entirely into texture memory.

An important thing to note, though, is that the default setting for the alpha channel of DataObjects in AVS/Express is not very suitable for volume rendering. You’ll get much better quality and much more detail in most cases if you set the objects you render to use a ramp in the alpha channel instead of constant alpha.

Unfortunately, using an alpha ramp has a severe impact on rendering performance, which becomes quite bad for a real dataset: you effectively disable at least one of the tricks AVS/Express uses to speed up the rendering. Of course, the effects you’ll get when playing around with alpha are even more pronounced if you happen to be using an algorithm where the alpha value goes into the equations used to calculate the result of the ray casting (Direct Composite, for instance).

Imaging applications

On PC try to reduce color depth. On Unix set vclass = 3, dithering = ramp (this makes sense only with 8-bit visuals) and cube size = 4 or 5.

If image rendering speed is critical set interp_type = 0 (Point) instead of Bilinear. Remember also that with OpenGL rendering only Point and Bilinear interpolation types are available.

Finally, and this is only if you have hardware texture mapping, you could texture map the image onto a 2x2 uniform field. If the texture mapping is good, you may get very good performance.

Special Techniques

Bypass Renderer with Render-methods. For example you can use OpenGL routines on the AVS/Express viewer. Examples are available (fastquads and vf_image).

Use surf reduction and downsize modules to create e.g. an alternate object or when you need quick interaction and can tolerate a longer setup.

If you have a fixed geometry try to save it in a file and reload it at runtime without recomputing it every time.

Use the GD API (e.g. GD2d_draw_line()) to write simple geometry on a (2D) view instead of creating a mesh.

Use illuminated_line animation effects instead of particle_advector.

Use 3D texture if your hardware supports it (Excavate_brick3D does this). The speedup happens only if the full volume fits in texture memory.

Use node data with id = 668 for sphere glyphs instead of Sphere+Glyph modules. This is especially efficient with OpenGL renderer.

Consider memory-mapped files for file access. mmap() returns a pointer to the data, but the data is not read into memory up front; beneath the surface, kernel I/O functions fetch it as it is accessed. In Express you need to use the OM_SET_ARRAY_STATIC flag on the allocation routine. You can therefore read in data sets the size of your virtual address space without needing that much memory. Of course performance is then limited by disk speed.
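The C sketch below shows the mmap() half of this technique for a raw file of floats, using only standard POSIX calls. The MappedFloats struct and the map_floats / unmap_floats names are hypothetical, and handing the pointer to Express (with OM_SET_ARRAY_STATIC) is deliberately not shown because it depends on the module at hand.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical holder for a read-only mapping of a raw float file. */
typedef struct {
    const float *data;   /* start of the mapped file          */
    size_t count;        /* number of whole floats in the file */
    size_t bytes;        /* mapped length, needed by munmap()  */
} MappedFloats;

/* Map `path` read-only. Returns 0 on success, -1 on error.
   No data is read here; pages are faulted in on first access. */
int map_floats(const char *path, MappedFloats *out)
{
    struct stat st;
    void *p;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    out->data = (const float *)p;
    out->bytes = (size_t)st.st_size;
    out->count = out->bytes / sizeof(float);
    return 0;
}

/* Release the mapping; the pointer must not be used afterwards. */
void unmap_floats(MappedFloats *m)
{
    if (m->data != NULL)
        munmap((void *)m->data, m->bytes);
    m->data = NULL;
    m->count = 0;
    m->bytes = 0;
}
```

The mapping must stay alive for as long as any object references the pointer, so call unmap_floats() only after the array has been released on the Express side.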

You have an area of coverage (say imagery and elevation data) that is huge in xy dimension and very high resolution. What you want to do is look at the whole image in reduced resolution mode, and then as you zoom in, you increase the resolution of the image you use. This is often called mipmapping. There are several problems. One is the large data size. Another is figuring at which point to switch resolutions. One of the biggest is how to make the transition invisible or even subtle…

You preprocess the data and produce "power of two" downsampled versions of each image. Since the first downsized version is .25 as large as the original and the second level is .25 * .25 = 0.0625, etc., the entire collection is about 1.33 times as much data as the original image (the series 1 + 1/4 + 1/16 + … sums to 4/3). So you get the benefits of faster image reading (because you only read the appropriate resolution for the scene you’re rendering) at not much storage cost.
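The storage cost of such a pyramid can be checked with a few lines of C. The sketch halves each dimension per level until a 1x1 image is reached and returns the total size relative to the original; the function name pyramid_storage_ratio is hypothetical, and the geometric series 1 + 1/4 + 1/16 + ... bounds the result by 4/3.

```c
/* Total pixel count of a power-of-two image pyramid divided by the pixel
   count of the full-resolution image. Each level halves width and height
   (integer division, never below 1) until a 1x1 image is reached. */
double pyramid_storage_ratio(unsigned long width, unsigned long height)
{
    double base = (double)width * (double)height;
    double total = 0.0;

    if (base == 0.0)
        return 0.0;

    for (;;) {
        total += (double)width * (double)height;
        if (width == 1 && height == 1)
            break;
        width = (width > 1) ? width / 2 : 1;
        height = (height > 1) ? height / 2 : 1;
    }
    return total / base;
}
```

For a 1024 x 1024 image the ratio comes out just below 4/3, so the complete set of levels costs about one third more storage than the original alone.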

Drill down into the Viewer.Scene.View and you’ll see an input called “buffers”. This allows you to pass in an image and/or frame buffer for the View to be initialized with.

Use the scene buffer mechanism to pass rendered geometry between viewers (GDkit manual page 6-172).

For example, passing the contents of the frame buffers between views can be used to chain renderers. On a machine that does not perform texture mapping in hardware, you could render the texture mapped objects using the software renderer, then pass the frame and Z-buffers, along with the non-texture mapped objects in the scene, into a scene set to use the hardware renderer. Note that in order to synchronize the two views the top transforms of the two views would also need to be connected in the network. Such an application allows you to take advantage of hardware performance when transforming non-texture mapped objects, where otherwise everything would have to be software rendered as long as any texture mapped objects existed.

Enable the Frame Buffer Output only if needed. Its update kills performance.

Application Build

Be sure to remove debug flags. This is especially true with MSVC++. The default AVS/Express .dsp project has no optimization set. If you use express.mk, define G=/O2 /Ob before running nmake. If you use express.dsw, set the flags manually.

Remember also that runtime generation does not recompile the sources; it simply links the development environment .obj’s. So recompile them with optimization before creating the runtime.

Comment out debug code in the release application. Remember to remove all debug printf(). On the PC platform AVS/Express sometimes freezes until you press RETURN in the VCP if there are pending printf() writes.

Be sure to relink everything into the express process. Obviously the real performance test is with the runtime application.

Of course delete kits you are not using. Remember that there are bugs related to the disabling of DBkit and AGkit on PC platform. Use a rebuild script to remove unneeded references. Also use a script for runtime build that removes unneeded libraries.

Under Windows there are the commands walign.exe and winalign.exe (from the Win98 Resource Kit) that change the executable layout to speed up application loading. Try them, but normally with a modern computer configuration the effect is barely noticeable.

Speeding up startup of apps with many DSOs on SGI (ftp://ftp.sgi.com/sgi/dev/davea/software.html)

For folks building many DSOs (shared libraries) that all link together (say close to 100 DSOs), it may be true that some symbols can be “hidden” and that this will speed up application load time. It can only help if one builds (and can thus safely rebuild) many DSOs. System DSOs and DSOs one does not build oneself do not count. If the DSOs are built entirely or substantially with C++, there can be many hidable symbols.

While this note does not discuss -Bsymbolic (an ld option) you should also try -Bsymbolic in your DSO creation, as that can (for an app with a large number of DSOs) have the effect of dramatically reducing startup time. -Bsymbolic makes “preemption” impossible (“man dso” for more info), but for most developer-created DSOs preemption of its symbols is never used anyway. Using -Bsymbolic and hiding all possible symbols in creating a given DSO is a good idea too. C++ symbols relating to RTTI (run time type information) must not be hidden.

You will want to pick up commentary and shell scripts to understand the details (in a shell archive). Last updated June 30, 2000. (ftp://ftp.sgi.com/sgi/dev/davea/whattohide.shar)

Second, pick up the optionalsym program (33025 bytes) (source in a shell archive) as one part of verifying that the approach actually saves time. (ftp://ftp.sgi.com/sgi/dev/davea/optionalsym.shar)

Development Environment Speedup

Remove module flashing. You can add at the beginning of your V file the following line: NetworkEditor.optionsMenu.flashingItem.option.set = 0;

Use precompiled V files (.vo). Define the libraries as: "modules" MyModules;

Beware: during development it is better to specify the .v extension in the construct above to avoid problems due to misalignment between .v and .vo files.

Set <compile_subs=0> on macro only libraries.

Use flibrary+buffered to speed up load time (only if there are modules not immediately loaded into the application).

Create runtimes outside AVS/Express with the $save_compiled_project method. Otherwise you double the memory needed for runtime generation.

On PC set the MSVC++ flags Incremental Compilation and Minimal Rebuild.

How to Measure Performance

To get V loading timing data:

$timer_start
    ...
$timer_get

Use $count_objs/$dcount_objs to focus your object reduction efforts.

These commands help you understand how many objects are currently created to support the specified object. They count only objects that have actually been created to support this instance and do not count objects that are currently inherited from base classes, so the count can be less than the number of objects displayed by the $list command. The $count_objs command returns the number of objects defined in this object only. The $dcount_objs command prints the total number of descendants of this object (recursively counting subobjects as well).

Verify memory usage ($set_arr_trace from VCP or setting ARR_TRACE=1 before starting AVS/Express). Use tools to find memory leaks (Purify or Heap State Reporting Functions in MSVC++).

Individual objects can be traced on specific operations by adding the +trace attribute to them and then executing $set_trace <flag1> <flag2>... <flagn>, where the flags are:

all	all operations
destroy	the destroy_subobj operation
subobj	add_subobj or del_subobj
set_val	any set value operation
get_val	any get_value operation
notify	any add or delete notify operation
invoke	the invoke operation on methods
compile	the compile operation
array	trace array alloc’s and free’s

The invoke flag is especially useful for diagnosing unneeded method firing without the information overload of $set_verbose.

Use verbose mode to investigate misfiring. Use arguments to restrict the kind of events you want to see. To speed up switching, create two V files:

on.v containing:

$set_verbose_fp express.log
$set_verbose functions events wide

and off.v containing:

$unset_verbose all
$set_verbose_fp close

so from the VCP you can collect only the interesting trace by $include'ing the appropriate file.

Also $notify [obj] can help to understand what notifies a particular object.

Before blaming AVS/Express, use profiling or instrumented code in a slow module to check whether the delay is in the user code or in the AVS/Express module usage.
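As an illustration of instrumented code, the sketch below wraps the expensive part of a module in a CPU timer using only the standard C library. The function names are hypothetical and user_computation merely stands in for the real work; nothing here is an AVS/Express call.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the expensive part of a module's update function. */
static double user_computation(int n)
{
    double acc = 0.0;
    int i;

    for (i = 1; i <= n; i++) {
        acc += 1.0 / (double)i;
    }
    return acc;
}

/* Runs the computation and reports the CPU time it consumed, in seconds. */
double timed_user_computation(int n, double *cpu_seconds)
{
    clock_t start, stop;
    double result;

    assert(cpu_seconds != NULL);
    start = clock();
    result = user_computation(n);
    stop = clock();

    *cpu_seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    fprintf(stderr, "user_computation(%d): %.3f s CPU\n", n, *cpu_seconds);
    return result;
}
```

If the time reported here is small compared with the delay seen in the application, the cost is more likely in how the module is wired into the network than in the user code itself.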

The Instrumentation IAC project contains a number of modules designed for development performance tuning and debugging in AVS/Express applications. These modules are relatively small and are intended to be added in specific places within an application to report time, memory information, object usage or module execution when triggered. They can either be dropped into running networks to report activity using the network editor, or can be left inside an application for use in a runtime for benchmarking.

The following modules are provided in the Instrumentation project:

time_activity	Records the time and change in memory usage between a trigger event and when the application returns to an "idle" state.
time_on_off	Records the time and change in memory usage between a start trigger event and an end trigger event.
object_stats	It surveys all modules at the same level as its peers and generates a status report of object names with object counts.
gated_verbose	When the module is enabled and the trigger source event fires, this module temporarily turns on "Verbose Functions" reporting. As soon as all events currently in the queue to fire are processed and the Express OM returns to the "idle" state, verbose function reporting is switched off.
gated_trace	When the module is enabled and the trigger source event fires, this module temporarily turns on "ARR Trace Enable" reporting. As soon as all events currently in the queue to fire are processed and the Express OM returns to the "idle" state, ARR trace reporting is switched off.

Points to be investigated (and maybe added)

Copy of data between OM and module code. Investigate OMXobj::get_array() found under OMget_array help page and OMXobj::set_array() found under OMset_array. OK, OMset_array seems useful if you need to "reallocate" the array inside of a loop. For example, something like:

while (not end of file) {
       read another value
       reallocate the array to expand to the new size
}

OMset_array(...) /* set the array’s values */
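The pattern above can be sketched in plain C as follows. This is only an illustration under stated assumptions: FloatBuffer, buffer_push and read_all_values are hypothetical names, the input is assumed to be whitespace-separated numbers, and the final hand-off to the Object Manager is left as a comment because the exact OMset_array arguments are not given here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable buffer kept in user memory while the file is read. */
typedef struct {
    float *data;
    size_t count;
    size_t capacity;
} FloatBuffer;

/* Appends one value, doubling the capacity when full so that the number of
   reallocations grows only logarithmically with the number of values. */
int buffer_push(FloatBuffer *buf, float value)
{
    assert(buf != NULL);
    if (buf->count == buf->capacity) {
        size_t new_capacity = buf->capacity ? buf->capacity * 2 : 64;
        float *grown = realloc(buf->data, new_capacity * sizeof *grown);
        if (grown == NULL) {
            return -1;              /* original buffer is still valid */
        }
        buf->data = grown;
        buf->capacity = new_capacity;
    }
    buf->data[buf->count++] = value;
    return 0;
}

/* Reads every value from the stream into the buffer. */
int read_all_values(FILE *fp, FloatBuffer *buf)
{
    float v;

    while (fscanf(fp, "%f", &v) == 1) {
        if (buffer_push(buf, v) != 0) {
            return -1;
        }
    }
    /* At this point buf->data and buf->count would be handed to the
       Object Manager in a single call (not shown). */
    return 0;
}
```

Start from a zeroed FloatBuffer ({NULL, 0, 0}); the caller frees data unless ownership has been passed on.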

You can use OMset_array with the mode flag OM_SET_ARRAY_STATIC to get the Object Manager to directly access an array in a Fortran common block for example. Does this reduce memory copies between user code and Object Manager?

GeomAPI routines usage. Are they useful to speedup the application?
Rendering a structured mesh is more efficient if cache is disabled (GDkit manual page 4-4). Check.
Tiled field rendering. Any idea what it is? (GDkit manual page 4-9).
Buffer mode: double (MBX) (Multi-Buffering Extension (MBX)) and double (Pixmap) [valid only for sw renderer?]. What is the difference performance wise?
XP_NO_TEX3D gd\ogl\ogl_inq.c // disable Texture3D OGL extension
Check why polyhedron cells are suggested for efficiency.
Why does the PC rerender the scene on occlusion?

OM:executing: (seq: 69271)

func   :     SingleWindowApp.Uviewer3D.Scene.View.View.win_func
arg    :     SingleWindowApp.Uviewer3D.Scene.View.View
events :     value changed
cause  :     SingleWindowApp.Uviewer3D.Scene.View.ViewUI.ViewWindow.handle.event
done   :     SingleWindowApp.Uviewer3D.Scene.View.View.win_func

It happens only with the OGL renderer, not the sw renderer. It seems that the front buffer is cleared before the back buffer is recomputed.

Provide unit length normals instead of letting OGL compute them.
If I have:

int a;
int b;

Which is faster b => a or b => .a ?

I presume the latter, since the former means searching for a in this scope and upwards. But how much of an overhead is it? The reason I ask is that I hand code my V and just write b=>a if a and b are in the same scope (I always use <-’s if not), and some of my applications are running at >10,000 lines of V in total now.

library+global+sort Basic<indexed=1> to create binary buffered libraries. A library cannot be buffered only if it contains virtual data.
Define OM_NO_ERROR_CHECK to remove checks for infinite loops (after the network has been debugged obviously!).
From OpenGL performances: MipMap textures nearest.
In Express 6.1 if available the OpenGL extension GL_EXT_rescale_normal has been enabled also on PC. It can be disabled defining the environment variable XP_OGL_NO_RESCALE_NORMALS.
This is really strange behavior. It can be worth investigating if a downstream module runs more slowly than expected.

In the upstream module when the definition:

     fields<NEportLevels={0,3}>;   // fields is an array of fields
};

is changed to:

     fields;    // fields is an array of fields
     mlink+OPort2 olink => .fields;
};

An average run of the downstream module now takes 65 seconds instead of 114!

If you’re rendering the data, unstructured Fields are less compute-intensive than uniform Fields because all of their coordinates are already there; they do not have to be computed.
If you’re visualizing the data (slicing or isosurfacing it, for example), uniform Fields are faster because you don’t have to do things like compute the nearest neighbor (it’s simply the next element in the array).
Use textures instead of default vertex generation if the video card supports them.
Turn off cache for images. Why?
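The uniform versus unstructured trade-off noted above can be pictured with a small C sketch. It is not AVS/Express code: UniformGrid2D, expand_coordinates and next_node_in_x are hypothetical names, and a 2D grid stored row by row is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical uniform grid: node coordinates are implied by origin and
   spacing instead of being stored. */
typedef struct {
    int nx, ny;
    float origin[2];
    float spacing[2];
} UniformGrid2D;

/* Computes explicit xy coordinates for every node. This is the extra work
   needed when explicit coordinates are required, for example for drawing. */
void expand_coordinates(const UniformGrid2D *g, float *xy)
{
    int i, j;

    assert(g != NULL && xy != NULL);
    for (j = 0; j < g->ny; j++) {
        for (i = 0; i < g->nx; i++) {
            size_t node = (size_t)j * (size_t)g->nx + (size_t)i;
            xy[2 * node]     = g->origin[0] + (float)i * g->spacing[0];
            xy[2 * node + 1] = g->origin[1] + (float)j * g->spacing[1];
        }
    }
}

/* Neighbor lookup on a uniform grid is pure index arithmetic: the next node
   along x is the next array element, or -1 at the end of a row. */
int next_node_in_x(const UniformGrid2D *g, int node)
{
    assert(g != NULL && node >= 0 && node < g->nx * g->ny);
    return (node % g->nx) + 1 < g->nx ? node + 1 : -1;
}
```

An unstructured field would already carry the xy array, so there is nothing to expand, but it has no equivalent of the one-line neighbor lookup.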