Home

Roger Riordan: Career

 

I wrote the first version of VET in April 1989, when the ‘Stoned’ virus appeared in the computer labs at the Chisholm Institute of Technology.  The virus was not supposed to be harmful, but we had one lab full of Olivetti PCs, which had a nonstandard BIOS, and when the virus infected them it overwrote the FAT table (which defined where all the files were stored on the hard disk), and effectively destroyed everything on the disk. 

It took the technicians 20 minutes to reinstall everything, and the next student could destroy it again in 30 seconds, so we had a problem.  We found one Israeli program which supposedly removed the virus, but it cost $400 per PC, which was out of the question for us, and in any case I realised that unless we could clean up the student's computers we had no hope of cleaning up our own.

I got a sample of the virus, analysed it, and worked out how to remove it.  I called my program VET, because you VET a document for errors, and I gave to the students as shareware.  It detected just this one virus, but within a week we had detected three or four more, and we added them as we found them. 

For a while things progressed at a manageable rate, with a new virus every few weeks, but then things started to hot up, and by 1991 we found ourselves trying to deal with several new viruses a day.  There were already a few special cases, but for most of the viruses we were able to choose what we hoped was a unique search string, and added it to a table.  Then, every time we wanted to check a file, we would load a suitable chunk of it into memory, and check it against all the templates, one after the other.

As the number of viruses grew VET began to slow down, and I started looking for more efficient ways to conduct the search.

Eventually I hit on an algorithm which allowed me to search for all the strings simultaneously, using an extremely fast search procedure.  When I first devised the algorithm I found it hard to believe it would really work, but it did, and we immediately introduced it into VET.  I named the algorithm Polysearch, and I described it at a conference in New York in March 1992.1

Unfortunately I seem to have thrown out my original copy of the paper, and I have had to recreate the algorithm from memory.  I have not programmed in assembler for some years and had forgotten all the instructions, so I will not guarantee that my sample code will actually work, but it is near enough to give you the idea.

The general idea was that if I looked at the suspect file 8 words at a time, and I assumed that the words were chosen randomly (which, for machine language, was a decidedly rash assumption) each word could have 65,536 possible values, so that even if there were 10,000 search strings there was only a one in 6 probability that any given word would appear in any of the search strings at any given location. Both templates and the file being searched will generally contain microprocessor instructions, which are probably not very randomly distributed, and this would materially increase this probability.

If all eight words did appear sequentially in any of the strings, the procedure would give an ‘Alert’, and I would then have to compare the preceding eight words with each of the templates to see if it matched any given template.  This would be a relatively slow procedure, but if we assumed that the probability of any given word occurring in one of the templates was one in three, the probability of getting an alert would still only be 1 in 3, raised to the power 8, or about 1 in 6000. This is still a relatively low value, and would be unlikely to slow the search down significantly.

It was often difficult enough to find any search string, let alone one 8 words long, so I settled on taking a nine byte string, and treating it as eight overlapping words. Presumably this increased the probability of alerts, but it did not appear to significantly degrade the performance.

The challenge was to devise a way of quickly searching the target eight words at a time, and seeing if they appeared in any of the strings in the correct order.

I did this in several steps.

  1. Select all the templates, and sort them numerically.  This is done during the compilation of the program.

  2. Generate a 64KB ‘Translation table’.  This has a separate entry for each possible value of a 16 bit word.  To create the table each template is scanned in turn.  The first word value is used as an index into the table, and bit zero of that address is set.  The same procedure is repeated for each word in turn, but each time the next bit of the target word is set.  When this process is finished each byte in the table will have bit zero set if the corresponding word appears in the first position of any of the templates, bit one will be set if it's address is the second word in any of the templates, and so on. (In the original program this was done as VET was being loaded, but with disk space now costing essentially nothing, I expect that today the translation table would be supplied already compiled.)

  3. Load the first word of the target into a register (say BX) and shift a ‘1’ into bit zero of AL. Then, using BX as an index into the translation table, AND AL with the contents of the corresponding address in the translation table.

  4. Repeat this process to scan along the target until it has been it has completely scanned. At each step in the process, if the preceding eight words have all occurred in the right order in any of the strings a ‘1’ will be shifted out of bit seven of AL, and will set the carry flag, indicating an alert.

  5. Whenever an alert occurs search the template table to see if the preceding eight words correspond to any of the templates.  If a match is found it means that the target contains this template.  This search is much slower than the scanning process, but as it should only occur infrequently it does not significantly affect the overall search time.

The critical part of the procedure is the loop that scans the target string.  In the original program this was approximately as follows:

 

Begin:

Mov DS, Target_base_address

 

 

Mov SI, 0

; Point DS:SI at the target

 

Mov CX, Target_len - 1

; Enter length in CX

 

Mov AL, 1

; Set bit ‘0’ in AL

 

Mov ES, Txtable_base_address

; Point ES:BX at the translation table

 

Mov BL, [SI]

; Load first word into BL

 

Inc SI

; And advance SI to the 2nd byte

 

 

 

Repeat:

Mov BH, BL

; Load the next word into BX

 

Mov BL, [SI]

; (We are using overlapping words, hence the Mov BH, BL)

 

Inc SI

 

 

And AL, ES:[BX]

; AND AL with the appropriate entry in the translation table,
; clearing every bit for which this word does not appear in the corresponding position in any template.

 

STC

; Set the carry flag.

 

RCL AL

; and rotate it into bit ‘0’, while rotating bit ‘7’ into the carry flag.

 

JC Alert

; If the carry is set a ‘1’ has been shifted right through AL,
; indicating an alert.

 

Loop Repeat

; Decrement CX and repeat until the whole target has been scanned.

Done:

 

; Target is clean

 

 

 

Alert:

 

; Search the templates to see if the preceding 8 bytes match any template. If so raise the alarm, otherwise continue the search.

 

I have written this scrap of code in 8086 language. It could probably be shortened by one or two instructions using a more modern processor but even this version only has eight instructions in the critical inner loop, which is executed once for every byte of the target.  Despite this it could be searching for up to 10,000 templates, and is starting a new search on every pass.

At the time this algorithm gave a useful improvement in the performance of VET.  Its biggest weakness was that all the templates had to be the same length. Polysearch could readily be modified to use shorter templates, but this greatly increased the probability of alerts. Also, as viruses became more complex, it became increasingly desirable to include wild cards in the templates, and I never managed to devise a way for Polysearch to handle these.

The virus writers were also hard at work, and introduced first encrypted viruses, then polymorphic viruses, and then extremely polymorphic viruses.  The encrypted viruses were not too much of a problem, as we could usually extract a search string from the encryption algorithm, and for the moderately polymorphic viruses we could usually extract a search string for each alternative, but the extremely polymorphic viruses could take thousands of alternative forms, and we had to develop new techniques to deal with them.

Unfortunately these factors meant that, for all its elegance, the Polysearch algorithm proved to be of little long-term value to the industry.

 

1.  "Polysearch: An Extremely Fast Parallel Search Algorithm". Roger Riordan. pp 631-640. Proceedings of the Fifth International Computer Virus & Security Conference. 1992.

Roger Riordan. 17.3.2008

 © Roger Riordan 2004-2017

The Polysearch Algorithm