#! /usr/bin/env perl
# Copyright 2005-2020 The OpenSSL Project Authors. All Rights Reserved.
#
# Licensed under the OpenSSL license (the "License").  You may not use
# this file except in compliance with the License.  You can obtain a copy
# in the file LICENSE in the source distribution or at
# https://www.openssl.org/source/license.html

#
# ====================================================================
# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
# project. The module is, however, dual licensed under OpenSSL and
# CRYPTOGAMS licenses depending on where you obtain it. For further
# details see http://www.openssl.org/~appro/cryptogams/.
# ====================================================================
#
# July 2004
#
# 2.22x RC4 tune-up:-) It should be noted though that my hand [as in
# "hand-coded assembler"] doesn't stand for the whole improvement
# coefficient. It turned out that eliminating RC4_CHAR from config
# line results in ~40% improvement (yes, even for C implementation).
# Presumably it has everything to do with AMD cache architecture and
# RAW or whatever penalties. Once again! The module *requires* config
# line *without* RC4_CHAR! As for coding "secret," I bet on partial
# register arithmetics. For example instead of 'inc %r8; and $255,%r8'
# I simply 'inc %r8b'. Even though optimization manual discourages
# to operate on partial registers, it turned out to be the best bet.
# At least for AMD... How IA32E would perform remains to be seen...

# November 2004
#
# As was shown by Marc Bevand reordering of couple of load operations
# results in even higher performance gain of 3.3x:-) At least on
# Opteron... For reference, 1x in this case is RC4_CHAR C-code
# compiled with gcc 3.3.2, which performs at ~54MBps per 1GHz clock.
# Latter means that if you want to *estimate* what to expect from
# *your* Opteron, then multiply 54 by 3.3 and clock frequency in GHz.

# November 2004
#
# Intel P4 EM64T core was found to run the AMD64 code really slow...
# The only way to achieve comparable performance on P4 was to keep
# RC4_CHAR. Kind of ironic, huh? As it's apparently impossible to
# compose blended code, which would perform even within 30% marginal
# on either AMD and Intel platforms, I implement both cases. See
# rc4_skey.c for further details...

# April 2005
#
# P4 EM64T core appears to be "allergic" to 64-bit inc/dec. Replacing
# those with add/sub results in 50% performance improvement of folded
# loop...

# May 2005
#
# As was shown by Zou Nanhai loop unrolling can improve Intel EM64T
# performance by >30% [unlike P4 32-bit case that is]. But this is
# provided that loads are reordered even more aggressively! Both code
# paths, AMD64 and EM64T, reorder loads in essentially same manner
# as my IA-64 implementation. On Opteron this resulted in modest 5%
# improvement [I had to test it], while final Intel P4 performance
# achieves respectful 432MBps on 2.8GHz processor now. For reference.
# If executed on Xeon, current RC4_CHAR code-path is 2.7x faster than
# RC4_INT code-path. While if executed on Opteron, it's only 25%
# slower than the RC4_INT one [meaning that if CPU µ-arch detection
# is not implemented, then this final RC4_CHAR code-path should be
# preferred, as it provides better *all-round* performance].

# March 2007
#
# Intel Core2 was observed to perform poorly on both code paths:-( It
# apparently suffers from some kind of partial register stall, which
# occurs in 64-bit mode only [as virtually identical 32-bit loop was
# observed to outperform 64-bit one by almost 50%]. Adding two movzb to
# cloop1 boosts its performance by 80%! This loop appears to be optimal
# fit for Core2 and therefore the code was modified to skip cloop8 on
# this CPU.

# May 2010
#
# Intel Westmere was observed to perform suboptimally. Adding yet
# another movzb to cloop1 improved performance by almost 50%! Core2
# performance is improved too, but nominally...

# May 2011
#
# The only code path that was not modified is P4-specific one. Non-P4
# Intel code path optimization is heavily based on submission by Maxim
# Perminov, Maxim Locktyukhin and Jim Guilford of Intel. I've used
# some of the ideas even in attempt to optimize the original RC4_INT
# code path... Current performance in cycles per processed byte (less
# is better) and improvement coefficients relative to previous
# version of this module are:
#
# Opteron	5.3/+0%(*)
# P4		6.5
# Core2		6.2/+15%(**)
# Westmere	4.2/+60%
# Sandy Bridge	4.2/+120%
# Atom		9.3/+80%
# VIA Nano	6.4/+4%
# Ivy Bridge	4.1/+30%
# Bulldozer	4.5/+30%(*)
#
# (*)	But corresponding loop has less instructions, which should have
#	positive effect on upcoming Bulldozer, which has one less ALU.
#	For reference, Intel code runs at 6.8 cpb rate on Opteron.
# (**)	Note that Core2 result is ~15% lower than corresponding result
#	for 32-bit code, meaning that it's possible to improve it,
#	but more than likely at the cost of the others (see rc4-586.pl
#	to get the idea)...

$flavour = shift;
$output  = shift;
if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }

$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/);

$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or
( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
die "can't locate x86_64-xlate.pl";

open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
*STDOUT=*OUT;

$dat="%rdi";	    # arg1
$len="%rsi";	    # arg2
$inp="%rdx";	    # arg3
$out="%rcx";	    # arg4

{
$code=<<___;
.text
.extern	OPENSSL_ia32cap_P

.globl	RC4
.type	RC4,\@function,4
.align	16
RC4:
.cfi_startproc
	or	$len,$len
	jne	.Lentry
	ret
.Lentry:
	push	%rbx
.cfi_push	%rbx
	push	%r12
.cfi_push	%r12
	push	%r13
.cfi_push	%r13
.Lprologue:
	mov	$len,%r11
	mov	$inp,%r12
	mov	$out,%r13
___
my $len="%r11";		# reassign input arguments
my $inp="%r12";
my $out="%r13";

my @XX=("%r10","%rsi");
my @TX=("%rax","%rbx");
my $YY="%rcx";
my $TY="%rdx";

$code.=<<___;
	xor	$XX[0],$XX[0]
	xor	$YY,$YY

	lea	8($dat),$dat
	mov	-8($dat),$XX[0]#b
	mov	-4($dat),$YY#b
	cmpl	\$-1,256($dat)
	je	.LRC4_CHAR
	mov	OPENSSL_ia32cap_P(%rip),%r8d
	xor	$TX[1],$TX[1]
	inc	$XX[0]#b
	sub	$XX[0],$TX[1]
	sub	$inp,$out
	movl	($dat,$XX[0],4),$TX[0]#d
	test	\$-16,$len
	jz	.Lloop1
	bt	\$30,%r8d	# Intel CPU?
	jc	.Lintel
	and	\$7,$TX[1]
	lea	1($XX[0]),$XX[1]
	jz	.Loop8
	sub	$TX[1],$len
.Loop8_warmup:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$TX[1]
	jnz	.Loop8_warmup

	lea	1($XX[0]),$XX[1]
	jmp	.Loop8
.align	16
.Loop8:
___
for ($i=0;$i<8;$i++) {
$code.=<<___ if ($i==7);
	add	\$8,$XX[1]#b
___
$code.=<<___;
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	`4*($i==7?-1:$i)`($dat,$XX[1],4),$TX[1]#d
	ror	\$8,%r8				# ror is redundant when $i=0
	movl	$TY#d,4*$i($dat,$XX[0],4)
	add	$TX[0]#b,$TY#b
	movb	($dat,$TY,4),%r8b
___
push(@TX,shift(@TX)); #push(@XX,shift(@XX));	# "rotate" registers
}
$code.=<<___;
	add	\$8,$XX[0]#b
	ror	\$8,%r8
	sub	\$8,$len

	xor	($inp),%r8
	mov	%r8,($out,$inp)
	lea	8($inp),$inp

	test	\$-8,$len
	jnz	.Loop8
	cmp	\$0,$len
	jne	.Lloop1
	jmp	.Lexit

.align	16
.Lintel:
	test	\$-32,$len
	jz	.Lloop1
	and	\$15,$TX[1]
	jz	.Loop16_is_hot
	sub	$TX[1],$len
.Loop16_warmup:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$TX[1]
	jnz	.Loop16_warmup

	mov	$YY,$TX[1]
	xor	$YY,$YY
	mov	$TX[1]#b,$YY#b

.Loop16_is_hot:
	lea	($dat,$XX[0],4),$XX[1]
___
sub RC4_loop {
  my $i=shift;
  my $j=$i<0?0:$i;
  my $xmm="%xmm".($j&1);

    $code.="	add	\$16,$XX[0]#b\n"		if ($i==15);
    $code.="	movdqu	($inp),%xmm2\n"			if ($i==15);
    $code.="	add	$TX[0]#b,$YY#b\n"		if ($i<=0);
    $code.="	movl	($dat,$YY,4),$TY#d\n";
    $code.="	pxor	%xmm0,%xmm2\n"			if ($i==0);
    $code.="	psllq	\$8,%xmm1\n"			if ($i==0);
    $code.="	pxor	$xmm,$xmm\n"			if ($i<=1);
    $code.="	movl	$TX[0]#d,($dat,$YY,4)\n";
    $code.="	add	$TY#b,$TX[0]#b\n";
    $code.="	movl	`4*($j+1)`($XX[1]),$TX[1]#d\n"	if ($i<15);
    $code.="	movz	$TX[0]#b,$TX[0]#d\n";
    $code.="	movl	$TY#d,4*$j($XX[1])\n";
    $code.="	pxor	%xmm1,%xmm2\n"			if ($i==0);
    $code.="	lea	($dat,$XX[0],4),$XX[1]\n"	if ($i==15);
    $code.="	add	$TX[1]#b,$YY#b\n"		if ($i<15);
    $code.="	pinsrw	\$`($j>>1)&7`,($dat,$TX[0],4),$xmm\n";
    $code.="	movdqu	%xmm2,($out,$inp)\n"		if ($i==0);
    $code.="	lea	16($inp),$inp\n"		if ($i==0);
    $code.="	movl	($XX[1]),$TX[1]#d\n"		if ($i==15);
}
	RC4_loop(-1);
$code.=<<___;
	jmp	.Loop16_enter
.align	16
.Loop16:
___

for ($i=0;$i<16;$i++) {
    $code.=".Loop16_enter:\n"		if ($i==1);
	RC4_loop($i);
	push(@TX,shift(@TX));		# "rotate" registers
}
$code.=<<___;
	mov	$YY,$TX[1]
	xor	$YY,$YY			# keyword to partial register
	sub	\$16,$len
	mov	$TX[1]#b,$YY#b
	test	\$-16,$len
	jnz	.Loop16

	psllq	\$8,%xmm1
	pxor	%xmm0,%xmm2
	pxor	%xmm1,%xmm2
	movdqu	%xmm2,($out,$inp)
	lea	16($inp),$inp

	cmp	\$0,$len
	jne	.Lloop1
	jmp	.Lexit

.align	16
.Lloop1:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$len
	jnz	.Lloop1
	jmp	.Lexit

.align	16
.LRC4_CHAR:
	add	\$1,$XX[0]#b
	movzb	($dat,$XX[0]),$TX[0]#d
	test	\$-8,$len
	jz	.Lcloop1
	jmp	.Lcloop8
.align	16
.Lcloop8:
	mov	($inp),%r8d
	mov	4($inp),%r9d
___
# unroll 2x4-wise, because 64-bit rotates kill Intel P4...
for ($i=0;$i<4;$i++) {
$code.=<<___;
	add	$TX[0]#b,$YY#b
	lea	1($XX[0]),$XX[1]
	movzb	($dat,$YY),$TY#d
	movzb	$XX[1]#b,$XX[1]#d
	movzb	($dat,$XX[1]),$TX[1]#d
	movb	$TX[0]#b,($dat,$YY)
	cmp	$XX[1],$YY
	movb	$TY#b,($dat,$XX[0])
	jne	.Lcmov$i			# Intel cmov is sloooow...
	mov	$TX[0],$TX[1]
.Lcmov$i:
	add	$TX[0]#b,$TY#b
	xor	($dat,$TY),%r8b
	ror	\$8,%r8d
___
push(@TX,shift(@TX)); push(@XX,shift(@XX));	# "rotate" registers
}
for ($i=4;$i<8;$i++) {
$code.=<<___;
	add	$TX[0]#b,$YY#b
	lea	1($XX[0]),$XX[1]
	movzb	($dat,$YY),$TY#d
	movzb	$XX[1]#b,$XX[1]#d
	movzb	($dat,$XX[1]),$TX[1]#d
	movb	$TX[0]#b,($dat,$YY)
	cmp	$XX[1],$YY
	movb	$TY#b,($dat,$XX[0])
	jne	.Lcmov$i			# Intel cmov is sloooow...
	mov	$TX[0],$TX[1]
.Lcmov$i:
	add	$TX[0]#b,$TY#b
	xor	($dat,$TY),%r9b
	ror	\$8,%r9d
___
push(@TX,shift(@TX)); push(@XX,shift(@XX));	# "rotate" registers
}
$code.=<<___;
	lea	-8($len),$len
	mov	%r8d,($out)
	lea	8($inp),$inp
	mov	%r9d,4($out)
	lea	8($out),$out

	test	\$-8,$len
	jnz	.Lcloop8
	cmp	\$0,$len
	jne	.Lcloop1
	jmp	.Lexit
___
$code.=<<___;
.align	16
.Lcloop1:
	add	$TX[0]#b,$YY#b
	movzb	$YY#b,$YY#d
	movzb	($dat,$YY),$TY#d
	movb	$TX[0]#b,($dat,$YY)
	movb	$TY#b,($dat,$XX[0])
	add	$TX[0]#b,$TY#b
	add	\$1,$XX[0]#b
	movzb	$TY#b,$TY#d
	movzb	$XX[0]#b,$XX[0]#d
	movzb	($dat,$TY),$TY#d
	movzb	($dat,$XX[0]),$TX[0]#d
	xorb	($inp),$TY#b
	lea	1($inp),$inp
	movb	$TY#b,($out)
	lea	1($out),$out
	sub	\$1,$len
	jnz	.Lcloop1
	jmp	.Lexit

.align	16
.Lexit:
	sub	\$1,$XX[0]#b
	movl	$XX[0]#d,-8($dat)
	movl	$YY#d,-4($dat)

	mov	(%rsp),%r13
.cfi_restore	%r13
	mov	8(%rsp),%r12
.cfi_restore	%r12
	mov	16(%rsp),%rbx
.cfi_restore	%rbx
	add	\$24,%rsp
.cfi_adjust_cfa_offset	-24
.Lepilogue:
	ret
.cfi_endproc
.size	RC4,.-RC4
___
}

$idx="%r8";
$ido="%r9";

$code.=<<___;
.globl	RC4_set_key
.type	RC4_set_key,\@function,3
.align	16
RC4_set_key:
.cfi_startproc
	lea	8($dat),$dat
	lea	($inp,$len),$inp
	neg	$len
	mov	$len,%rcx
	xor	%eax,%eax
	xor	$ido,$ido
	xor	%r10,%r10
	xor	%r11,%r11

	mov	OPENSSL_ia32cap_P(%rip),$idx#d
	bt	\$20,$idx#d	# RC4_CHAR?
	jc	.Lc1stloop
	jmp	.Lw1stloop

.align	16
.Lw1stloop:
	mov	%eax,($dat,%rax,4)
	add	\$1,%al
	jnc	.Lw1stloop

	xor	$ido,$ido
	xor	$idx,$idx
.align	16
.Lw2ndloop:
	mov	($dat,$ido,4),%r10d
	add	($inp,$len,1),$idx#b
	add	%r10b,$idx#b
	add	\$1,$len
	mov	($dat,$idx,4),%r11d
	cmovz	%rcx,$len
	mov	%r10d,($dat,$idx,4)
	mov	%r11d,($dat,$ido,4)
	add	\$1,$ido#b
	jnc	.Lw2ndloop
	jmp	.Lexit_key

.align	16
.Lc1stloop:
	mov	%al,($dat,%rax)
	add	\$1,%al
	jnc	.Lc1stloop

	xor	$ido,$ido
	xor	$idx,$idx
.align	16
.Lc2ndloop:
	mov	($dat,$ido),%r10b
	add	($inp,$len),$idx#b
	add	%r10b,$idx#b
	add	\$1,$len
	mov	($dat,$idx),%r11b
	jnz	.Lcnowrap
	mov	%rcx,$len
.Lcnowrap:
	mov	%r10b,($dat,$idx)
	mov	%r11b,($dat,$ido)
	add	\$1,$ido#b
	jnc	.Lc2ndloop
	movl	\$-1,256($dat)

.align	16
.Lexit_key:
	xor	%eax,%eax
	mov	%eax,-8($dat)
	mov	%eax,-4($dat)
	ret
.cfi_endproc
.size	RC4_set_key,.-RC4_set_key

.globl	RC4_options
.type	RC4_options,\@abi-omnipotent
.align	16
RC4_options:
.cfi_startproc
	lea	.Lopts(%rip),%rax
	mov	OPENSSL_ia32cap_P(%rip),%edx
	bt	\$20,%edx
	jc	.L8xchar
	bt	\$30,%edx
	jnc	.Ldone
	add	\$25,%rax
	ret
.L8xchar:
	add	\$12,%rax
.Ldone:
	ret
.cfi_endproc
.align	64
.Lopts:
.asciz	"rc4(8x,int)"
.asciz	"rc4(8x,char)"
.asciz	"rc4(16x,int)"
.asciz	"RC4 for x86_64, CRYPTOGAMS by <appro\@openssl.org>"
.align	64
.size	RC4_options,.-RC4_options
___

# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
#		CONTEXT *context,DISPATCHER_CONTEXT *disp)
if ($win64) {
$rec="%rcx";
$frame="%rdx";
$context="%r8";
$disp="%r9";

$code.=<<___;
.extern	__imp_RtlVirtualUnwind
.type	stream_se_handler,\@abi-omnipotent
.align	16
stream_se_handler:
	push	%rsi
	push	%rdi
	push	%rbx
	push	%rbp
	push	%r12
	push	%r13
	push	%r14
	push	%r15
	pushfq
	sub	\$64,%rsp

	mov	120($context),%rax	# pull context->Rax
	mov	248($context),%rbx	# pull context->Rip

	lea	.Lprologue(%rip),%r10
	cmp	%r10,%rbx		# context->Rip<prologue label
	jb	.Lin_prologue

	mov	152($context),%rax	# pull context->Rsp

	lea	.Lepilogue(%rip),%r10
	cmp	%r10,%rbx		# context->Rip>=epilogue label
	jae	.Lin_prologue

	lea	24(%rax),%rax

	mov	-8(%rax),%rbx
	mov	-16(%rax),%r12
	mov	-24(%rax),%r13
	mov	%rbx,144($context)	# restore context->Rbx
	mov	%r12,216($context)	# restore context->R12
	mov	%r13,224($context)	# restore context->R13

.Lin_prologue:
	mov	8(%rax),%rdi
	mov	16(%rax),%rsi
	mov	%rax,152($context)	# restore context->Rsp
	mov	%rsi,168($context)	# restore context->Rsi
	mov	%rdi,176($context)	# restore context->Rdi

	jmp	.Lcommon_seh_exit
.size	stream_se_handler,.-stream_se_handler

.type	key_se_handler,\@abi-omnipotent
.align	16
key_se_handler:
	push	%rsi
	push	%rdi
	push	%rbx
	push	%rbp
	push	%r12
	push	%r13
	push	%r14
	push	%r15
	pushfq
	sub	\$64,%rsp

	mov	152($context),%rax	# pull context->Rsp
	mov	8(%rax),%rdi
	mov	16(%rax),%rsi
	mov	%rsi,168($context)	# restore context->Rsi
	mov	%rdi,176($context)	# restore context->Rdi

.Lcommon_seh_exit:

	mov	40($disp),%rdi		# disp->ContextRecord
	mov	$context,%rsi		# context
	mov	\$154,%ecx		# sizeof(CONTEXT)
	.long	0xa548f3fc		# cld; rep movsq

	mov	$disp,%rsi
	xor	%rcx,%rcx		# arg1, UNW_FLAG_NHANDLER
	mov	8(%rsi),%rdx		# arg2, disp->ImageBase
	mov	0(%rsi),%r8		# arg3, disp->ControlPc
	mov	16(%rsi),%r9		# arg4, disp->FunctionEntry
	mov	40(%rsi),%r10		# disp->ContextRecord
	lea	56(%rsi),%r11		# &disp->HandlerData
	lea	24(%rsi),%r12		# &disp->EstablisherFrame
	mov	%r10,32(%rsp)		# arg5
	mov	%r11,40(%rsp)		# arg6
	mov	%r12,48(%rsp)		# arg7
	mov	%rcx,56(%rsp)		# arg8, (NULL)
	call	*__imp_RtlVirtualUnwind(%rip)

	mov	\$1,%eax		# ExceptionContinueSearch
	add	\$64,%rsp
	popfq
	pop	%r15
	pop	%r14
	pop	%r13
	pop	%r12
	pop	%rbp
	pop	%rbx
	pop	%rdi
	pop	%rsi
	ret
.size	key_se_handler,.-key_se_handler

.section	.pdata
.align	4
	.rva	.LSEH_begin_RC4
	.rva	.LSEH_end_RC4
	.rva	.LSEH_info_RC4

	.rva	.LSEH_begin_RC4_set_key
	.rva	.LSEH_end_RC4_set_key
	.rva	.LSEH_info_RC4_set_key

.section	.xdata
.align	8
.LSEH_info_RC4:
	.byte	9,0,0,0
	.rva	stream_se_handler
.LSEH_info_RC4_set_key:
	.byte	9,0,0,0
	.rva	key_se_handler
___
}

sub reg_part {
my ($reg,$conv)=@_;
    if ($reg =~ /%r[0-9]+/)	{ $reg .= $conv; }
    elsif ($conv eq "b")	{ $reg =~ s/%[er]([^x]+)x?/%$1l/;	}
    elsif ($conv eq "w")	{ $reg =~ s/%[er](.+)/%$1/;	}
    elsif ($conv eq "d")	{ $reg =~ s/%[er](.+)/%e$1/;	}
    return $reg;
}

$code =~ s/(%[a-z0-9]+)#([bwd])/reg_part($1,$2)/gem;
$code =~ s/\`([^\`]*)\`/eval $1/gem;

print $code;

close STDOUT or die "error closing STDOUT: $!";