#! /usr/bin/env perl
# Copyright 2005-2020 The OpenSSL Project Authors. All Rights Reserved.
#
# Licensed under the OpenSSL license (the "License").  You may not use
# this file except in compliance with the License.  You can obtain a copy
# in the file LICENSE in the source distribution or at
# https://www.openssl.org/source/license.html

#
# ====================================================================
# Written by Andy Polyakov <appro@openssl.org> for the OpenSSL
# project. The module is, however, dual licensed under OpenSSL and
# CRYPTOGAMS licenses depending on where you obtain it. For further
# details see http://www.openssl.org/~appro/cryptogams/.
# ====================================================================
#
# July 2004
#
# 2.22x RC4 tune-up:-) It should be noted though that my hand [as in
# "hand-coded assembler"] doesn't account for the whole improvement
# coefficient. It turned out that eliminating RC4_CHAR from the config
# line results in ~40% improvement (yes, even for the C implementation).
# Presumably it has everything to do with AMD cache architecture and
# RAW or whatever penalties. Once again! The module *requires* a config
# line *without* RC4_CHAR! As for the coding "secret," I bet on partial
# register arithmetic. For example, instead of 'inc %r8; and $255,%r8'
# I simply 'inc %r8b'. Even though the optimization manual discourages
# operating on partial registers, it turned out to be the best bet.
# At least for AMD... How IA32E would perform remains to be seen...

# November 2004
#
# As was shown by Marc Bevand, reordering a couple of load operations
# results in an even higher performance gain of 3.3x:-) At least on
# Opteron... For reference, 1x in this case is RC4_CHAR C code
# compiled with gcc 3.3.2, which performs at ~54MBps per 1GHz clock.
# The latter means that if you want to *estimate* what to expect from
# *your* Opteron, then multiply 54 by 3.3 and clock frequency in GHz.
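#
# Worked example under an assumed clock (illustrative, not a measurement):
# a 2.0GHz Opteron would be estimated at 54 * 3.3 * 2.0 = ~356MBps.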

# November 2004
#
# Intel P4 EM64T core was found to run the AMD64 code really slowly...
# The only way to achieve comparable performance on P4 was to keep
# RC4_CHAR. Kind of ironic, huh? As it's apparently impossible to
# compose blended code that would perform even within a 30% margin
# on both AMD and Intel platforms, I implement both cases. See
# rc4_skey.c for further details...

# April 2005
#
# P4 EM64T core appears to be "allergic" to 64-bit inc/dec. Replacing
# those with add/sub results in 50% performance improvement of the
# folded loop...

# May 2005
#
# As was shown by Zou Nanhai, loop unrolling can improve Intel EM64T
# performance by >30% [unlike the P4 32-bit case, that is]. But this is
# provided that loads are reordered even more aggressively! Both code
# paths, AMD64 and EM64T, reorder loads in essentially the same manner
# as my IA-64 implementation. On Opteron this resulted in a modest 5%
# improvement [I had to test it], while final Intel P4 performance
# now reaches a respectable 432MBps on a 2.8GHz processor. For
# reference, if executed on Xeon, the current RC4_CHAR code path is
# 2.7x faster than the RC4_INT code path, while if executed on Opteron,
# it's only 25% slower than the RC4_INT one [meaning that if CPU µ-arch
# detection is not implemented, then this final RC4_CHAR code path
# should be preferred, as it provides better *all-round* performance].

# March 2007
#
# Intel Core2 was observed to perform poorly on both code paths:-( It
# apparently suffers from some kind of partial register stall, which
# occurs in 64-bit mode only [as a virtually identical 32-bit loop was
# observed to outperform the 64-bit one by almost 50%]. Adding two movzb
# to cloop1 boosts its performance by 80%! This loop appears to be the
# optimal fit for Core2 and therefore the code was modified to skip
# cloop8 on this CPU.

# May 2010
#
# Intel Westmere was observed to perform suboptimally. Adding yet
# another movzb to cloop1 improved performance by almost 50%! Core2
# performance is improved too, but nominally...

# May 2011
#
# The only code path that was not modified is the P4-specific one. Non-P4
# Intel code path optimization is heavily based on a submission by Maxim
# Perminov, Maxim Locktyukhin and Jim Guilford of Intel. I've used
# some of the ideas even in an attempt to optimize the original RC4_INT
# code path... Current performance in cycles per processed byte (less
# is better) and improvement coefficients relative to the previous
# version of this module are:
#
# Opteron	5.3/+0%(*)
# P4		6.5
# Core2		6.2/+15%(**)
# Westmere	4.2/+60%
# Sandy Bridge	4.2/+120%
# Atom		9.3/+80%
# VIA Nano	6.4/+4%
# Ivy Bridge	4.1/+30%
# Bulldozer	4.5/+30%(*)
#
# (*)	But the corresponding loop has fewer instructions, which should
#	have a positive effect on upcoming Bulldozer, which has one less
#	ALU. For reference, Intel code runs at 6.8 cpb rate on Opteron.
# (**)	Note that the Core2 result is ~15% lower than the corresponding
#	result for 32-bit code, meaning that it's possible to improve it,
#	but more than likely at the cost of the others (see rc4-586.pl
#	to get the idea)...

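# Reference sketch, added for orientation only: it is not part of the
# original module and is never called by the generator. It shows, in
# plain Perl, the byte-at-a-time keystream step that every unrolled
# assembly loop below computes. The name rc4_reference_sketch and its
# argument layout (state array reference, x, y, input string) are
# hypothetical and chosen purely for illustration.
sub rc4_reference_sketch {
    my ($S, $x, $y, $data) = @_;	# $S: reference to 256-entry state
    my $out = "";
    foreach my $byte (unpack("C*", $data)) {
	$x = ($x + 1) & 0xff;		# x wraps at 256, cf. 'inc %r10b'
	my $tx = $S->[$x];
	$y = ($y + $tx) & 0xff;		# y wraps at 256 as well
	my $ty = $S->[$y];
	@$S[$x, $y] = ($ty, $tx);	# swap S[x] and S[y]
	$out .= chr($byte ^ $S->[($tx + $ty) & 0xff]);
    }
    return ($out, $x, $y);		# output plus updated indices
}
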
$flavour = shift;
$output  = shift;
if ($flavour =~ /\./) { $output = $flavour; undef $flavour; }

$win64=0; $win64=1 if ($flavour =~ /[nm]asm|mingw64/ || $output =~ /\.asm$/);

$0 =~ m/(.*[\/\\])[^\/\\]+$/; $dir=$1;
( $xlate="${dir}x86_64-xlate.pl" and -f $xlate ) or
( $xlate="${dir}../../perlasm/x86_64-xlate.pl" and -f $xlate) or
die "can't locate x86_64-xlate.pl";

open OUT,"| \"$^X\" \"$xlate\" $flavour \"$output\"";
*STDOUT=*OUT;

$dat="%rdi";	    # arg1
$len="%rsi";	    # arg2
$inp="%rdx";	    # arg3
$out="%rcx";	    # arg4

{
$code=<<___;
.text
.extern	OPENSSL_ia32cap_P

.globl	RC4
.type	RC4,\@function,4
.align	16
RC4:
.cfi_startproc
	or	$len,$len
	jne	.Lentry
	ret
.Lentry:
	push	%rbx
.cfi_push	%rbx
	push	%r12
.cfi_push	%r12
	push	%r13
.cfi_push	%r13
.Lprologue:
	mov	$len,%r11
	mov	$inp,%r12
	mov	$out,%r13
___
my $len="%r11";		# reassign input arguments
my $inp="%r12";
my $out="%r13";

my @XX=("%r10","%rsi");
my @TX=("%rax","%rbx");
my $YY="%rcx";
my $TY="%rdx";

$code.=<<___;
	xor	$XX[0],$XX[0]
	xor	$YY,$YY

	lea	8($dat),$dat
	mov	-8($dat),$XX[0]#b
	mov	-4($dat),$YY#b
	cmpl	\$-1,256($dat)
	je	.LRC4_CHAR
	mov	OPENSSL_ia32cap_P(%rip),%r8d
	xor	$TX[1],$TX[1]
	inc	$XX[0]#b
	sub	$XX[0],$TX[1]
	sub	$inp,$out
	movl	($dat,$XX[0],4),$TX[0]#d
	test	\$-16,$len
	jz	.Lloop1
	bt	\$30,%r8d	# Intel CPU?
	jc	.Lintel
	and	\$7,$TX[1]
	lea	1($XX[0]),$XX[1]
	jz	.Loop8
	sub	$TX[1],$len
.Loop8_warmup:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$TX[1]
	jnz	.Loop8_warmup

	lea	1($XX[0]),$XX[1]
	jmp	.Loop8
.align	16
.Loop8:
___
for ($i=0;$i<8;$i++) {
$code.=<<___ if ($i==7);
	add	\$8,$XX[1]#b
___
$code.=<<___;
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	`4*($i==7?-1:$i)`($dat,$XX[1],4),$TX[1]#d
	ror	\$8,%r8				# ror is redundant when $i=0
	movl	$TY#d,4*$i($dat,$XX[0],4)
	add	$TX[0]#b,$TY#b
	movb	($dat,$TY,4),%r8b
___
push(@TX,shift(@TX)); #push(@XX,shift(@XX));	# "rotate" registers
}
$code.=<<___;
	add	\$8,$XX[0]#b
	ror	\$8,%r8
	sub	\$8,$len

	xor	($inp),%r8
	mov	%r8,($out,$inp)
	lea	8($inp),$inp

	test	\$-8,$len
	jnz	.Loop8
	cmp	\$0,$len
	jne	.Lloop1
	jmp	.Lexit

.align	16
.Lintel:
	test	\$-32,$len
	jz	.Lloop1
	and	\$15,$TX[1]
	jz	.Loop16_is_hot
	sub	$TX[1],$len
.Loop16_warmup:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$TX[1]
	jnz	.Loop16_warmup

	mov	$YY,$TX[1]
	xor	$YY,$YY
	mov	$TX[1]#b,$YY#b

.Loop16_is_hot:
	lea	($dat,$XX[0],4),$XX[1]
___
sub RC4_loop {
  my $i=shift;
  my $j=$i<0?0:$i;
  my $xmm="%xmm".($j&1);

    $code.="	add	\$16,$XX[0]#b\n"		if ($i==15);
    $code.="	movdqu	($inp),%xmm2\n"			if ($i==15);
    $code.="	add	$TX[0]#b,$YY#b\n"		if ($i<=0);
    $code.="	movl	($dat,$YY,4),$TY#d\n";
    $code.="	pxor	%xmm0,%xmm2\n"			if ($i==0);
    $code.="	psllq	\$8,%xmm1\n"			if ($i==0);
    $code.="	pxor	$xmm,$xmm\n"			if ($i<=1);
    $code.="	movl	$TX[0]#d,($dat,$YY,4)\n";
    $code.="	add	$TY#b,$TX[0]#b\n";
    $code.="	movl	`4*($j+1)`($XX[1]),$TX[1]#d\n"	if ($i<15);
    $code.="	movz	$TX[0]#b,$TX[0]#d\n";
    $code.="	movl	$TY#d,4*$j($XX[1])\n";
    $code.="	pxor	%xmm1,%xmm2\n"			if ($i==0);
    $code.="	lea	($dat,$XX[0],4),$XX[1]\n"	if ($i==15);
    $code.="	add	$TX[1]#b,$YY#b\n"		if ($i<15);
    $code.="	pinsrw	\$`($j>>1)&7`,($dat,$TX[0],4),$xmm\n";
    $code.="	movdqu	%xmm2,($out,$inp)\n"		if ($i==0);
    $code.="	lea	16($inp),$inp\n"		if ($i==0);
    $code.="	movl	($XX[1]),$TX[1]#d\n"		if ($i==15);
}
	RC4_loop(-1);
$code.=<<___;
	jmp	.Loop16_enter
.align	16
.Loop16:
___

for ($i=0;$i<16;$i++) {
    $code.=".Loop16_enter:\n"		if ($i==1);
	RC4_loop($i);
	push(@TX,shift(@TX)); 		# "rotate" registers
}
$code.=<<___;
	mov	$YY,$TX[1]
	xor	$YY,$YY			# keyword to partial register
	sub	\$16,$len
	mov	$TX[1]#b,$YY#b
	test	\$-16,$len
	jnz	.Loop16

	psllq	\$8,%xmm1
	pxor	%xmm0,%xmm2
	pxor	%xmm1,%xmm2
	movdqu	%xmm2,($out,$inp)
	lea	16($inp),$inp

	cmp	\$0,$len
	jne	.Lloop1
	jmp	.Lexit

.align	16
.Lloop1:
	add	$TX[0]#b,$YY#b
	movl	($dat,$YY,4),$TY#d
	movl	$TX[0]#d,($dat,$YY,4)
	movl	$TY#d,($dat,$XX[0],4)
	add	$TY#b,$TX[0]#b
	inc	$XX[0]#b
	movl	($dat,$TX[0],4),$TY#d
	movl	($dat,$XX[0],4),$TX[0]#d
	xorb	($inp),$TY#b
	movb	$TY#b,($out,$inp)
	lea	1($inp),$inp
	dec	$len
	jnz	.Lloop1
	jmp	.Lexit

.align	16
.LRC4_CHAR:
	add	\$1,$XX[0]#b
	movzb	($dat,$XX[0]),$TX[0]#d
	test	\$-8,$len
	jz	.Lcloop1
	jmp	.Lcloop8
.align	16
.Lcloop8:
	mov	($inp),%r8d
	mov	4($inp),%r9d
___
# unroll 2x4-wise, because 64-bit rotates kill Intel P4...
for ($i=0;$i<4;$i++) {
$code.=<<___;
	add	$TX[0]#b,$YY#b
	lea	1($XX[0]),$XX[1]
	movzb	($dat,$YY),$TY#d
	movzb	$XX[1]#b,$XX[1]#d
	movzb	($dat,$XX[1]),$TX[1]#d
	movb	$TX[0]#b,($dat,$YY)
	cmp	$XX[1],$YY
	movb	$TY#b,($dat,$XX[0])
	jne	.Lcmov$i			# Intel cmov is sloooow...
	mov	$TX[0],$TX[1]
.Lcmov$i:
	add	$TX[0]#b,$TY#b
	xor	($dat,$TY),%r8b
	ror	\$8,%r8d
___
push(@TX,shift(@TX)); push(@XX,shift(@XX));	# "rotate" registers
}
for ($i=4;$i<8;$i++) {
$code.=<<___;
	add	$TX[0]#b,$YY#b
	lea	1($XX[0]),$XX[1]
	movzb	($dat,$YY),$TY#d
	movzb	$XX[1]#b,$XX[1]#d
	movzb	($dat,$XX[1]),$TX[1]#d
	movb	$TX[0]#b,($dat,$YY)
	cmp	$XX[1],$YY
	movb	$TY#b,($dat,$XX[0])
	jne	.Lcmov$i			# Intel cmov is sloooow...
	mov	$TX[0],$TX[1]
.Lcmov$i:
	add	$TX[0]#b,$TY#b
	xor	($dat,$TY),%r9b
	ror	\$8,%r9d
___
push(@TX,shift(@TX)); push(@XX,shift(@XX));	# "rotate" registers
}
$code.=<<___;
	lea	-8($len),$len
	mov	%r8d,($out)
	lea	8($inp),$inp
	mov	%r9d,4($out)
	lea	8($out),$out

	test	\$-8,$len
	jnz	.Lcloop8
	cmp	\$0,$len
	jne	.Lcloop1
	jmp	.Lexit
___
$code.=<<___;
.align	16
.Lcloop1:
	add	$TX[0]#b,$YY#b
	movzb	$YY#b,$YY#d
	movzb	($dat,$YY),$TY#d
	movb	$TX[0]#b,($dat,$YY)
	movb	$TY#b,($dat,$XX[0])
	add	$TX[0]#b,$TY#b
	add	\$1,$XX[0]#b
	movzb	$TY#b,$TY#d
	movzb	$XX[0]#b,$XX[0]#d
	movzb	($dat,$TY),$TY#d
	movzb	($dat,$XX[0]),$TX[0]#d
	xorb	($inp),$TY#b
	lea	1($inp),$inp
	movb	$TY#b,($out)
	lea	1($out),$out
	sub	\$1,$len
	jnz	.Lcloop1
	jmp	.Lexit

.align	16
.Lexit:
	sub	\$1,$XX[0]#b
	movl	$XX[0]#d,-8($dat)
	movl	$YY#d,-4($dat)

	mov	(%rsp),%r13
.cfi_restore	%r13
	mov	8(%rsp),%r12
.cfi_restore	%r12
	mov	16(%rsp),%rbx
.cfi_restore	%rbx
	add	\$24,%rsp
.cfi_adjust_cfa_offset	-24
.Lepilogue:
	ret
.cfi_endproc
.size	RC4,.-RC4
___
}

$idx="%r8";
$ido="%r9";

$code.=<<___;
.globl	RC4_set_key
.type	RC4_set_key,\@function,3
.align	16
RC4_set_key:
.cfi_startproc
	lea	8($dat),$dat
	lea	($inp,$len),$inp
	neg	$len
	mov	$len,%rcx
	xor	%eax,%eax
	xor	$ido,$ido
	xor	%r10,%r10
	xor	%r11,%r11

	mov	OPENSSL_ia32cap_P(%rip),$idx#d
	bt	\$20,$idx#d	# RC4_CHAR?
	jc	.Lc1stloop
	jmp	.Lw1stloop

.align	16
.Lw1stloop:
	mov	%eax,($dat,%rax,4)
	add	\$1,%al
	jnc	.Lw1stloop

	xor	$ido,$ido
	xor	$idx,$idx
.align	16
.Lw2ndloop:
	mov	($dat,$ido,4),%r10d
	add	($inp,$len,1),$idx#b
	add	%r10b,$idx#b
	add	\$1,$len
	mov	($dat,$idx,4),%r11d
	cmovz	%rcx,$len
	mov	%r10d,($dat,$idx,4)
	mov	%r11d,($dat,$ido,4)
	add	\$1,$ido#b
	jnc	.Lw2ndloop
	jmp	.Lexit_key

.align	16
.Lc1stloop:
	mov	%al,($dat,%rax)
	add	\$1,%al
	jnc	.Lc1stloop

	xor	$ido,$ido
	xor	$idx,$idx
.align	16
.Lc2ndloop:
	mov	($dat,$ido),%r10b
	add	($inp,$len),$idx#b
	add	%r10b,$idx#b
	add	\$1,$len
	mov	($dat,$idx),%r11b
	jnz	.Lcnowrap
	mov	%rcx,$len
.Lcnowrap:
	mov	%r10b,($dat,$idx)
	mov	%r11b,($dat,$ido)
	add	\$1,$ido#b
	jnc	.Lc2ndloop
	movl	\$-1,256($dat)

.align	16
.Lexit_key:
	xor	%eax,%eax
	mov	%eax,-8($dat)
	mov	%eax,-4($dat)
	ret
.cfi_endproc
.size	RC4_set_key,.-RC4_set_key

.globl	RC4_options
.type	RC4_options,\@abi-omnipotent
.align	16
RC4_options:
.cfi_startproc
	lea	.Lopts(%rip),%rax
	mov	OPENSSL_ia32cap_P(%rip),%edx
	bt	\$20,%edx
	jc	.L8xchar
	bt	\$30,%edx
	jnc	.Ldone
	add	\$25,%rax
	ret
.L8xchar:
	add	\$12,%rax
.Ldone:
	ret
.cfi_endproc
.align	64
.Lopts:
.asciz	"rc4(8x,int)"
.asciz	"rc4(8x,char)"
.asciz	"rc4(16x,int)"
.asciz	"RC4 for x86_64, CRYPTOGAMS by <appro\@openssl.org>"
.align	64
.size	RC4_options,.-RC4_options
___

# EXCEPTION_DISPOSITION handler (EXCEPTION_RECORD *rec,ULONG64 frame,
#		CONTEXT *context,DISPATCHER_CONTEXT *disp)
if ($win64) {
$rec="%rcx";
$frame="%rdx";
$context="%r8";
$disp="%r9";

$code.=<<___;
.extern	__imp_RtlVirtualUnwind
.type	stream_se_handler,\@abi-omnipotent
.align	16
stream_se_handler:
	push	%rsi
	push	%rdi
	push	%rbx
	push	%rbp
	push	%r12
	push	%r13
	push	%r14
	push	%r15
	pushfq
	sub	\$64,%rsp

	mov	120($context),%rax	# pull context->Rax
	mov	248($context),%rbx	# pull context->Rip

	lea	.Lprologue(%rip),%r10
	cmp	%r10,%rbx		# context->Rip<prologue label
	jb	.Lin_prologue

	mov	152($context),%rax	# pull context->Rsp

	lea	.Lepilogue(%rip),%r10
	cmp	%r10,%rbx		# context->Rip>=epilogue label
	jae	.Lin_prologue

	lea	24(%rax),%rax

	mov	-8(%rax),%rbx
	mov	-16(%rax),%r12
	mov	-24(%rax),%r13
	mov	%rbx,144($context)	# restore context->Rbx
	mov	%r12,216($context)	# restore context->R12
	mov	%r13,224($context)	# restore context->R13

.Lin_prologue:
	mov	8(%rax),%rdi
	mov	16(%rax),%rsi
	mov	%rax,152($context)	# restore context->Rsp
	mov	%rsi,168($context)	# restore context->Rsi
	mov	%rdi,176($context)	# restore context->Rdi

	jmp	.Lcommon_seh_exit
.size	stream_se_handler,.-stream_se_handler

.type	key_se_handler,\@abi-omnipotent
.align	16
key_se_handler:
	push	%rsi
	push	%rdi
	push	%rbx
	push	%rbp
	push	%r12
	push	%r13
	push	%r14
	push	%r15
	pushfq
	sub	\$64,%rsp

	mov	152($context),%rax	# pull context->Rsp
	mov	8(%rax),%rdi
	mov	16(%rax),%rsi
	mov	%rsi,168($context)	# restore context->Rsi
	mov	%rdi,176($context)	# restore context->Rdi

.Lcommon_seh_exit:

	mov	40($disp),%rdi		# disp->ContextRecord
	mov	$context,%rsi		# context
	mov	\$154,%ecx		# sizeof(CONTEXT)
	.long	0xa548f3fc		# cld; rep movsq

	mov	$disp,%rsi
	xor	%rcx,%rcx		# arg1, UNW_FLAG_NHANDLER
	mov	8(%rsi),%rdx		# arg2, disp->ImageBase
	mov	0(%rsi),%r8		# arg3, disp->ControlPc
	mov	16(%rsi),%r9		# arg4, disp->FunctionEntry
	mov	40(%rsi),%r10		# disp->ContextRecord
	lea	56(%rsi),%r11		# &disp->HandlerData
	lea	24(%rsi),%r12		# &disp->EstablisherFrame
	mov	%r10,32(%rsp)		# arg5
	mov	%r11,40(%rsp)		# arg6
	mov	%r12,48(%rsp)		# arg7
	mov	%rcx,56(%rsp)		# arg8, (NULL)
	call	*__imp_RtlVirtualUnwind(%rip)

	mov	\$1,%eax		# ExceptionContinueSearch
	add	\$64,%rsp
	popfq
	pop	%r15
	pop	%r14
	pop	%r13
	pop	%r12
	pop	%rbp
	pop	%rbx
	pop	%rdi
	pop	%rsi
	ret
.size	key_se_handler,.-key_se_handler

.section	.pdata
.align	4
	.rva	.LSEH_begin_RC4
	.rva	.LSEH_end_RC4
	.rva	.LSEH_info_RC4

	.rva	.LSEH_begin_RC4_set_key
	.rva	.LSEH_end_RC4_set_key
	.rva	.LSEH_info_RC4_set_key

.section	.xdata
.align	8
.LSEH_info_RC4:
	.byte	9,0,0,0
	.rva	stream_se_handler
.LSEH_info_RC4_set_key:
	.byte	9,0,0,0
	.rva	key_se_handler
___
}

sub reg_part {
my ($reg,$conv)=@_;
    if ($reg =~ /%r[0-9]+/)	{ $reg .= $conv; }
    elsif ($conv eq "b")	{ $reg =~ s/%[er]([^x]+)x?/%$1l/;	}
    elsif ($conv eq "w")	{ $reg =~ s/%[er](.+)/%$1/;		}
    elsif ($conv eq "d")	{ $reg =~ s/%[er](.+)/%e$1/;		}
    return $reg;
}

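# Illustrative expansions of the "#b", "#w" and "#d" register suffixes
# used throughout the code above (examples added for orientation,
# derived from the rules in reg_part()):
#
#	%rax#b -> %al		%rax#d -> %eax		%rbx#w -> %bx
#	%rsi#b -> %sil		%rdx#d -> %edx		%r10#b -> %r10b
#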
$code =~ s/(%[a-z0-9]+)#([bwd])/reg_part($1,$2)/gem;
$code =~ s/\`([^\`]*)\`/eval $1/gem;

print $code;

close STDOUT or die "error closing STDOUT: $!";