History log of /linux/drivers/accel/habanalabs/common/device.c (Results 1 – 25 of 62)
Revision Date Author Comments
# fa58b594 12-Feb-2024 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: modify pci health check

Today we read PCI VENDOR-ID in order to make sure PCI link is
healthy. Apparently the VENDOR-ID might be stored on host and
hence, when we read it we might

accel/habanalabs: modify pci health check

Today we read PCI VENDOR-ID in order to make sure PCI link is
healthy. Apparently the VENDOR-ID might be stored on host and
hence, when we read it we might not access the PCI bus.
In order to make sure PCI health check is reliable, we will start
checking the DEVICE-ID instead.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 731d320e 01-Jan-2024 Dani Liberman <dliberman@habana.ai>

accel/habanalabs: remove call to deprecated function

In newer kernel versions, irq_set_affinity_hint() is deprecated.
Instead, use the newer version which is irq_set_affinity_and_hint().

Signed-off

accel/habanalabs: remove call to deprecated function

In newer kernel versions, irq_set_affinity_hint() is deprecated.
Instead, use the newer version which is irq_set_affinity_and_hint().

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 246d8b6c 24-Dec-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: abort device reset for consecutive heartbeat failures

The mechanism of aborting device reset for consecutive fatal errors is
currently only for fatal errors that are reported by FW

accel/habanalabs: abort device reset for consecutive heartbeat failures

The mechanism of aborting device reset for consecutive fatal errors is
currently only for fatal errors that are reported by FW.
A non-responsive FW and consecutive heartbeat failures is also
considered fatal, so add them as well to this mechanism to avoid
recurring device reset in such a case.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# d0df8a35 14-Dec-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: fix DRAM BAR base address calculation

When the DRAM region size in the BAR is not a power of 2, calculating
the corresponding BAR base address should be done using the offset from

accel/habanalabs: fix DRAM BAR base address calculation

When the DRAM region size in the BAR is not a power of 2, calculating
the corresponding BAR base address should be done using the offset from
the DRAM start address, and not using directly the DRAM address.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# e91c37f1 21-Sep-2023 Dani Liberman <dliberman@habana.ai>

accel/habanalabs/gaudi2: add interrupt affinity for user interrupts

User interrupts are MSIx interrupts coming from Gaudi2, that have
specific range of IDs and are assigned to the sole use of the us

accel/habanalabs/gaudi2: add interrupt affinity for user interrupts

User interrupts are MSIx interrupts coming from Gaudi2, that have
specific range of IDs and are assigned to the sole use of the user
process that opened the Gaudi2 device (reminder: there can be only
a single user process running on Gaudi2 at any given time).

The interrupts are allocated and managed by the driver and therefore,
the user expects the driver to initialize them properly, which also
includes setting the affinity to the related CPU cores of the
device's NUMA node to get maximum performance.

Signed-off-by: Dani Liberman <dliberman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 4b0b1fbc 20-Jul-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: set hard reset flag if graceful reset is skipped

hl_device_cond_reset() might be called with the hard reset flag unset,
because a compute reset upon device release as part of a gra

accel/habanalabs: set hard reset flag if graceful reset is skipped

hl_device_cond_reset() might be called with the hard reset flag unset,
because a compute reset upon device release as part of a graceful reset
is valid.
If the conditions for graceful reset are not met, hl_device_reset() will
be called for an immediate reset. In this case a compute reset is not
valid, so it will be replaced with a hard reset together with a debug
message about it.
This message might be confusing, as it implies that a compute reset was
requested when it shouldn't. To prevent this confusion, set the hard
reset flag in hl_device_cond_reset() if going to an immediate reset.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# d1958dce 31-Oct-2023 Farah Kassabri <fkassabri@habana.ai>

accel/habanalabs: fix EQ heartbeat mechanism

Stop rescheduling another heartbeat check when EQ heartbeat check fails
as it generates confusing logs in dmesg that the heartbeat fails.

Signed-off-by:

accel/habanalabs: fix EQ heartbeat mechanism

Stop rescheduling another heartbeat check when EQ heartbeat check fails
as it generates confusing logs in dmesg that the heartbeat fails.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 42422993 30-Oct-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: add support for Gaudi2C device

Gaudi2 with PCI revision ID with the value of '3' represents Gaudi2C
device and should be detected and initialized as Gaudi2.

Signed-off-by: Oded Ga

accel/habanalabs: add support for Gaudi2C device

Gaudi2 with PCI revision ID with the value of '3' represents Gaudi2C
device and should be detected and initialized as Gaudi2.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# e8bc0c1b 29-Oct-2023 Farah Kassabri <fkassabri@habana.ai>

accel/habanalabs: add log when eq event is not received

Add error log when no eq event is received from FW,
to cover a scenario when FW is stuck for some reason.
In such case driver will not receive

accel/habanalabs: add log when eq event is not received

Add error log when no eq event is received from FW,
to cover a scenario when FW is stuck for some reason.
In such case driver will not receive neither the eq error interrupt
or the eq heartbeat event, and will just initiate a reset without
indication in the dmesg about the reason.

Signed-off-by: Farah Kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 3652117f 22-Nov-2023 Christian Brauner <brauner@kernel.org>

eventfd: simplify eventfd_signal()

Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed

eventfd: simplify eventfd_signal()

Ever since the eventfd type was introduced back in 2007 in commit
e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org
Acked-by: Xu Yilun <yilun.xu@intel.com>
Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl
Acked-by: Eric Farman <farman@linux.ibm.com> # s390
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>

show more ...


# 6fc69ca8 21-Sep-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: print device name when it is removed

Notifies the user which device was removed. It is important in
a server with multiple devices.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: print device name when it is removed

Notifies the user which device was removed. It is important in
a server with multiple devices.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>

show more ...


# ff92d010 27-Aug-2023 Ohad Sharabi <osharabi@habana.ai>

accel/habanalabs: trace dma map sgtable

Traces the DMA [un]map_sgtable using the new traces we added.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: trace dma map sgtable

Traces the DMA [un]map_sgtable using the new traces we added.

Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 1157b5d6 02-Mar-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs: optimize timestamp registration handler

Currently we use dynamic allocation inside the irq handler
in order to allocate free node to be used for the free jobs.

This operation is e

accel/habanalabs: optimize timestamp registration handler

Currently we use dynamic allocation inside the irq handler
in order to allocate free node to be used for the free jobs.

This operation is expensive, especially when we deal with large
burst of events records that get released at the same time.

The alternative is to have pre allocated pool of free nodes
and just fetch nodes from this pool at irq handling time instead
of allocating them.

In case the pool becomes full, then the driver will fallback to
dynamic allocations.

As part of the optimization also update the unregister flow
upon re-using a timestamp record, by making the operation much
simpler and quicker. We already have the record in the registration
flow and now we just seek to re-use with different interrupt.
Therefore, no need to look for buffer according to the user handle.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Tomer Tayar <ttayar@habana.ai>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 051868d9 27-Aug-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs: prevent sending heartbeat before events are enabled

After the heartbeat mechanism is now expanded to be used also
for EQ health check, we shouldn't send heartbeat messages
to FW be

accel/habanalabs: prevent sending heartbeat before events are enabled

After the heartbeat mechanism is now expanded to be used also
for EQ health check, we shouldn't send heartbeat messages
to FW before driver allow events to be received from FW.

Because if the driver will send two heartbeats before it enables
events to be received from FW, then the EQ health check
will fail and reset the device.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 7c4130e6 08-Aug-2023 farah kassabri <fkassabri@habana.ai>

accel/habanalabs/gaudi2: handle eq health heartbeat check

Add mechanism for fw eq health check. this will be done using two flows:
using the heartbeat mechanism and raising a dedicated interrupt to

accel/habanalabs/gaudi2: handle eq health heartbeat check

Add mechanism for fw eq health check. this will be done using two flows:
using the heartbeat mechanism and raising a dedicated interrupt to
indicate an eq failure like EQ full.
This patch will add implementation for the eq heartbeat for gaudi2 asic.

More info about the heartbeat mechanism:
Expand the heartbeat mechanism to monitor a new event that
will be sent from FW upon receiving heartbeat message.
that way driver can know that the eq is working or not.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# e0f45280 25-Jun-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: fix inline doc typos

Fix two typos

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel

accel/habanalabs: fix inline doc typos

Fix two typos

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# ab574f6a 25-Jun-2023 Dafna Hirschfeld <dhirschfeld@habana.ai>

accel/habanalabs: disable events ioctls on control device

Because it is not used and also, for graceful reset to work
those ioctls should run on the compute device.

Signed-off-by: Dafna Hirschfeld

accel/habanalabs: disable events ioctls on control device

Because it is not used and also, for graceful reset to work
those ioctls should run on the compute device.

Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# fe77368c 19-Feb-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: register compute device as an accel device

Register the compute device as an accel device, and remove the creation
of the habanalabs compute char device.

The IOCTLs in this patch

accel/habanalabs: register compute device as an accel device

Register the compute device as an accel device, and remove the creation
of the habanalabs compute char device.

The IOCTLs in this patch are still handled by the current driver
handler. Moving to DRM IOCTL handling requires moving the IOCTLs
numbers to a specific range, so it will be handled in subsequent
patches.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# a8ab1a81 23-May-2023 Ofir Bitton <obitton@habana.ai>

accel/habanalabs: add info ioctl for engine error reports

User gets notification for every engine error report, but he still
lacks the exact engine information. Hence, we allow user to query
for the

accel/habanalabs: add info ioctl for engine error reports

User gets notification for every engine error report, but he still
lacks the exact engine information. Hence, we allow user to query
for the exact engine reported an error.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 10926f60 13-Jun-2023 Tomer Tayar <ttayar@habana.ai>

accel/habanalabs: set default device release watchdog T/O as 30 sec

After being notified about certain errors, user is expected to finish
his post-errors actions and to release the device within som

accel/habanalabs: set default device release watchdog T/O as 30 sec

After being notified about certain errors, user is expected to finish
his post-errors actions and to release the device within some timeout,
after which is deice is being reset.
The default timeout value is 5 sec, which in some case is not enough for
a user application to collect debug data.
Increase the default value to 30 sec.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 37d72439 12-Jun-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: reset device if scrubbing failed

If scrubbing memory after user released device has failed it means
the device is in a bad state and should be reset.

Signed-off-by: Oded Gabbay <o

accel/habanalabs: reset device if scrubbing failed

If scrubbing memory after user released device has failed it means
the device is in a bad state and should be reset.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>

show more ...


# 89803af5 12-Jun-2023 Oded Gabbay <ogabbay@kernel.org>

accel/habanalabs: remove pdev check on idle check

Our simulator supports idle check so no need anymore to check if pdev
exists.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bit

accel/habanalabs: remove pdev check on idle check

Our simulator supports idle check so no need anymore to check if pdev
exists.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Ofir Bitton <obitton@habana.ai>

show more ...


# 964b1f67 06-Jun-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: rename fd_list to hpriv_list

Every time an FD is returned to the user, the driver adds
a corresponding private structure to the list.
Yet, it's still a list of private structures r

accel/habanalabs: rename fd_list to hpriv_list

Every time an FD is returned to the user, the driver adds
a corresponding private structure to the list.
Yet, it's still a list of private structures rather than of FDs.
Remove, as well, an unnecessary comment.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 942f18c5 06-Jun-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: call put_pid after hpriv list is updated

Because we might still be using related resources, decrementing PID's
reference count should be done at later stages of the device release.

accel/habanalabs: call put_pid after hpriv list is updated

Because we might still be using related resources, decrementing PID's
reference count should be done at later stages of the device release.
A good place is right after the representing private structure is
removed from LKD's list.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


# 2b541cf9 05-Jun-2023 Koby Elbaz <kelbaz@habana.ai>

accel/habanalabs: print return code when process termination fails

As part of driver teardown, we attempt to kill all user processes.
It shouldn't fail, but if it does we want to print the error cod

accel/habanalabs: print return code when process termination fails

As part of driver teardown, we attempt to kill all user processes.
It shouldn't fail, but if it does we want to print the error code that
the kapi returned to us.

Signed-off-by: Koby Elbaz <kelbaz@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>

show more ...


123