#
fa58b594 |
| 12-Feb-2024 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: modify pci health check
Today we read PCI VENDOR-ID in order to make sure PCI link is healthy. Apparently the VENDOR-ID might be stored on host and hence, when we read it we might not access the PCI bus. In order to make sure PCI health check is reliable, we will start checking the DEVICE-ID instead.
Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
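The reasoning in this commit can be illustrated with a small user-space sketch. It is not the driver's code: `cfg_read16_fn`, `pci_link_is_healthy`, the two fake readers, and the example ID values are all hypothetical names and placeholders introduced here. The only facts relied on are the standard PCI configuration-space offsets (0x00 for Vendor ID, 0x02 for Device ID) and the convention that a read from a non-responding device returns all ones.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_CFG_VENDOR_ID 0x00 /* standard PCI config-space offset */
#define PCI_CFG_DEVICE_ID 0x02 /* standard PCI config-space offset */

/* Placeholder IDs for the example only; not real device values. */
#define EXAMPLE_VENDOR_ID 0x1234
#define EXAMPLE_DEVICE_ID 0xabcd

/* Hypothetical accessor: reads 16 bits of config space at 'offset'. */
typedef uint16_t (*cfg_read16_fn)(unsigned int offset);

/* The link is treated as healthy only if DEVICE-ID reads back as expected. */
bool pci_link_is_healthy(cfg_read16_fn read16, uint16_t expected_device_id)
{
	return read16(PCI_CFG_DEVICE_ID) == expected_device_id;
}

/* Fake reader for a working link: both IDs come back correctly. */
uint16_t fake_healthy_read16(unsigned int offset)
{
	return offset == PCI_CFG_VENDOR_ID ? EXAMPLE_VENDOR_ID : EXAMPLE_DEVICE_ID;
}

/*
 * Fake reader for a dead link where the vendor ID is still answered from
 * a host-side copy, while anything that must reach the device reads 0xffff.
 */
uint16_t fake_dead_link_read16(unsigned int offset)
{
	return offset == PCI_CFG_VENDOR_ID ? EXAMPLE_VENDOR_ID : 0xffff;
}
```

With the dead-link reader, a check based on the vendor ID would still pass, whereas the device ID check reports the failure, which is the motivation the commit message gives.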
|
#
731d320e |
| 01-Jan-2024 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs: remove call to deprecated function
In newer kernel versions, irq_set_affinity_hint() is deprecated. Instead, use the newer version which is irq_set_affinity_and_hint().
Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
246d8b6c |
| 24-Dec-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: abort device reset for consecutive heartbeat failures
The mechanism of aborting device reset for consecutive fatal errors currently applies only to fatal errors that are reported by FW. A non-responsive FW and consecutive heartbeat failures are also considered fatal, so add them to this mechanism as well to avoid recurring device resets in such a case.
Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
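A minimal sketch of such an abort mechanism, assuming a simple counter of consecutive fatal errors shared by both error sources. Everything here is hypothetical: the threshold value, the `reset_tracker` structure, and the function names are invented for illustration, and how the real driver counts or clears consecutive errors is not stated in the commit message.

```c
#include <stdbool.h>

/* Hypothetical limit; the driver's actual threshold is not given here. */
#define MAX_CONSECUTIVE_FATAL 5

enum fatal_source {
	FATAL_FROM_FW_EVENT,  /* fatal error reported by FW */
	FATAL_FROM_HEARTBEAT, /* non-responsive FW / heartbeat failure */
};

struct reset_tracker {
	unsigned int consecutive_fatal;
};

/*
 * Returns true if a device reset should proceed, false if it should be
 * aborted because too many fatal errors happened back to back.
 */
bool should_reset_on_fatal(struct reset_tracker *t, enum fatal_source src)
{
	(void)src; /* both sources count toward the same limit */

	if (t->consecutive_fatal >= MAX_CONSECUTIVE_FATAL)
		return false;

	t->consecutive_fatal++;
	return true;
}

/* Assumed recovery hook: clears the streak once the device is healthy. */
void note_device_recovered(struct reset_tracker *t)
{
	t->consecutive_fatal = 0;
}
```

The point of the sketch is only that heartbeat failures feed the same counter as FW-reported errors, so a permanently stuck FW cannot trigger an endless reset loop.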
|
#
d0df8a35 |
| 14-Dec-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: fix DRAM BAR base address calculation
When the DRAM region size in the BAR is not a power of 2, the corresponding BAR base address should be calculated using the offset from the DRAM start address, and not directly using the DRAM address.
Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
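The difference between the two calculations can be shown with plain integer arithmetic. This is a sketch under assumptions, not the driver's code: the function names and the example addresses are hypothetical, and the BAR base is assumed to be the start of the `bar_size`-sized window that contains the requested address.

```c
#include <stdint.h>

/*
 * Rounds the absolute address down to a multiple of bar_size. This only
 * yields a window that starts a whole number of windows past dram_start
 * when dram_start itself is a multiple of bar_size.
 */
uint64_t bar_base_from_address(uint64_t addr, uint64_t bar_size)
{
	return (addr / bar_size) * bar_size;
}

/*
 * Rounds the offset from the DRAM start down to a multiple of bar_size,
 * then adds the DRAM start back. Works for any bar_size.
 */
uint64_t bar_base_from_offset(uint64_t addr, uint64_t dram_start,
			      uint64_t bar_size)
{
	uint64_t offset = addr - dram_start;

	return dram_start + (offset / bar_size) * bar_size;
}
```

For example, with a DRAM start of 0x1000, a window size of 0x300, and an address of 0x1750, the offset-based version gives 0x1600 (two whole windows past the DRAM start), while rounding the absolute address gives 0x1500, which is not window-aligned relative to the DRAM start.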
|
#
e91c37f1 |
| 21-Sep-2023 |
Dani Liberman <dliberman@habana.ai> |
accel/habanalabs/gaudi2: add interrupt affinity for user interrupts
User interrupts are MSI-X interrupts coming from Gaudi2 that have a specific range of IDs and are assigned for the sole use of the user process that opened the Gaudi2 device (reminder: there can be only a single user process running on Gaudi2 at any given time).
The interrupts are allocated and managed by the driver and therefore, the user expects the driver to initialize them properly, which also includes setting the affinity to the related CPU cores of the device's NUMA node to get maximum performance.
Signed-off-by: Dani Liberman <dliberman@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
4b0b1fbc |
| 20-Jul-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: set hard reset flag if graceful reset is skipped
hl_device_cond_reset() might be called with the hard reset flag unset, because a compute reset upon device release as part of a graceful reset is valid. If the conditions for graceful reset are not met, hl_device_reset() will be called for an immediate reset. In this case a compute reset is not valid, so it will be replaced with a hard reset together with a debug message about it. This message might be confusing, as it implies that a compute reset was requested when it shouldn't. To prevent this confusion, set the hard reset flag in hl_device_cond_reset() if going to an immediate reset.
Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
d1958dce |
| 31-Oct-2023 |
Farah Kassabri <fkassabri@habana.ai> |
accel/habanalabs: fix EQ heartbeat mechanism
Stop rescheduling another heartbeat check when EQ heartbeat check fails as it generates confusing logs in dmesg that the heartbeat fails.
Signed-off-by: Farah Kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
42422993 |
| 30-Oct-2023 |
Oded Gabbay <ogabbay@kernel.org> |
accel/habanalabs: add support for Gaudi2C device
A Gaudi2 with a PCI revision ID of '3' is a Gaudi2C device and should be detected and initialized as Gaudi2.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
e8bc0c1b |
| 29-Oct-2023 |
Farah Kassabri <fkassabri@habana.ai> |
accel/habanalabs: add log when eq event is not received
Add an error log when no EQ event is received from FW, to cover a scenario in which FW is stuck for some reason. In such a case the driver will receive neither the EQ error interrupt nor the EQ heartbeat event, and will just initiate a reset without any indication in dmesg about the reason.
Signed-off-by: Farah Kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
3652117f |
| 22-Nov-2023 |
Christian Brauner <brauner@kernel.org> |
eventfd: simplify eventfd_signal()
Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument.
Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
|
#
6fc69ca8 |
| 21-Sep-2023 |
Oded Gabbay <ogabbay@kernel.org> |
accel/habanalabs: print device name when it is removed
Notifies the user which device was removed. It is important in a server with multiple devices.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
|
#
ff92d010 |
| 27-Aug-2023 |
Ohad Sharabi <osharabi@habana.ai> |
accel/habanalabs: trace dma map sgtable
Traces the DMA [un]map_sgtable using the new traces we added.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
1157b5d6 |
| 02-Mar-2023 |
farah kassabri <fkassabri@habana.ai> |
accel/habanalabs: optimize timestamp registration handler
Currently we use dynamic allocation inside the IRQ handler to allocate a free node to be used for the free jobs.
This operation is expensive, especially when dealing with a large burst of event records that get released at the same time.
The alternative is to have a pre-allocated pool of free nodes and to fetch nodes from this pool at IRQ handling time instead of allocating them.
If the pool becomes full, the driver will fall back to dynamic allocation.
As part of the optimization, also update the unregister flow upon re-using a timestamp record, making the operation much simpler and quicker. We already have the record in the registration flow and now just seek to re-use it with a different interrupt. Therefore, there is no need to look up the buffer according to the user handle.
Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Tomer Tayar <ttayar@habana.ai> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
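The pool-with-fallback pattern described in this commit can be sketched in user-space C as follows. This is an illustration under assumptions, not the driver's implementation: the pool size, the `free_node` and `node_pool` types, and all function names are hypothetical, `malloc()` stands in for the kernel allocator, and "pool becomes full" is read here as "every pool node is already in use".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE 4 /* hypothetical pool size */

struct free_node {
	bool from_pool; /* true if the node lives inside the pool array */
	int payload;    /* stand-in for the real per-job data */
};

struct node_pool {
	struct free_node nodes[POOL_SIZE];
	size_t next; /* index of the next unused pool node */
};

/* Take a node from the pool; fall back to malloc() once it is used up. */
struct free_node *node_get(struct node_pool *pool)
{
	struct free_node *n;

	if (pool->next < POOL_SIZE) {
		n = &pool->nodes[pool->next++];
		n->from_pool = true;
		return n;
	}

	n = malloc(sizeof(*n));
	if (n)
		n->from_pool = false;
	return n;
}

/* Only dynamically allocated nodes are freed individually. */
void node_put(struct free_node *n)
{
	if (n && !n->from_pool)
		free(n);
}

/* Makes every pool node available again, e.g. after a batch is handled. */
void node_pool_reset(struct node_pool *pool)
{
	pool->next = 0;
}
```

The common case (a burst that fits in the pool) costs only an index increment in the hot path, and the allocator is touched only when the burst is larger than the pool.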
|
#
051868d9 |
| 27-Aug-2023 |
farah kassabri <fkassabri@habana.ai> |
accel/habanalabs: prevent sending heartbeat before events are enabled
Now that the heartbeat mechanism has been expanded to also serve as an EQ health check, we shouldn't send heartbeat messages to FW before the driver allows events to be received from FW.
If the driver sends two heartbeats before it enables events to be received from FW, the EQ health check will fail and reset the device.
Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
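One way to picture the failure mode and the guard is the small state machine below. It is a sketch under assumptions rather than the driver's logic: the `hb_state` fields, the result codes, and the function names are hypothetical, and the health check is assumed to fail when a heartbeat is due while the previous one was never answered by an EQ event.

```c
#include <stdbool.h>

struct hb_state {
	bool events_enabled; /* driver allows events from FW */
	bool hb_outstanding; /* a heartbeat was sent earlier */
	bool eq_event_seen;  /* FW answered the last heartbeat via the EQ */
};

enum hb_result {
	HB_SKIPPED,    /* events not enabled yet, nothing sent */
	HB_SENT,       /* heartbeat sent to FW */
	HB_EQ_FAILURE, /* previous heartbeat got no EQ event */
};

/* Called periodically; decides whether to send the next heartbeat. */
enum hb_result heartbeat_tick(struct hb_state *s)
{
	/* The guard: no heartbeats until events can actually be received. */
	if (!s->events_enabled)
		return HB_SKIPPED;

	if (s->hb_outstanding && !s->eq_event_seen)
		return HB_EQ_FAILURE;

	s->hb_outstanding = true;
	s->eq_event_seen = false;
	return HB_SENT;
}

/* Called when the EQ heartbeat event from FW is received. */
void eq_heartbeat_event(struct hb_state *s)
{
	s->eq_event_seen = true;
}
```

Without the `events_enabled` guard, two ticks before events are enabled would look exactly like an unanswered heartbeat and would be reported as an EQ failure.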
|
#
7c4130e6 |
| 08-Aug-2023 |
farah kassabri <fkassabri@habana.ai> |
accel/habanalabs/gaudi2: handle eq health heartbeat check
Add a mechanism for FW EQ health check. This will be done using two flows: using the heartbeat mechanism, and raising a dedicated interrupt to indicate an EQ failure such as EQ full. This patch adds the implementation of the EQ heartbeat for the Gaudi2 ASIC.
More info about the heartbeat mechanism: expand the heartbeat mechanism to monitor a new event that will be sent from FW upon receiving a heartbeat message. That way the driver can know whether the EQ is working.
Signed-off-by: farah kassabri <fkassabri@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
e0f45280 |
| 25-Jun-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: fix inline doc typos
Fix two typos
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
ab574f6a |
| 25-Jun-2023 |
Dafna Hirschfeld <dhirschfeld@habana.ai> |
accel/habanalabs: disable events ioctls on control device
Because it is not used and also, for graceful reset to work those ioctls should run on the compute device.
Signed-off-by: Dafna Hirschfeld <dhirschfeld@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
fe77368c |
| 19-Feb-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: register compute device as an accel device
Register the compute device as an accel device, and remove the creation of the habanalabs compute char device.
The IOCTLs in this patch are still handled by the current driver handler. Moving to DRM IOCTL handling requires moving the IOCTLs numbers to a specific range, so it will be handled in subsequent patches.
Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
a8ab1a81 |
| 23-May-2023 |
Ofir Bitton <obitton@habana.ai> |
accel/habanalabs: add info ioctl for engine error reports
The user gets a notification for every engine error report, but still lacks the exact engine information. Hence, we allow the user to query for the exact engine that reported an error.
Signed-off-by: Ofir Bitton <obitton@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
10926f60 |
| 13-Jun-2023 |
Tomer Tayar <ttayar@habana.ai> |
accel/habanalabs: set default device release watchdog T/O as 30 sec
After being notified about certain errors, the user is expected to finish post-error actions and to release the device within some timeout, after which the device is reset. The default timeout value is 5 sec, which in some cases is not enough for a user application to collect debug data. Increase the default value to 30 sec.
Signed-off-by: Tomer Tayar <ttayar@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
37d72439 |
| 12-Jun-2023 |
Oded Gabbay <ogabbay@kernel.org> |
accel/habanalabs: reset device if scrubbing failed
If scrubbing memory after user released device has failed it means the device is in a bad state and should be reset.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
|
#
89803af5 |
| 12-Jun-2023 |
Oded Gabbay <ogabbay@kernel.org> |
accel/habanalabs: remove pdev check on idle check
Our simulator supports idle check so no need anymore to check if pdev exists.
Signed-off-by: Oded Gabbay <ogabbay@kernel.org> Reviewed-by: Ofir Bitton <obitton@habana.ai>
|
#
964b1f67 |
| 06-Jun-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: rename fd_list to hpriv_list
Every time an FD is returned to the user, the driver adds a corresponding private structure to the list. Yet, it's still a list of private structures rather than of FDs. Remove, as well, an unnecessary comment.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
942f18c5 |
| 06-Jun-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: call put_pid after hpriv list is updated
Because we might still be using related resources, decrementing PID's reference count should be done at later stages of the device release. A good place is right after the representing private structure is removed from LKD's list.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
#
2b541cf9 |
| 05-Jun-2023 |
Koby Elbaz <kelbaz@habana.ai> |
accel/habanalabs: print return code when process termination fails
As part of driver teardown, we attempt to kill all user processes. It shouldn't fail, but if it does we want to print the error code that the kapi returned to us.
Signed-off-by: Koby Elbaz <kelbaz@habana.ai> Reviewed-by: Oded Gabbay <ogabbay@kernel.org> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|