Btrfs: Allocator Hints

This commit is contained in:
Forza 2024-12-01 09:56:36 +01:00
parent 1f4101c55a
commit 0ade2f1039
7 changed files with 2709 additions and 0 deletions

View File

@ -0,0 +1,41 @@
# Allocator Hints for Btrfs
This repository contains patches for adding allocator hints to Btrfs. These patches allow users to configure the Btrfs chunk allocator to prioritise specific devices for metadata or data allocation. This is particularly useful for mixed-device setups, such as combining SSDs with HDDs or NVMe devices with SSDs, to optimise performance and storage utilisation.
## About the Patches
The allocator hints patches were initially proposed in a series of early submissions to the Linux Btrfs mailing list in 2022 by Goffredo Baroncelli. Since then, the patches have been maintained and updated by Kakra to ensure compatibility with newer kernels and to add fixes and enhancements.
### Available Versions
The patches in this repository are named according to the kernel version they are designed for. For example:
- **btrfs_allocator_hints-6.1.patch**: For kernel version 6.1
- **btrfs_allocator_hints-6.6.patch**: For kernel version 6.6 (initial version)
- **btrfs_allocator_hints-6.6_v2.patch**: Kernel version 6.6 with additional fixes
- **btrfs_allocator_hints-6.6_v3.patch**: Kernel version 6.6 with further tweaks
- **btrfs_allocator_hints-6.6_v4.patch**: Latest patch for kernel version 6.6
- **btrfs_allocator_hints-6.12_v1.patch**: For kernel version 6.12
Each patch is tailored to work with its respective kernel version. Ensure you match the patch version to your Linux kernel version to avoid compatibility issues.
### Original Discussion
You can find the original discussion about these patches on the Linux Btrfs mailing list:
- [Allocator Hints Patch Series on lore.kernel.org](https://lore.kernel.org/all/Yhk5kyZL1J8hoQvX%40zen/T/)
### Author of this patch set
This repository is a mirror of the patches that have been graciously compiled and maintained by Kakra. They are available on his GitHub page:
[Kakra's GitHub Repository](https://github.com/kakra/linux/pull/36)
### More Information
For an in-depth explanation of the patches, how they work, and real-world examples, visit the wiki page:
- [Allocator Hints on TNOnline Wiki](https://wiki.tnonline.net/w/Btrfs/Allocator_Hints)
## Disclaimer
These patches are experimental and not officially included in the Linux kernel. Use them at your own risk. Always ensure you have backups of your data before testing these patches in production environments.
For questions or feedback, feel free to visit the wiki or reach out via [irc](https://wiki.tnonline.net/w/Btrfs/IRC).
## License
The licensing of these patches has not been explicitly stated in this repository. However, given their origin and intended use with the Linux kernel, they are likely under the terms of the **GNU General Public License, version 2 (GPLv2)**, which is the license of the [Linux kernel](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/COPYING) itself.
If you have any doubts or require confirmation, it is recommended to contact the original authors of the patches (e.g., Goffredo Baroncelli or Kakra) for clarification.
For more details about the GPLv2 license, see [https://www.gnu.org/licenses/old-licenses/gpl-2.0.html](https://www.gnu.org/licenses/old-licenses/gpl-2.0.html).

View File

@ -0,0 +1,386 @@
From 5a784f33bc19b80162bd6f9d988282c9fd3afd04 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:04 +0200
Subject: [PATCH 1/4] btrfs: add flags to give an hint to the chunk allocator
Add the following flags to give an hint about which chunk should be
allocated in which a disk.
The following flags are created:
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
only data chunk allowed
Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
---
include/uapi/linux/btrfs_tree.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 1f7a38ec6ac316..48f4fbdd22c440 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -389,6 +389,20 @@ struct btrfs_key {
__u64 offset;
} __attribute__ ((__packed__));
+/* dev_item.type */
+
+/* btrfs chunk allocation hints */
+#define BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT 3
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_DATA (0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_METADATA (1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* 5..7 are unused values */
+
struct btrfs_dev_item {
/* the internal btrfs device id */
__le64 devid;
From 9d6cd46fd8990e210d7626c298f0a61de3e2e0d5 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:05 +0200
Subject: [PATCH 2/4] btrfs: export dev_item.type in
/sys/fs/btrfs/<uuid>/devinfo/<devid>/type
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/sysfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 74fef1f49c358c..efa260bab9442b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1761,6 +1761,16 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
}
BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+ struct kobj_attribute *a, char *buf)
+{
+ struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+ devid_kobj);
+
+ return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
/*
* Information about one device.
*
@@ -1774,6 +1784,7 @@ static struct attribute *devid_attrs[] = {
BTRFS_ATTR_PTR(devid, replace_target),
BTRFS_ATTR_PTR(devid, scrub_speed_max),
BTRFS_ATTR_PTR(devid, writeable),
+ BTRFS_ATTR_PTR(devid, type),
NULL
};
ATTRIBUTE_GROUPS(devid);
From 27fe50aca5ccae65d616c36c0dc82bc4314b5f13 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:06 +0200
Subject: [PATCH 3/4] btrfs: change the DEV_ITEM 'type' field via sysfs
Signed-off-by: Kai Krakow <kai@kaishome.de>
---
fs/btrfs/sysfs.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.c | 2 +-
fs/btrfs/volumes.h | 2 ++
3 files changed, 58 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index efa260bab9442b..43a15b9fe50166 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1769,7 +1769,61 @@ static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
}
-BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
+static ssize_t btrfs_devinfo_type_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf, size_t len)
+{
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_root *root;
+ struct btrfs_device *device;
+ int ret;
+ struct btrfs_trans_handle *trans;
+
+ u64 type, prev_type;
+
+ device = container_of(kobj, struct btrfs_device, devid_kobj);
+ fs_info = device->fs_info;
+ if (!fs_info)
+ return -EPERM;
+
+ root = fs_info->chunk_root;
+ if (sb_rdonly(fs_info->sb))
+ return -EROFS;
+
+ ret = kstrtou64(buf, 0, &type);
+ if (ret < 0)
+ return -EINVAL;
+
+ /* for now, allow to touch only the 'allocation hint' bits */
+ if (type & ~((1 << BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1))
+ return -EINVAL;
+
+ trans = btrfs_start_transaction(root, 1);
+ if (IS_ERR(trans))
+ return PTR_ERR(trans);
+
+ prev_type = device->type;
+ device->type = type;
+
+ ret = btrfs_update_device(trans, device);
+
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ goto abort;
+ }
+
+ ret = btrfs_commit_transaction(trans);
+ if (ret < 0)
+ goto abort;
+
+ return len;
+abort:
+ device->type = prev_type;
+ return ret;
+}
+BTRFS_ATTR_RW(devid, type, btrfs_devinfo_type_show, btrfs_devinfo_type_store);
/*
* Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 635f45f1a2ef80..435162d1531185 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2841,7 +2841,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
return ret;
}
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
struct btrfs_device *device)
{
int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 099def5613b875..a7e52e9d7716af 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -755,6 +755,8 @@ int btrfs_bg_type_to_factor(u64 flags);
const char *btrfs_bg_type_to_raid_name(u64 flags);
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device);
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
From 3eb2fd087d6a0f0e51058bcd397222ee6c8466e6 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:07 +0200
Subject: [PATCH 4/4] btrfs: add allocator_hint mode
When this mode is enabled, the chunk allocation policy is modified as
follow.
Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This means also that it is a preferred choice.
Each time the allocator allocates a chunk of type X , first it takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
if the space is not enough, it uses also the other disks, with the
exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
type of chunk (i.e. not X).
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/volumes.c | 97 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 1 +
2 files changed, 97 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 435162d1531185..1b764fddd8edb1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -174,6 +174,19 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
return BTRFS_BG_FLAG_TO_INDEX(profile);
}
+#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
+#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
+
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
+ [BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
+ [BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ /* the other values are set to 0 */
+};
+
const char *btrfs_bg_type_to_raid_name(u64 flags)
{
const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5020,13 +5033,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
}
/*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
*/
static int btrfs_cmp_device_info(const void *a, const void *b)
{
const struct btrfs_device_info *di_a = a;
const struct btrfs_device_info *di_b = b;
+ if (di_a->alloc_hint > di_b->alloc_hint)
+ return -1;
+ if (di_a->alloc_hint < di_b->alloc_hint)
+ return 1;
if (di_a->max_avail > di_b->max_avail)
return -1;
if (di_a->max_avail < di_b->max_avail)
@@ -5179,6 +5197,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
int ndevs = 0;
u64 max_avail;
u64 dev_offset;
+ int hint;
+ int i;
/*
* in the first pass through the devices list, we gather information
@@ -5231,16 +5251,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+
+ if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+ (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+ /*
+ * if mixed bg set all the alloc_hint
+ * fields to the same value, so the sorting
+ * is not affected
+ */
+ devices_info[ndevs].alloc_hint = 0;
+ } else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_METADATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (data disk
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ } else { /* BTRFS_BLOCK_GROUP_METADATA */
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_DATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (metadata hint
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ }
+
++ndevs;
}
ctl->ndevs = ndevs;
+ /*
+ * no devices available
+ */
+ if (!ndevs)
+ return 0;
+
/*
* now sort the devices by hole size / available space
*/
sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
btrfs_cmp_device_info, NULL);
+ /*
+ * select the minimum set of disks grouped by hint that
+ * can host the chunk
+ */
+ ndevs = 0;
+ while (ndevs < ctl->ndevs) {
+ hint = devices_info[ndevs++].alloc_hint;
+ while (ndevs < ctl->ndevs &&
+ devices_info[ndevs].alloc_hint == hint)
+ ndevs++;
+ if (ndevs >= ctl->devs_min)
+ break;
+ }
+
+ BUG_ON(ndevs > ctl->ndevs);
+ ctl->ndevs = ndevs;
+
+ /*
+ * the next layers require the devices_info ordered by
+ * max_avail. If we are returing two (or more) different
+ * group of alloc_hint, this is not always true. So sort
+ * these gain.
+ */
+
+ for (i = 0 ; i < ndevs ; i++)
+ devices_info[i].alloc_hint = 0;
+
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+
return 0;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index a7e52e9d7716af..7ab1447cad0459 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -519,6 +519,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int alloc_hint;
};
struct btrfs_raid_attr {

View File

@ -0,0 +1,439 @@
From 5e49c78f38cc7f5b7ec012021c8422c1db98ef7e Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:04 +0200
Subject: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
Add the following flags to give an hint about which chunk should be
allocated in which a disk.
The following flags are created:
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
only data chunk allowed
Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
---
include/uapi/linux/btrfs_tree.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc29d273845d84..71c6135dc7cfb2 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -578,6 +578,20 @@ struct btrfs_node {
struct btrfs_key_ptr ptrs[];
} __attribute__ ((__packed__));
+/* dev_item.type */
+
+/* btrfs chunk allocation hints */
+#define BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT 3
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_DATA (0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_METADATA (1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* 5..7 are unused values */
+
struct btrfs_dev_item {
/* the internal btrfs device id */
__le64 devid;
From 160344ae9ae37b32593adc43716172c37b0a734c Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:05 +0200
Subject: [PATCH 2/5] btrfs: export dev_item.type in
/sys/fs/btrfs/<uuid>/devinfo/<devid>/type
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/sysfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 03926ad467c919..fe07a7cbcf74c4 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1972,6 +1972,16 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
}
BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+ struct kobj_attribute *a, char *buf)
+{
+ struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+ devid_kobj);
+
+ return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
/*
* Information about one device.
*
@@ -1985,6 +1995,7 @@ static struct attribute *devid_attrs[] = {
BTRFS_ATTR_PTR(devid, replace_target),
BTRFS_ATTR_PTR(devid, scrub_speed_max),
BTRFS_ATTR_PTR(devid, writeable),
+ BTRFS_ATTR_PTR(devid, type),
NULL
};
ATTRIBUTE_GROUPS(devid);
From 29637f2e3a69fe77a8097bd772a8a7803b9ec576 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:06 +0200
Subject: [PATCH 3/5] btrfs: change the DEV_ITEM 'type' field via sysfs
Signed-off-by: Kai Krakow <kai@kaishome.de>
---
fs/btrfs/sysfs.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.c | 2 +-
fs/btrfs/volumes.h | 2 ++
3 files changed, 58 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index fe07a7cbcf74c4..3675d961b39a2a 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1980,7 +1980,61 @@ static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
}
-BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
+static ssize_t btrfs_devinfo_type_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf, size_t len)
+{
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_root *root;
+ struct btrfs_device *device;
+ int ret;
+ struct btrfs_trans_handle *trans;
+
+ u64 type, prev_type;
+
+ device = container_of(kobj, struct btrfs_device, devid_kobj);
+ fs_info = device->fs_info;
+ if (!fs_info)
+ return -EPERM;
+
+ root = fs_info->chunk_root;
+ if (sb_rdonly(fs_info->sb))
+ return -EROFS;
+
+ ret = kstrtou64(buf, 0, &type);
+ if (ret < 0)
+ return -EINVAL;
+
+ /* for now, allow to touch only the 'allocation hint' bits */
+ if (type & ~((1 << BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1))
+ return -EINVAL;
+
+ trans = btrfs_start_transaction(root, 1);
+ if (IS_ERR(trans))
+ return PTR_ERR(trans);
+
+ prev_type = device->type;
+ device->type = type;
+
+ ret = btrfs_update_device(trans, device);
+
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ goto abort;
+ }
+
+ ret = btrfs_commit_transaction(trans);
+ if (ret < 0)
+ goto abort;
+
+ return len;
+abort:
+ device->type = prev_type;
+ return ret;
+}
+BTRFS_ATTR_RW(devid, type, btrfs_devinfo_type_show, btrfs_devinfo_type_store);
/*
* Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index eb51b609190fb5..620a9ea74e7558 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2882,7 +2882,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
return ret;
}
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
struct btrfs_device *device)
{
int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 4481575dd70f35..7bb14d51bffc58 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -836,6 +836,8 @@ int btrfs_bg_type_to_factor(u64 flags);
const char *btrfs_bg_type_to_raid_name(u64 flags);
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device);
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
const u8 *btrfs_sb_fsid_ptr(const struct btrfs_super_block *sb);
From 970b99e160487e9765b6e7db9f8a89a96ce79811 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:07 +0200
Subject: [PATCH 4/5] btrfs: add allocator_hint mode
When this mode is enabled, the chunk allocation policy is modified as
follow.
Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This means also that it is a preferred choice.
Each time the allocator allocates a chunk of type X , first it takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
if the space is not enough, it uses also the other disks, with the
exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
type of chunk (i.e. not X).
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/volumes.c | 97 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 1 +
2 files changed, 97 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 620a9ea74e7558..e66700fc8dcd4e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -184,6 +184,19 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
return BTRFS_BG_FLAG_TO_INDEX(profile);
}
+#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
+#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
+
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
+ [BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
+ [BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ /* the other values are set to 0 */
+};
+
const char *btrfs_bg_type_to_raid_name(u64 flags)
{
const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5022,13 +5035,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
}
/*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
*/
static int btrfs_cmp_device_info(const void *a, const void *b)
{
const struct btrfs_device_info *di_a = a;
const struct btrfs_device_info *di_b = b;
+ if (di_a->alloc_hint > di_b->alloc_hint)
+ return -1;
+ if (di_a->alloc_hint < di_b->alloc_hint)
+ return 1;
if (di_a->max_avail > di_b->max_avail)
return -1;
if (di_a->max_avail < di_b->max_avail)
@@ -5181,6 +5199,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
int ndevs = 0;
u64 max_avail;
u64 dev_offset;
+ int hint;
+ int i;
/*
* in the first pass through the devices list, we gather information
@@ -5233,16 +5253,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+
+ if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+ (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+ /*
+ * if mixed bg set all the alloc_hint
+ * fields to the same value, so the sorting
+ * is not affected
+ */
+ devices_info[ndevs].alloc_hint = 0;
+ } else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_METADATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (data disk
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ } else { /* BTRFS_BLOCK_GROUP_METADATA */
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_DATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (metadata hint
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ }
+
++ndevs;
}
ctl->ndevs = ndevs;
+ /*
+ * no devices available
+ */
+ if (!ndevs)
+ return 0;
+
/*
* now sort the devices by hole size / available space
*/
sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
btrfs_cmp_device_info, NULL);
+ /*
+ * select the minimum set of disks grouped by hint that
+ * can host the chunk
+ */
+ ndevs = 0;
+ while (ndevs < ctl->ndevs) {
+ hint = devices_info[ndevs++].alloc_hint;
+ while (ndevs < ctl->ndevs &&
+ devices_info[ndevs].alloc_hint == hint)
+ ndevs++;
+ if (ndevs >= ctl->devs_min)
+ break;
+ }
+
+ BUG_ON(ndevs > ctl->ndevs);
+ ctl->ndevs = ndevs;
+
+ /*
+ * the next layers require the devices_info ordered by
+ * max_avail. If we are returing two (or more) different
+ * group of alloc_hint, this is not always true. So sort
+ * these gain.
+ */
+
+ for (i = 0 ; i < ndevs ; i++)
+ devices_info[i].alloc_hint = 0;
+
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+
return 0;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7bb14d51bffc58..f3c5437e270a22 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -565,6 +565,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int alloc_hint;
};
struct btrfs_raid_attr {
From 1c1f2e27d3055b7721468c6980479a043f48e2b3 Mon Sep 17 00:00:00 2001
From: Kai Krakow <kk@netactive.de>
Date: Thu, 27 Jun 2024 20:05:58 +0200
Subject: [PATCH 5/5] btrfs: add allocator_hint for no allocation preferred
This is useful where you want to prevent new allocations of chunks on a
disk which is going to removed from the pool anyways, e.g. due to bad
blocks or because it's slow.
Signed-off-by: Kai Krakow <kai@kaishome.de>
---
fs/btrfs/volumes.c | 6 +++++-
include/uapi/linux/btrfs_tree.h | 2 ++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e66700fc8dcd4e..c6aa93fae9aa65 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -194,6 +194,7 @@ static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_NONE] = 99,
/* the other values are set to 0 */
};
@@ -5289,7 +5290,10 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
* sort also by hint (metadata hint
* higher priority)
*/
- devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ if (hint == BTRFS_DEV_ALLOCATION_PREFERRED_NONE)
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ else
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
}
++ndevs;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 71c6135dc7cfb2..92bcc59b129a97 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -590,6 +590,8 @@ struct btrfs_node {
#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
/* only data chunk allowed */
#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* preferred no chunk, but chunks allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_NONE (4ULL)
/* 5..7 are unused values */
struct btrfs_dev_item {

View File

@ -0,0 +1,386 @@
From 1946e35846bf03304694504d2d85a9565ca2b558 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:04 +0200
Subject: [PATCH 1/4] btrfs: add flags to give an hint to the chunk allocator
Add the following flags to give an hint about which chunk should be
allocated in which a disk.
The following flags are created:
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
only data chunk allowed
Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
---
include/uapi/linux/btrfs_tree.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc3c32186d7eb1..0261ed5f0adabd 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -557,6 +557,20 @@ struct btrfs_node {
struct btrfs_key_ptr ptrs[];
} __attribute__ ((__packed__));
+/* dev_item.type */
+
+/* btrfs chunk allocation hints */
+#define BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT 3
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_DATA (0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_METADATA (1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* 5..7 are unused values */
+
struct btrfs_dev_item {
/* the internal btrfs device id */
__le64 devid;
From 1072ba33d70ebefd2af5274152b8c16649476517 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:05 +0200
Subject: [PATCH 2/4] btrfs: export dev_item.type in
/sys/fs/btrfs/<uuid>/devinfo/<devid>/type
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/sysfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b1d1ac25237b7a..561a2a7d78b5ed 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1817,6 +1817,16 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
}
BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+ struct kobj_attribute *a, char *buf)
+{
+ struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+ devid_kobj);
+
+ return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
/*
* Information about one device.
*
@@ -1830,6 +1840,7 @@ static struct attribute *devid_attrs[] = {
BTRFS_ATTR_PTR(devid, replace_target),
BTRFS_ATTR_PTR(devid, scrub_speed_max),
BTRFS_ATTR_PTR(devid, writeable),
+ BTRFS_ATTR_PTR(devid, type),
NULL
};
ATTRIBUTE_GROUPS(devid);
From 95e2eb2e2172f328d382fb02cd10cd54196e713c Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:06 +0200
Subject: [PATCH 3/4] btrfs: change the DEV_ITEM 'type' field via sysfs
Signed-off-by: Kai Krakow <kai@kaishome.de>
---
fs/btrfs/sysfs.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.c | 2 +-
fs/btrfs/volumes.h | 2 ++
3 files changed, 58 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 561a2a7d78b5ed..9246161bfdecc7 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1825,7 +1825,61 @@ static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
}
-BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
+static ssize_t btrfs_devinfo_type_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf, size_t len)
+{
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_root *root;
+ struct btrfs_device *device;
+ int ret;
+ struct btrfs_trans_handle *trans;
+
+ u64 type, prev_type;
+
+ device = container_of(kobj, struct btrfs_device, devid_kobj);
+ fs_info = device->fs_info;
+ if (!fs_info)
+ return -EPERM;
+
+ root = fs_info->chunk_root;
+ if (sb_rdonly(fs_info->sb))
+ return -EROFS;
+
+ ret = kstrtou64(buf, 0, &type);
+ if (ret < 0)
+ return -EINVAL;
+
+ /* for now, allow to touch only the 'allocation hint' bits */
+ if (type & ~((1 << BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1))
+ return -EINVAL;
+
+ trans = btrfs_start_transaction(root, 1);
+ if (IS_ERR(trans))
+ return PTR_ERR(trans);
+
+ prev_type = device->type;
+ device->type = type;
+
+ ret = btrfs_update_device(trans, device);
+
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ goto abort;
+ }
+
+ ret = btrfs_commit_transaction(trans);
+ if (ret < 0)
+ goto abort;
+
+ return len;
+abort:
+ device->type = prev_type;
+ return ret;
+}
+BTRFS_ATTR_RW(devid, type, btrfs_devinfo_type_show, btrfs_devinfo_type_store);
/*
* Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b9ef6f54635ca5..4eef6fb1d67132 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2856,7 +2856,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
return ret;
}
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
struct btrfs_device *device)
{
int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2128a032c3b74d..917d75b7b983e3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -745,6 +745,8 @@ int btrfs_bg_type_to_factor(u64 flags);
const char *btrfs_bg_type_to_raid_name(u64 flags);
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device);
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
u8 *btrfs_sb_fsid_ptr(struct btrfs_super_block *sb);
From 01a6de57d05bf598d426b737ae0c08917cd4f677 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:07 +0200
Subject: [PATCH 4/4] btrfs: add allocator_hint mode
When this mode is enabled, the chunk allocation policy is modified as
follow.
Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This means also that it is a preferred choice.
Each time the allocator allocates a chunk of type X , first it takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
if the space is not enough, it uses also the other disks, with the
exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
type of chunk (i.e. not X).
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/volumes.c | 97 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 1 +
2 files changed, 97 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4eef6fb1d67132..959976d34ffa92 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -174,6 +174,19 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
return BTRFS_BG_FLAG_TO_INDEX(profile);
}
+#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
+#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
+
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
+ [BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
+ [BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ /* the other values are set to 0 */
+};
+
const char *btrfs_bg_type_to_raid_name(u64 flags)
{
const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5034,13 +5047,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
}
/*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
*/
static int btrfs_cmp_device_info(const void *a, const void *b)
{
const struct btrfs_device_info *di_a = a;
const struct btrfs_device_info *di_b = b;
+ if (di_a->alloc_hint > di_b->alloc_hint)
+ return -1;
+ if (di_a->alloc_hint < di_b->alloc_hint)
+ return 1;
if (di_a->max_avail > di_b->max_avail)
return -1;
if (di_a->max_avail < di_b->max_avail)
@@ -5193,6 +5211,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
int ndevs = 0;
u64 max_avail;
u64 dev_offset;
+ int hint;
+ int i;
/*
* in the first pass through the devices list, we gather information
@@ -5245,16 +5265,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+
+ if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+ (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+ /*
+ * if mixed bg set all the alloc_hint
+ * fields to the same value, so the sorting
+ * is not affected
+ */
+ devices_info[ndevs].alloc_hint = 0;
+ } else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_METADATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (data disk
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ } else { /* BTRFS_BLOCK_GROUP_METADATA */
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_DATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (metadata hint
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ }
+
++ndevs;
}
ctl->ndevs = ndevs;
+ /*
+ * no devices available
+ */
+ if (!ndevs)
+ return 0;
+
/*
* now sort the devices by hole size / available space
*/
sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
btrfs_cmp_device_info, NULL);
+ /*
+ * select the minimum set of disks grouped by hint that
+ * can host the chunk
+ */
+ ndevs = 0;
+ while (ndevs < ctl->ndevs) {
+ hint = devices_info[ndevs++].alloc_hint;
+ while (ndevs < ctl->ndevs &&
+ devices_info[ndevs].alloc_hint == hint)
+ ndevs++;
+ if (ndevs >= ctl->devs_min)
+ break;
+ }
+
+ BUG_ON(ndevs > ctl->ndevs);
+ ctl->ndevs = ndevs;
+
+ /*
+ * the next layers require the devices_info ordered by
+ * max_avail. If we are returing two (or more) different
+ * group of alloc_hint, this is not always true. So sort
+ * these gain.
+ */
+
+ for (i = 0 ; i < ndevs ; i++)
+ devices_info[i].alloc_hint = 0;
+
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+
return 0;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 917d75b7b983e3..12777e8e2769c8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -493,6 +493,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int alloc_hint;
};
struct btrfs_raid_attr {

View File

@ -0,0 +1,437 @@
From 1946e35846bf03304694504d2d85a9565ca2b558 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:04 +0200
Subject: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
Add the following flags to give an hint about which chunk should be
allocated in which a disk.
The following flags are created:
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
only data chunk allowed
Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
---
include/uapi/linux/btrfs_tree.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc3c32186d7eb1..0261ed5f0adabd 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -557,6 +557,20 @@ struct btrfs_node {
struct btrfs_key_ptr ptrs[];
} __attribute__ ((__packed__));
+/* dev_item.type */
+
+/* btrfs chunk allocation hints */
+#define BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT 3
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_DATA (0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_METADATA (1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* 5..7 are unused values */
+
struct btrfs_dev_item {
/* the internal btrfs device id */
__le64 devid;
From 1072ba33d70ebefd2af5274152b8c16649476517 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:05 +0200
Subject: [PATCH 2/5] btrfs: export dev_item.type in
/sys/fs/btrfs/<uuid>/devinfo/<devid>/type
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/sysfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index b1d1ac25237b7a..561a2a7d78b5ed 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1817,6 +1817,16 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
}
BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+ struct kobj_attribute *a, char *buf)
+{
+ struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+ devid_kobj);
+
+ return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
/*
* Information about one device.
*
@@ -1830,6 +1840,7 @@ static struct attribute *devid_attrs[] = {
BTRFS_ATTR_PTR(devid, replace_target),
BTRFS_ATTR_PTR(devid, scrub_speed_max),
BTRFS_ATTR_PTR(devid, writeable),
+ BTRFS_ATTR_PTR(devid, type),
NULL
};
ATTRIBUTE_GROUPS(devid);
From 95e2eb2e2172f328d382fb02cd10cd54196e713c Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:06 +0200
Subject: [PATCH 3/5] btrfs: change the DEV_ITEM 'type' field via sysfs
Signed-off-by: Kai Krakow <kai@kaishome.de>
---
fs/btrfs/sysfs.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.c | 2 +-
fs/btrfs/volumes.h | 2 ++
3 files changed, 58 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 561a2a7d78b5ed..9246161bfdecc7 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1825,7 +1825,61 @@ static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
}
-BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
+static ssize_t btrfs_devinfo_type_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf, size_t len)
+{
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_root *root;
+ struct btrfs_device *device;
+ int ret;
+ struct btrfs_trans_handle *trans;
+
+ u64 type, prev_type;
+
+ device = container_of(kobj, struct btrfs_device, devid_kobj);
+ fs_info = device->fs_info;
+ if (!fs_info)
+ return -EPERM;
+
+ root = fs_info->chunk_root;
+ if (sb_rdonly(fs_info->sb))
+ return -EROFS;
+
+ ret = kstrtou64(buf, 0, &type);
+ if (ret < 0)
+ return -EINVAL;
+
+ /* for now, allow to touch only the 'allocation hint' bits */
+ if (type & ~((1 << BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1))
+ return -EINVAL;
+
+ trans = btrfs_start_transaction(root, 1);
+ if (IS_ERR(trans))
+ return PTR_ERR(trans);
+
+ prev_type = device->type;
+ device->type = type;
+
+ ret = btrfs_update_device(trans, device);
+
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ goto abort;
+ }
+
+ ret = btrfs_commit_transaction(trans);
+ if (ret < 0)
+ goto abort;
+
+ return len;
+abort:
+ device->type = prev_type;
+ return ret;
+}
+BTRFS_ATTR_RW(devid, type, btrfs_devinfo_type_show, btrfs_devinfo_type_store);
/*
* Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b9ef6f54635ca5..4eef6fb1d67132 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2856,7 +2856,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
return ret;
}
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
struct btrfs_device *device)
{
int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2128a032c3b74d..917d75b7b983e3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -745,6 +745,8 @@ int btrfs_bg_type_to_factor(u64 flags);
const char *btrfs_bg_type_to_raid_name(u64 flags);
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device);
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
u8 *btrfs_sb_fsid_ptr(struct btrfs_super_block *sb);
From 01a6de57d05bf598d426b737ae0c08917cd4f677 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:07 +0200
Subject: [PATCH 4/5] btrfs: add allocator_hint mode
When this mode is enabled, the chunk allocation policy is modified as
follow.
Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This means also that it is a preferred choice.
Each time the allocator allocates a chunk of type X , first it takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
if the space is not enough, it uses also the other disks, with the
exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
type of chunk (i.e. not X).
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/volumes.c | 97 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 1 +
2 files changed, 97 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4eef6fb1d67132..959976d34ffa92 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -174,6 +174,19 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
return BTRFS_BG_FLAG_TO_INDEX(profile);
}
+#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
+#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
+
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
+ [BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
+ [BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ /* the other values are set to 0 */
+};
+
const char *btrfs_bg_type_to_raid_name(u64 flags)
{
const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5034,13 +5047,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
}
/*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
*/
static int btrfs_cmp_device_info(const void *a, const void *b)
{
const struct btrfs_device_info *di_a = a;
const struct btrfs_device_info *di_b = b;
+ if (di_a->alloc_hint > di_b->alloc_hint)
+ return -1;
+ if (di_a->alloc_hint < di_b->alloc_hint)
+ return 1;
if (di_a->max_avail > di_b->max_avail)
return -1;
if (di_a->max_avail < di_b->max_avail)
@@ -5193,6 +5211,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
int ndevs = 0;
u64 max_avail;
u64 dev_offset;
+ int hint;
+ int i;
/*
* in the first pass through the devices list, we gather information
@@ -5245,16 +5265,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+
+ if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+ (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+ /*
+ * if mixed bg set all the alloc_hint
+ * fields to the same value, so the sorting
+ * is not affected
+ */
+ devices_info[ndevs].alloc_hint = 0;
+ } else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_METADATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (data disk
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ } else { /* BTRFS_BLOCK_GROUP_METADATA */
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_DATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (metadata hint
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ }
+
++ndevs;
}
ctl->ndevs = ndevs;
+ /*
+ * no devices available
+ */
+ if (!ndevs)
+ return 0;
+
/*
* now sort the devices by hole size / available space
*/
sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
btrfs_cmp_device_info, NULL);
+ /*
+ * select the minimum set of disks grouped by hint that
+ * can host the chunk
+ */
+ ndevs = 0;
+ while (ndevs < ctl->ndevs) {
+ hint = devices_info[ndevs++].alloc_hint;
+ while (ndevs < ctl->ndevs &&
+ devices_info[ndevs].alloc_hint == hint)
+ ndevs++;
+ if (ndevs >= ctl->devs_min)
+ break;
+ }
+
+ BUG_ON(ndevs > ctl->ndevs);
+ ctl->ndevs = ndevs;
+
+ /*
+ * the next layers require the devices_info ordered by
+ * max_avail. If we are returing two (or more) different
+ * group of alloc_hint, this is not always true. So sort
+ * these gain.
+ */
+
+ for (i = 0 ; i < ndevs ; i++)
+ devices_info[i].alloc_hint = 0;
+
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+
return 0;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 917d75b7b983e3..12777e8e2769c8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -493,6 +493,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int alloc_hint;
};
struct btrfs_raid_attr {
From 2b554bdebe49d0a673d858acdae673051cf735e1 Mon Sep 17 00:00:00 2001
From: Kai Krakow <kk@netactive.de>
Date: Thu, 27 Jun 2024 20:05:58 +0200
Subject: [PATCH 5/5] btrfs: add allocator_hint for no allocation preferred
This is useful where you want to prevent new allocations of chunks on a
disk which is going to removed from the pool anyways, e.g. due to bad
blocks or because it's slow.
---
fs/btrfs/volumes.c | 6 +++++-
include/uapi/linux/btrfs_tree.h | 2 ++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 959976d34ffa92..0e269b3f11185f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -184,6 +184,7 @@ static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_NONE] = 99,
/* the other values are set to 0 */
};
@@ -5301,7 +5302,10 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
* sort also by hint (metadata hint
* higher priority)
*/
- devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ if (hint == BTRFS_DEV_ALLOCATION_PREFERRED_NONE)
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ else
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
}
++ndevs;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 0261ed5f0adabd..9946e497dc8367 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -569,6 +569,8 @@ struct btrfs_node {
#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
/* only data chunk allowed */
#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* preferred no chunk, but chunks allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_NONE (4ULL)
/* 5..7 are unused values */
struct btrfs_dev_item {

View File

@ -0,0 +1,581 @@
From 72456935fc3fd570c55d0aa1a46f663b9f4032cb Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:04 +0200
Subject: [PATCH 1/6] btrfs: add flags to give an hint to the chunk allocator
Add the following flags to give an hint about which chunk should be
allocated in which a disk.
The following flags are created:
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
only data chunk allowed
Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
---
include/uapi/linux/btrfs_tree.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc3c32186d7eb1..0261ed5f0adabd 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -557,6 +557,20 @@ struct btrfs_node {
struct btrfs_key_ptr ptrs[];
} __attribute__ ((__packed__));
+/* dev_item.type */
+
+/* btrfs chunk allocation hints */
+#define BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT 3
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_DATA (0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_METADATA (1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* 5..7 are unused values */
+
struct btrfs_dev_item {
/* the internal btrfs device id */
__le64 devid;
From 125bbd9c1c917ae2ca4ab11e3c15d11be49f6012 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:05 +0200
Subject: [PATCH 2/6] btrfs: export dev_item.type in
/sys/fs/btrfs/<uuid>/devinfo/<devid>/type
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/sysfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c9198723e4cb73..4364707be8b5ad 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1821,6 +1821,16 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
}
BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+ struct kobj_attribute *a, char *buf)
+{
+ struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+ devid_kobj);
+
+ return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
/*
* Information about one device.
*
@@ -1834,6 +1844,7 @@ static struct attribute *devid_attrs[] = {
BTRFS_ATTR_PTR(devid, replace_target),
BTRFS_ATTR_PTR(devid, scrub_speed_max),
BTRFS_ATTR_PTR(devid, writeable),
+ BTRFS_ATTR_PTR(devid, type),
NULL
};
ATTRIBUTE_GROUPS(devid);
From d7cb7984ed3ccbb9dc1123cffd3ac3f18761631d Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:06 +0200
Subject: [PATCH 3/6] btrfs: change the DEV_ITEM 'type' field via sysfs
Signed-off-by: Kai Krakow <kai@kaishome.de>
---
fs/btrfs/sysfs.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.c | 2 +-
fs/btrfs/volumes.h | 2 ++
3 files changed, 58 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 4364707be8b5ad..1079e5574dd4c5 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1829,7 +1829,61 @@ static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
}
-BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
+static ssize_t btrfs_devinfo_type_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf, size_t len)
+{
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_root *root;
+ struct btrfs_device *device;
+ int ret;
+ struct btrfs_trans_handle *trans;
+
+ u64 type, prev_type;
+
+ device = container_of(kobj, struct btrfs_device, devid_kobj);
+ fs_info = device->fs_info;
+ if (!fs_info)
+ return -EPERM;
+
+ root = fs_info->chunk_root;
+ if (sb_rdonly(fs_info->sb))
+ return -EROFS;
+
+ ret = kstrtou64(buf, 0, &type);
+ if (ret < 0)
+ return -EINVAL;
+
+ /* for now, allow to touch only the 'allocation hint' bits */
+ if (type & ~((1 << BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1))
+ return -EINVAL;
+
+ trans = btrfs_start_transaction(root, 1);
+ if (IS_ERR(trans))
+ return PTR_ERR(trans);
+
+ prev_type = device->type;
+ device->type = type;
+
+ ret = btrfs_update_device(trans, device);
+
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ goto abort;
+ }
+
+ ret = btrfs_commit_transaction(trans);
+ if (ret < 0)
+ goto abort;
+
+ return len;
+abort:
+ device->type = prev_type;
+ return ret;
+}
+BTRFS_ATTR_RW(devid, type, btrfs_devinfo_type_show, btrfs_devinfo_type_store);
/*
* Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d2285c9726e7b1..7778f55e659325 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2863,7 +2863,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
return ret;
}
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
struct btrfs_device *device)
{
int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2128a032c3b74d..917d75b7b983e3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -745,6 +745,8 @@ int btrfs_bg_type_to_factor(u64 flags);
const char *btrfs_bg_type_to_raid_name(u64 flags);
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device);
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
u8 *btrfs_sb_fsid_ptr(struct btrfs_super_block *sb);
From c73c4edad5c1422e4baa991a5c40508e09dfb20e Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:07 +0200
Subject: [PATCH 4/6] btrfs: add allocator_hint mode
When this mode is enabled, the chunk allocation policy is modified as
follow.
Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This means also that it is a preferred choice.
Each time the allocator allocates a chunk of type X , first it takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
if the space is not enough, it uses also the other disks, with the
exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
type of chunk (i.e. not X).
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/volumes.c | 97 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 1 +
2 files changed, 97 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7778f55e659325..a5bcad391276a5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -174,6 +174,19 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
return BTRFS_BG_FLAG_TO_INDEX(profile);
}
+#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
+#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
+
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
+ [BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
+ [BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ /* the other values are set to 0 */
+};
+
const char *btrfs_bg_type_to_raid_name(u64 flags)
{
const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5053,13 +5066,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
}
/*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
*/
static int btrfs_cmp_device_info(const void *a, const void *b)
{
const struct btrfs_device_info *di_a = a;
const struct btrfs_device_info *di_b = b;
+ if (di_a->alloc_hint > di_b->alloc_hint)
+ return -1;
+ if (di_a->alloc_hint < di_b->alloc_hint)
+ return 1;
if (di_a->max_avail > di_b->max_avail)
return -1;
if (di_a->max_avail < di_b->max_avail)
@@ -5212,6 +5230,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
int ndevs = 0;
u64 max_avail;
u64 dev_offset;
+ int hint;
+ int i;
/*
* in the first pass through the devices list, we gather information
@@ -5264,16 +5284,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+
+ if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+ (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+ /*
+ * if mixed bg set all the alloc_hint
+ * fields to the same value, so the sorting
+ * is not affected
+ */
+ devices_info[ndevs].alloc_hint = 0;
+ } else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_METADATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (data disk
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ } else { /* BTRFS_BLOCK_GROUP_METADATA */
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_DATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (metadata hint
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ }
+
++ndevs;
}
ctl->ndevs = ndevs;
+ /*
+ * no devices available
+ */
+ if (!ndevs)
+ return 0;
+
/*
* now sort the devices by hole size / available space
*/
sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
btrfs_cmp_device_info, NULL);
+ /*
+ * select the minimum set of disks grouped by hint that
+ * can host the chunk
+ */
+ ndevs = 0;
+ while (ndevs < ctl->ndevs) {
+ hint = devices_info[ndevs++].alloc_hint;
+ while (ndevs < ctl->ndevs &&
+ devices_info[ndevs].alloc_hint == hint)
+ ndevs++;
+ if (ndevs >= ctl->devs_min)
+ break;
+ }
+
+ BUG_ON(ndevs > ctl->ndevs);
+ ctl->ndevs = ndevs;
+
+ /*
+ * the next layers require the devices_info ordered by
+ * max_avail. If we are returing two (or more) different
+ * group of alloc_hint, this is not always true. So sort
+ * these gain.
+ */
+
+ for (i = 0 ; i < ndevs ; i++)
+ devices_info[i].alloc_hint = 0;
+
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+
return 0;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 917d75b7b983e3..12777e8e2769c8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -493,6 +493,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int alloc_hint;
};
struct btrfs_raid_attr {
From 0d266c390e9bb843c3b63747dab2e7c80979e7b2 Mon Sep 17 00:00:00 2001
From: Kai Krakow <kk@netactive.de>
Date: Thu, 27 Jun 2024 20:05:58 +0200
Subject: [PATCH 5/6] btrfs: add allocator_hint for no allocation preferred
This is useful where you want to prevent new allocations of chunks on a
disk which is going to removed from the pool anyways, e.g. due to bad
blocks or because it's slow.
---
fs/btrfs/volumes.c | 6 +++++-
include/uapi/linux/btrfs_tree.h | 2 ++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a5bcad391276a5..ffcb6ee7c50ee2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -184,6 +184,7 @@ static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_NONE] = 99,
/* the other values are set to 0 */
};
@@ -5320,7 +5321,10 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
* sort also by hint (metadata hint
* higher priority)
*/
- devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ if (hint == BTRFS_DEV_ALLOCATION_PREFERRED_NONE)
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ else
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
}
++ndevs;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 0261ed5f0adabd..9946e497dc8367 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -569,6 +569,8 @@ struct btrfs_node {
#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
/* only data chunk allowed */
#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* preferred no chunk, but chunks allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_NONE (4ULL)
/* 5..7 are unused values */
struct btrfs_dev_item {
From 76e1b3eeb9a99149bf171453f2f7ced1040e3c41 Mon Sep 17 00:00:00 2001
From: Qu Wenruo <wqu@suse.com>
Date: Mon, 15 Jul 2024 16:07:07 +0930
Subject: [PATCH 6/6] btrfs: tree-checker: validate dref root and objectid
Not yet upstreamed.
[CORRUPTION]
There is a bug report that btrfs flips RO due to a corruption in the
extent tree, the involved dumps looks like this:
item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79
extent refs 3 gen 3678544 flags 1
ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1
ref#1: shared data backref parent 1947073626112 count 1
ref#2: shared data backref parent 1156030103552 count 1
BTRFS critical (device vdc1: state EA): unable to find ref byte nr 402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189
BTRFS error (device vdc1: state EA): failed to run delayed ref for logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2
[CAUSE]
The corrupted entry is ref#0 of item 188.
The root number 13835058055282163977 is beyond the upper limit for root
items (the current limit is 1 << 48), and the objectid also looks
suspicious.
Only the offset and count is correct.
[ENHANCEMENT]
Although it's still unknown why we have such many bytes corrupted
randomly, we can still enhance the tree-checker for data backrefs by:
- Validate the root value
For now there should only be 3 types of roots can have data backref:
* subvolume trees
* data reloc trees
* root tree
Only for v1 space cache
- validate the objectid value
The objectid should be a valid inode number.
Hopefully we can catch such problem in the future with the new checkers.
Reported-by: Kai Krakow <hurikhan77@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#t
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
fs/btrfs/tree-checker.c | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 5d6cfa618dc475..f14825f3d4e822 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -1265,6 +1265,19 @@ static void extent_err(const struct extent_buffer *eb, int slot,
va_end(args);
}
+static bool is_valid_dref_root(u64 rootid)
+{
+ /*
+ * The following tree root objectids are allowed to have a data backref:
+ * - subvolume trees
+ * - data reloc tree
+ * - tree root
+ * For v1 space cache
+ */
+ return is_fstree(rootid) || rootid == BTRFS_DATA_RELOC_TREE_OBJECTID ||
+ rootid == BTRFS_ROOT_TREE_OBJECTID;
+}
+
static int check_extent_item(struct extent_buffer *leaf,
struct btrfs_key *key, int slot,
struct btrfs_key *prev_key)
@@ -1417,6 +1430,8 @@ static int check_extent_item(struct extent_buffer *leaf,
struct btrfs_extent_data_ref *dref;
struct btrfs_shared_data_ref *sref;
u64 seq;
+ u64 dref_root;
+ u64 dref_objectid;
u64 dref_offset;
u64 inline_offset;
u8 inline_type;
@@ -1460,11 +1475,26 @@ static int check_extent_item(struct extent_buffer *leaf,
*/
case BTRFS_EXTENT_DATA_REF_KEY:
dref = (struct btrfs_extent_data_ref *)(&iref->offset);
+ dref_root = btrfs_extent_data_ref_root(leaf, dref);
+ dref_objectid = btrfs_extent_data_ref_objectid(leaf, dref);
dref_offset = btrfs_extent_data_ref_offset(leaf, dref);
seq = hash_extent_data_ref(
btrfs_extent_data_ref_root(leaf, dref),
btrfs_extent_data_ref_objectid(leaf, dref),
btrfs_extent_data_ref_offset(leaf, dref));
+ if (unlikely(!is_valid_dref_root(dref_root))) {
+ extent_err(leaf, slot,
+ "invalid data ref root value %llu",
+ dref_root);
+ return -EUCLEAN;
+ }
+ if (unlikely(dref_objectid < BTRFS_FIRST_FREE_OBJECTID ||
+ dref_objectid > BTRFS_LAST_FREE_OBJECTID)) {
+ extent_err(leaf, slot,
+ "invalid data ref objectid value %llu",
+ dref_root);
+ return -EUCLEAN;
+ }
if (unlikely(!IS_ALIGNED(dref_offset,
fs_info->sectorsize))) {
extent_err(leaf, slot,
@@ -1600,6 +1630,8 @@ static int check_extent_data_ref(struct extent_buffer *leaf,
return -EUCLEAN;
}
for (; ptr < end; ptr += sizeof(*dref)) {
+ u64 root;
+ u64 objectid;
u64 offset;
/*
@@ -1607,7 +1639,22 @@ static int check_extent_data_ref(struct extent_buffer *leaf,
* overflow from the leaf due to hash collisions.
*/
dref = (struct btrfs_extent_data_ref *)ptr;
+ root = btrfs_extent_data_ref_root(leaf, dref);
+ objectid = btrfs_extent_data_ref_objectid(leaf, dref);
offset = btrfs_extent_data_ref_offset(leaf, dref);
+ if (unlikely(!is_valid_dref_root(root))) {
+ extent_err(leaf, slot,
+ "invalid extent data backref root value %llu",
+ root);
+ return -EUCLEAN;
+ }
+ if (unlikely(objectid < BTRFS_FIRST_FREE_OBJECTID ||
+ objectid > BTRFS_LAST_FREE_OBJECTID)) {
+ extent_err(leaf, slot,
+ "invalid extent data backref objectid value %llu",
+ root);
+ return -EUCLEAN;
+ }
if (unlikely(!IS_ALIGNED(offset, leaf->fs_info->sectorsize))) {
extent_err(leaf, slot,
"invalid extent data backref offset, have %llu expect aligned to %u",

View File

@ -0,0 +1,439 @@
From e7913f202653e4141c3e0eae6a34db3d582135af Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:04 +0200
Subject: [PATCH 1/5] btrfs: add flags to give an hint to the chunk allocator
Add the following flags to give an hint about which chunk should be
allocated in which a disk.
The following flags are created:
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA
preferred data chunk, but metadata chunk allowed
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
preferred metadata chunk, but data chunk allowed
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
only metadata chunk allowed
- BTRFS_DEV_ALLOCATION_DATA_ONLY
only data chunk allowed
Signed-off-by: Goffredo Baroncelli <kreijack@inwid.it>
---
include/uapi/linux/btrfs_tree.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fc3c32186d7eb1..0261ed5f0adabd 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -557,6 +557,20 @@ struct btrfs_node {
struct btrfs_key_ptr ptrs[];
} __attribute__ ((__packed__));
+/* dev_item.type */
+
+/* btrfs chunk allocation hints */
+#define BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT 3
+/* preferred data chunk, but metadata chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_DATA (0ULL)
+/* preferred metadata chunk, but data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_METADATA (1ULL)
+/* only metadata chunk are allowed */
+#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
+/* only data chunk allowed */
+#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* 5..7 are unused values */
+
struct btrfs_dev_item {
/* the internal btrfs device id */
__le64 devid;
From e53763394dab09536a53bb0170ec1ac43d0361f3 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:05 +0200
Subject: [PATCH 2/5] btrfs: export dev_item.type in
/sys/fs/btrfs/<uuid>/devinfo/<devid>/type
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/sysfs.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c9198723e4cb73..4364707be8b5ad 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1821,6 +1821,16 @@ static ssize_t btrfs_devinfo_error_stats_show(struct kobject *kobj,
}
BTRFS_ATTR(devid, error_stats, btrfs_devinfo_error_stats_show);
+static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
+ struct kobj_attribute *a, char *buf)
+{
+ struct btrfs_device *device = container_of(kobj, struct btrfs_device,
+ devid_kobj);
+
+ return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
+}
+BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
/*
* Information about one device.
*
@@ -1834,6 +1844,7 @@ static struct attribute *devid_attrs[] = {
BTRFS_ATTR_PTR(devid, replace_target),
BTRFS_ATTR_PTR(devid, scrub_speed_max),
BTRFS_ATTR_PTR(devid, writeable),
+ BTRFS_ATTR_PTR(devid, type),
NULL
};
ATTRIBUTE_GROUPS(devid);
From e438af32dc5efcccb937b2c2211aa2ae17580d8f Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:06 +0200
Subject: [PATCH 3/5] btrfs: change the DEV_ITEM 'type' field via sysfs
Signed-off-by: Kai Krakow <kai@kaishome.de>
---
fs/btrfs/sysfs.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.c | 2 +-
fs/btrfs/volumes.h | 2 ++
3 files changed, 58 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 4364707be8b5ad..1079e5574dd4c5 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -1829,7 +1829,61 @@ static ssize_t btrfs_devinfo_type_show(struct kobject *kobj,
return scnprintf(buf, PAGE_SIZE, "0x%08llx\n", device->type);
}
-BTRFS_ATTR(devid, type, btrfs_devinfo_type_show);
+
+static ssize_t btrfs_devinfo_type_store(struct kobject *kobj,
+ struct kobj_attribute *a,
+ const char *buf, size_t len)
+{
+ struct btrfs_fs_info *fs_info;
+ struct btrfs_root *root;
+ struct btrfs_device *device;
+ int ret;
+ struct btrfs_trans_handle *trans;
+
+ u64 type, prev_type;
+
+ device = container_of(kobj, struct btrfs_device, devid_kobj);
+ fs_info = device->fs_info;
+ if (!fs_info)
+ return -EPERM;
+
+ root = fs_info->chunk_root;
+ if (sb_rdonly(fs_info->sb))
+ return -EROFS;
+
+ ret = kstrtou64(buf, 0, &type);
+ if (ret < 0)
+ return -EINVAL;
+
+ /* for now, allow to touch only the 'allocation hint' bits */
+ if (type & ~((1 << BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1))
+ return -EINVAL;
+
+ trans = btrfs_start_transaction(root, 1);
+ if (IS_ERR(trans))
+ return PTR_ERR(trans);
+
+ prev_type = device->type;
+ device->type = type;
+
+ ret = btrfs_update_device(trans, device);
+
+ if (ret < 0) {
+ btrfs_abort_transaction(trans, ret);
+ btrfs_end_transaction(trans);
+ goto abort;
+ }
+
+ ret = btrfs_commit_transaction(trans);
+ if (ret < 0)
+ goto abort;
+
+ return len;
+abort:
+ device->type = prev_type;
+ return ret;
+}
+BTRFS_ATTR_RW(devid, type, btrfs_devinfo_type_show, btrfs_devinfo_type_store);
/*
* Information about one device.
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d2285c9726e7b1..7778f55e659325 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2863,7 +2863,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
return ret;
}
-static noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
+noinline int btrfs_update_device(struct btrfs_trans_handle *trans,
struct btrfs_device *device)
{
int ret;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2128a032c3b74d..917d75b7b983e3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -745,6 +745,8 @@ int btrfs_bg_type_to_factor(u64 flags);
const char *btrfs_bg_type_to_raid_name(u64 flags);
int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
bool btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
+int btrfs_update_device(struct btrfs_trans_handle *trans,
+ struct btrfs_device *device);
bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr);
u8 *btrfs_sb_fsid_ptr(struct btrfs_super_block *sb);
From 0e46a868a1b099a9420fa4f44f308534de599b08 Mon Sep 17 00:00:00 2001
From: Goffredo Baroncelli <kreijack@inwind.it>
Date: Sun, 24 Oct 2021 17:31:07 +0200
Subject: [PATCH 4/5] btrfs: add allocator_hint mode
When this mode is enabled, the chunk allocation policy is modified as
follow.
Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)
Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for
the X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type.
This means also that it is a preferred choice.
Each time the allocator allocates a chunk of type X , first it takes the
disks tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space
is not enough, it uses also the disks tagged as ALLOCATION_METADATA_ONLY;
if the space is not enough, it uses also the other disks, with the
exception of the one marked as ALLOCATION_PREFERRED_Y, where Y the other
type of chunk (i.e. not X).
Signed-off-by: Goffredo Baroncelli <kreijack@inwind.it>
---
fs/btrfs/volumes.c | 97 +++++++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/volumes.h | 1 +
2 files changed, 97 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 7778f55e659325..a5bcad391276a5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -174,6 +174,19 @@ enum btrfs_raid_types __attribute_const__ btrfs_bg_flags_to_raid_index(u64 flags
return BTRFS_BG_FLAG_TO_INDEX(profile);
}
+#define BTRFS_DEV_ALLOCATION_MASK ((1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT) - 1)
+#define BTRFS_DEV_ALLOCATION_MASK_COUNT (1ULL << \
+ BTRFS_DEV_ALLOCATION_MASK_BIT_COUNT)
+
+static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
+ [BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
+ [BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ /* the other values are set to 0 */
+};
+
const char *btrfs_bg_type_to_raid_name(u64 flags)
{
const int index = btrfs_bg_flags_to_raid_index(flags);
@@ -5053,13 +5066,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
}
/*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
*/
static int btrfs_cmp_device_info(const void *a, const void *b)
{
const struct btrfs_device_info *di_a = a;
const struct btrfs_device_info *di_b = b;
+ if (di_a->alloc_hint > di_b->alloc_hint)
+ return -1;
+ if (di_a->alloc_hint < di_b->alloc_hint)
+ return 1;
if (di_a->max_avail > di_b->max_avail)
return -1;
if (di_a->max_avail < di_b->max_avail)
@@ -5212,6 +5230,8 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
int ndevs = 0;
u64 max_avail;
u64 dev_offset;
+ int hint;
+ int i;
/*
* in the first pass through the devices list, we gather information
@@ -5264,16 +5284,91 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+
+ if ((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+ (ctl->type & BTRFS_BLOCK_GROUP_METADATA)) {
+ /*
+ * if mixed bg set all the alloc_hint
+ * fields to the same value, so the sorting
+ * is not affected
+ */
+ devices_info[ndevs].alloc_hint = 0;
+ } else if (ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_METADATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (data disk
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ } else { /* BTRFS_BLOCK_GROUP_METADATA */
+ hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+ /*
+ * skip BTRFS_DEV_DATA_ONLY disks
+ */
+ if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
+ continue;
+ /*
+ * if a data chunk must be allocated,
+ * sort also by hint (metadata hint
+ * higher priority)
+ */
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ }
+
++ndevs;
}
ctl->ndevs = ndevs;
+ /*
+ * no devices available
+ */
+ if (!ndevs)
+ return 0;
+
/*
* now sort the devices by hole size / available space
*/
sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
btrfs_cmp_device_info, NULL);
+ /*
+ * select the minimum set of disks grouped by hint that
+ * can host the chunk
+ */
+ ndevs = 0;
+ while (ndevs < ctl->ndevs) {
+ hint = devices_info[ndevs++].alloc_hint;
+ while (ndevs < ctl->ndevs &&
+ devices_info[ndevs].alloc_hint == hint)
+ ndevs++;
+ if (ndevs >= ctl->devs_min)
+ break;
+ }
+
+ BUG_ON(ndevs > ctl->ndevs);
+ ctl->ndevs = ndevs;
+
+ /*
+ * the next layers require the devices_info ordered by
+ * max_avail. If we are returing two (or more) different
+ * group of alloc_hint, this is not always true. So sort
+ * these gain.
+ */
+
+ for (i = 0 ; i < ndevs ; i++)
+ devices_info[i].alloc_hint = 0;
+
+ sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
+ btrfs_cmp_device_info, NULL);
+
return 0;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 917d75b7b983e3..12777e8e2769c8 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -493,6 +493,7 @@ struct btrfs_device_info {
u64 dev_offset;
u64 max_avail;
u64 total_avail;
+ int alloc_hint;
};
struct btrfs_raid_attr {
From 186029a410e9bf8df21f6d7af3d8d6ce6ff01c84 Mon Sep 17 00:00:00 2001
From: Kai Krakow <kk@netactive.de>
Date: Thu, 27 Jun 2024 20:05:58 +0200
Subject: [PATCH 5/5] btrfs: add allocator_hint for no allocation preferred
This is useful where you want to prevent new allocations of chunks on a
disk which is going to removed from the pool anyways, e.g. due to bad
blocks or because it's slow.
Signed-off-by: Kai Krakow <kai@kaishome.de>
---
fs/btrfs/volumes.c | 6 +++++-
include/uapi/linux/btrfs_tree.h | 2 ++
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a5bcad391276a5..ffcb6ee7c50ee2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -184,6 +184,7 @@ static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
+ [BTRFS_DEV_ALLOCATION_PREFERRED_NONE] = 99,
/* the other values are set to 0 */
};
@@ -5320,7 +5321,10 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
* sort also by hint (metadata hint
* higher priority)
*/
- devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
+ if (hint == BTRFS_DEV_ALLOCATION_PREFERRED_NONE)
+ devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+ else
+ devices_info[ndevs].alloc_hint = alloc_hint_map[hint];
}
++ndevs;
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 0261ed5f0adabd..9946e497dc8367 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -569,6 +569,8 @@ struct btrfs_node {
#define BTRFS_DEV_ALLOCATION_METADATA_ONLY (2ULL)
/* only data chunk allowed */
#define BTRFS_DEV_ALLOCATION_DATA_ONLY (3ULL)
+/* preferred no chunk, but chunks allowed */
+#define BTRFS_DEV_ALLOCATION_PREFERRED_NONE (4ULL)
/* 5..7 are unused values */
struct btrfs_dev_item {