PyTorch matrix data normalization: building a dataset in PyTorch

Reposted

archangle 2023-10-19 16:02:56

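Preface

I am currently doing research in deep learning, and the very first step, building the dataset, gave me a lot of trouble. There are some dataset-building methods online, but none of them quite did what I wanted, so I would like to share my own ideas and approach.

This post is aimed at readers who know some deep learning and want to work through the whole process by hand.

Without further ado, here is the code (it is a bit verbose, so please be kind):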

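import torch
from torch.utils.data import Dataset
import torchvision
from PIL import Image


class PATH(Dataset):
    """Image dataset read from a text file with one "image_path-label" entry per line."""

    def __init__(self, txtpath, n_class):
        super().__init__()
        self.imgs = []
        self.n_class = n_class
        # Read the index file once and close it when done.
        with open(txtpath, 'r') as datainfo:
            for line in datainfo:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                # Split on the last '-' so that paths containing '-' still parse.
                path, label = line.rsplit('-', 1)
                self.imgs.append((path, int(label)))

    def __getitem__(self, index):
        feature, label = self.imgs[index]
        # ValueError rather than IndexError: an IndexError raised here would
        # silently end a plain `for sample in dataset` loop.
        if label < 0 or label >= self.n_class:
            raise ValueError(f"label {label} is outside the range [0, {self.n_class})")
        # One-hot encode the label as a float tensor of shape [n_class].
        one_hot = torch.zeros(self.n_class)
        one_hot[label] = 1.0
        # Load the image as 3-channel RGB (an assumption about the network input),
        # resize it, and convert it to a [C, H, W] float tensor in [0, 1].
        feature = Image.open(feature).convert('RGB')
        feature = feature.resize((224, 224))
        feature = torchvision.transforms.ToTensor()(feature)
        return feature, one_hot

    def __len__(self):
        return len(self.imgs)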
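
The class above expects an index file with one "image_path-label" entry per line. The following is a minimal sketch of that format using only the standard library. The helper name `parse_index_line` and the file names are hypothetical, and splitting on the last '-' is an assumption made so that paths containing hyphens still parse.

```python
import os
import tempfile


def parse_index_line(line):
    """Hypothetical helper: split 'image_path-label' on the LAST '-'."""
    path, label = line.strip().rsplit('-', 1)
    return path, int(label)


# Write a tiny index file in the assumed format, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, 'train.txt')
    with open(txt_path, 'w') as f:
        f.write('images/cat-01.png-0\n')
        f.write('images/dog-07.png-1\n')
    with open(txt_path, 'r') as f:
        entries = [parse_index_line(line) for line in f if line.strip()]

print(entries)
```

Each entry is a (path, integer label) pair, which is the form the dataset class stores before loading the image itself.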

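The code above defines the dataset class. It subclasses torch.utils.data.Dataset and overrides __getitem__ and __len__. With it, we can load our own data the same way we would load one of the datasets bundled with torch: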

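from torch.utils.data import DataLoader

# train_txt, test_txt, val_txt, n_class and batch_size are defined by the caller.
train_data = PATH(train_txt, n_class)
train_data_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
# Evaluation does not depend on sample order, so shuffling is turned off here.
test_data = PATH(test_txt, n_class)
test_data_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)
val_data = PATH(val_txt, n_class)
val_data_loader = DataLoader(val_data, batch_size=batch_size, shuffle=False)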

With that, we have a dataset that a neural network instance can be trained on.
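
One way to check what such a loader yields is to pull a single batch and look at its shapes. The sketch below is only an illustration: it uses random tensors in a TensorDataset as a stand-in for PATH so that it runs without any image files, and the values of `n_class` and `batch_size` are assumed.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

n_class = 4      # assumed number of classes
batch_size = 8   # assumed batch size

# Stand-in data: 20 random "images" of shape [3, 224, 224] with one-hot labels.
features = torch.rand(20, 3, 224, 224)
labels = torch.eye(n_class)[torch.randint(0, n_class, (20,))]
loader = DataLoader(TensorDataset(features, labels), batch_size=batch_size, shuffle=True)

# Pull a single batch and inspect it.
x, y = next(iter(loader))
print(x.shape)  # feature batch: [batch_size, 3, 224, 224]
print(y.shape)  # label batch:   [batch_size, n_class]
```

The default collate function stacks the per-sample tensors along a new leading batch dimension, which is why every sample must come out of the dataset with the same shape.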

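What does a dataset need to provide?

1. Since we are building the dataset within the PyTorch framework, the first requirement is that both feature and label are tensors.

2. The dataset can be inspected and manipulated as our needs require.

3. For classification problems, it is best to store the label as a one-hot encoded vector (built as a list or array and then converted to a tensor).

A one-hot encoding has exactly one entry equal to 1 and all other entries equal to 0, across all entries (n_class of them in a classification problem). With four classes there are four possible vectors: [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0] and [0, 0, 0, 1], one for each label.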
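
The four-class case can be reproduced in a few lines. This sketch uses torch.nn.functional.one_hot rather than the manual zero-array approach from the dataset class; it is an alternative shown for illustration, not the method used above.

```python
import torch
import torch.nn.functional as F

n_class = 4
labels = torch.tensor([0, 1, 2, 3])  # integer class labels

# F.one_hot returns an integer tensor; cast to float for use as a loss target.
one_hot = F.one_hot(labels, num_classes=n_class).float()
print(one_hot)

# argmax along the class dimension recovers the original integer labels.
recovered = one_hot.argmax(dim=1)
print(recovered)
```

Each row of `one_hot` is one of the four vectors listed above, and `argmax` maps it back to its class index.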

Why choose one-hot encoding?

In classification problems, cross-entropy is one of the most commonly used losses, as shown below:

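import torch
from torch import nn

loss = nn.CrossEntropyLoss()


def operator(y_pred, y, loss):
    # y_pred: raw network outputs; y: one-hot float labels; both [batch_size, n_class].
    l = loss(y_pred, y)
    # argmax(dim=1) returns the index of the largest entry along the class dimension.
    y_pred_classes = torch.argmax(y_pred, dim=1)
    y_classes = torch.argmax(y, dim=1)
    samples_number = y_pred.shape[0]
    # Count the samples whose predicted class matches the labeled class.
    correct_number = (y_pred_classes == y_classes).sum().item()
    acc = correct_number / samples_number
    return l, acc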

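y_pred is the prediction computed by the neural network, one score per class. With mini-batch training, y_pred.shape = [batch_size, n_class]. When the target is given as one-hot vectors, as it is here, y must have the same shape as y_pred so that the two can be matched entry by entry. (nn.CrossEntropyLoss also accepts integer class indices of shape [batch_size] as the target; that form is not used in this post.)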

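So the one-hot design also gives y.shape = [batch_size, n_class]. There are other ways to force the two variables into the same shape, such as y_pred = y_pred.view(-1), which flattens away a dimension, but the result is not accurate. (Why it is inaccurate is left open for now; I cannot explain it clearly yet and need to think about how to phrase it.)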
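
To make the shape argument concrete, the sketch below feeds a one-hot target of shape [batch_size, n_class] to nn.CrossEntropyLoss and compares the result with the formula written out by hand. It assumes a PyTorch version recent enough to accept floating-point class-probability targets; the batch size, class count and random values are made up for illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, n_class = 5, 4

y_pred = torch.randn(batch_size, n_class)               # raw network outputs (logits)
class_idx = torch.randint(0, n_class, (batch_size,))    # integer labels
y = F.one_hot(class_idx, num_classes=n_class).float()   # one-hot, same shape as y_pred

loss_fn = nn.CrossEntropyLoss()
loss_one_hot = loss_fn(y_pred, y)

# The same quantity written out: batch mean of -sum(y * log_softmax(y_pred)).
loss_manual = -(y * F.log_softmax(y_pred, dim=1)).sum(dim=1).mean()

# Integer class indices as the target give the same value.
loss_indices = loss_fn(y_pred, class_idx)

print(loss_one_hot.item(), loss_manual.item(), loss_indices.item())
```

Because the one-hot row is zero everywhere except at the true class, the sum over classes picks out exactly one log-probability per sample, which is why the three values agree.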