
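Shuicheng Yan and Ming-Ming Cheng's teams open-source ViP: a three-dimensional information encoding mechanism, no convolution required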

Published by 極市平臺 in General
2022-01-06



Author | Happy

Reviewer | 鄧富城

Editor | 極市平臺

paper: https://arxiv.org/abs/2106.12368

code: https://github.com/Andrew-Qibin/VisionPermutator

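This article covers an exploration of MLP architectures by the teams of Shuicheng Yan and Ming-Ming Cheng. Starting from positional information encoding, the work introduces a three-dimensional encoding mechanism over height, width, and channel; to further recalibrate the contribution of each branch, it proposes a weighted fusion scheme (that is, an attention mechanism). The idea closely resembles the first author's earlier "TripletAttention"; the difference is that TripletAttention targets CNNs, whereas this work targets MLP architectures. Either way, ViP pushes the performance of MLP-like architectures a step further, making them competitive with CNNs and Transformers.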

Abstract

The paper proposes Vision Permutator (ViP), a conceptually simple and data-efficient MLP-like architecture for visual recognition.

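Recognizing the importance of the positional information carried by 2D feature representations, ViP encodes features along the height and width dimensions separately, using linear projections.

This lets ViP capture long-range dependencies along one spatial direction while preserving precise positional information along the other. The resulting features are then aggregated in a mutually complementary way to produce position-sensitive outputs, which form a powerful representation of the objects of interest.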
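To make the "long-range along one direction, precise position along the other" claim concrete, here is a minimal NumPy sketch. It is a simplification, not the ViP implementation: the matrix `mix_h` is a hypothetical, fixed stand-in for a learned projection, and it mixes along height only (the real height branch also mixes channel segments). Perturbing one input position changes every row of the output, but only within the same column.

```python
import numpy as np

H, W, C = 4, 5, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((H, W, C))

# Hypothetical height-mixing matrix (H x H) with no zero entries;
# in ViP this role is played by a learned linear projection.
mix_h = np.arange(1, H * H + 1, dtype=np.float64).reshape(H, H) / 10.0


def encode_height(feat):
    # out[i, w, c] = sum_j mix_h[i, j] * feat[j, w, c]
    return np.einsum("ij,jwc->iwc", mix_h, feat)


y = encode_height(x)

# Perturb a single spatial position: row 1, column 2.
x_perturbed = x.copy()
x_perturbed[1, 2, :] += 1.0

# Boolean (H, W) mask of output positions that changed.
changed = np.abs(encode_height(x_perturbed) - y).sum(axis=-1) > 1e-9
rows, cols = np.nonzero(changed)
changed_rows = sorted(set(rows.tolist()))
changed_columns = sorted(set(cols.tolist()))

print("rows affected:", changed_rows)        # every row: long-range along height
print("columns affected:", changed_columns)  # one column: width position preserved
```

The width encoding is the mirror image: it mixes along the width axis and leaves the row index untouched.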

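The authors show experimentally that ViP is competitive with CNNs and Transformers. Without spatial convolutions or attention, and without extra large-scale training data, ViP reaches 81.5% top-1 accuracy on ImageNet with only 25M learnable parameters, which is better than most CNNs and Transformers. Scaling the model up to 88M parameters raises the accuracy further to 83.2%. The authors hope this work encourages the community to rethink how spatial information is encoded and helps guide the design of MLP-like methods.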

Method

Permutator

Permute-MLP

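As its structure diagram shows, Permute-MLP consists of three branches, each encoding a different kind of information: height, width, and channel. Encoding channel information is straightforward, since a fully connected layer performing a linear projection is all that is needed. The key question is how to encode spatial information along the height and width dimensions.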
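The height branch can be traced shape by shape. The sketch below uses NumPy rather than PyTorch, mirrors the reshape/permute sequence used in the article's code, and replaces the learned projection with an identity matrix (an assumption made purely so the round trip can be checked). The sizes are illustrative and chosen so that `H == segment_dim`, which makes `H * S` equal to the channel count.

```python
import numpy as np

B, H, W, C = 2, 8, 8, 32
segment_dim = 8            # number of channel segments, assumed equal to H
S = C // segment_dim       # channels per segment

x = np.arange(B * H * W * C, dtype=np.float64).reshape(B, H, W, C)

# 1. Split channels into segments: (B, H, W, segment_dim, S)
# 2. Swap the height axis with the segment axis: (B, segment_dim, W, H, S)
# 3. Merge height and per-segment channels: (B, segment_dim, W, H * S)
tokens = (
    x.reshape(B, H, W, segment_dim, S)
    .transpose(0, 3, 2, 1, 4)
    .reshape(B, segment_dim, W, H * S)
)

# Each token now holds one channel segment for every row of a single column.
proj = np.eye(H * S)       # identity stand-in for the learned linear projection
mixed = tokens @ proj

# Undo the permutation to return to the (B, H, W, C) layout.
restored = (
    mixed.reshape(B, segment_dim, W, H, S)
    .transpose(0, 3, 2, 1, 4)
    .reshape(B, H, W, C)
)

print(tokens.shape)                    # (2, 8, 8, 32)
print(np.array_equal(restored, x))     # True, since the projection is the identity
```

The width branch follows the same pattern with the width axis swapped in place of the height axis, and the channel branch applies a projection directly on the last axis.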

Weighted Permute-MLP

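In the formula above, the three branches are fused by simple addition. Here, Permute-MLP is improved by recalibrating the importance of each branch, which yields Weighted Permute-MLP. The code speaks for itself: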

import torch.nn as nn


class WeightedPermuteMLP(nn.Module):
    # qk_scale and attn_drop are accepted but not used in this module.
    def __init__(self, dim, segment_dim=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.segment_dim = segment_dim

        # One linear projection per branch: channel, height, width.
        self.mlp_c = nn.Linear(dim, dim, bias=qkv_bias)
        self.mlp_h = nn.Linear(dim, dim, bias=qkv_bias)
        self.mlp_w = nn.Linear(dim, dim, bias=qkv_bias)

        # Produces three weights per channel, one for each branch.
        self.reweight = Mlp(dim, dim // 4, dim * 3)

        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        # x: (B, H, W, C). The reshapes below require H * S == C and W * S == C.
        B, H, W, C = x.shape
        S = C // self.segment_dim

        # Height branch: swap the height axis with the segment axis, project, swap back.
        h = x.reshape(B, H, W, self.segment_dim, S).permute(0, 3, 2, 1, 4).reshape(B, self.segment_dim, W, H * S)
        h = self.mlp_h(h).reshape(B, self.segment_dim, W, H, S).permute(0, 3, 2, 1, 4).reshape(B, H, W, C)

        # Width branch: swap the width axis with the segment axis, project, swap back.
        w = x.reshape(B, H, W, self.segment_dim, S).permute(0, 1, 3, 2, 4).reshape(B, H, self.segment_dim, W * S)
        w = self.mlp_w(w).reshape(B, H, self.segment_dim, W, S).permute(0, 1, 3, 2, 4).reshape(B, H, W, C)

        # Channel branch: plain linear projection over the last axis.
        c = self.mlp_c(x)

        # Global average pool of the summed branches -> (B, C).
        a = (h + w + c).permute(0, 3, 1, 2).flatten(2).mean(2)
        # (B, 3C) -> (3, B, C), softmax across the three branches -> (3, B, 1, 1, C).
        a = self.reweight(a).reshape(B, C, 3).permute(2, 0, 1).softmax(dim=0).unsqueeze(2).unsqueeze(2)

        # Weighted fusion of the three branches.
        x = h * a[0] + w * a[1] + c * a[2]

        x = self.proj(x)
        x = self.proj_drop(x)
        return x


class Mlp(nn.Module):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x
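The reweighting step in the code above can be isolated and checked on its own. The following NumPy sketch is an illustration under stated assumptions, not the authors' code: the branch outputs `h`, `w`, `c` are random placeholders, and `w_reweight` is a hypothetical single linear map standing in for the two-layer `Mlp`. It shows that the softmax yields, for each channel, three weights that sum to one, so the fused output is a per-channel convex combination of the branches.

```python
import numpy as np


def softmax(z, axis):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
B, H, W, C = 2, 4, 4, 6

# Placeholder branch outputs, each of shape (B, H, W, C).
h, w, c = (rng.standard_normal((B, H, W, C)) for _ in range(3))

# Global average pool of the summed branches -> (B, C).
a = (h + w + c).mean(axis=(1, 2))

# Hypothetical stand-in for the reweight MLP: one linear map C -> 3C.
w_reweight = rng.standard_normal((C, 3 * C))
logits = a @ w_reweight                               # (B, 3C)

# (B, 3C) -> (B, C, 3) -> (3, B, C); softmax across the three branches.
weights = softmax(logits.reshape(B, C, 3).transpose(2, 0, 1), axis=0)
weights = weights[:, :, None, None, :]                # (3, B, 1, 1, C)

# Per-channel weighted fusion, broadcast over H and W.
fused = h * weights[0] + w * weights[1] + c * weights[2]

print(fused.shape)                                    # (2, 4, 4, 6)
print(np.allclose(weights.sum(axis=0), 1.0))          # True
```

Because the weights depend on the pooled features, each input sample gets its own branch balance, which is why the article describes this fusion as an attention mechanism.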

Configurations of ViP