#rl

Stuart Spence: The latest Karpathy video is a great semi-technical overview of LLMs and other related concepts: #llm #ai #karpathy #chatgpt #llama #gemini #rlhf #rl https://youtu.be/7xTGNNLPyMI?feature=shared
Tero Keski-Valkama: Hear me out: I think applying RL to #LLMs and LMMs is misguided, and we can do much better.

Those #RL algorithms are unsuitable for this; for example, they cannot learn how their decisions affect the eventual rewards, but are instead just pushed toward those rewards through Bellman-style optimization.

Instead, we can simply condition the LLMs on the rewards. The rewards become inputs to the model, not something external to it, so the model learns the actual reward dynamics instead of only being externally forced toward the rewards. The model can then do the credit assignment itself, optimally, without fancy mathematical heuristics.

This isn't a new idea; it comes from goal-conditioned RL and decision transformers.

We can simply run the reasoning trajectories, judge the outcomes, and then prepend the outcome tokens to those trajectories before training the model on them in a batch.

https://arxiv.org/abs/2211.15657
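The recipe in that last paragraph (run trajectories, judge outcomes, prepend outcome tokens, then do ordinary next-token training) can be sketched roughly as follows. This is a minimal illustration under assumptions of my own, not code from the linked paper: the base model, the `OUTCOME_TOKENS` markers, and the `judge_outcome` verifier are all made up for the example.

```python
# Minimal sketch of reward-conditioned (decision-transformer-style) training,
# assuming a generic Hugging Face causal LM. All specifics below (model choice,
# outcome tokens, the toy judge) are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model for the sketch

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Outcome tokens that prefix each trajectory, so the reward is an input the
# model conditions on rather than an external training signal.
OUTCOME_TOKENS = {True: "<|success|>", False: "<|failure|>"}
tokenizer.add_special_tokens(
    {"additional_special_tokens": list(OUTCOME_TOKENS.values())}
)
model.resize_token_embeddings(len(tokenizer))


def judge_outcome(trajectory: str) -> bool:
    """Illustrative stand-in for whatever verifier scores a trajectory."""
    return trajectory.strip().endswith("42")  # toy success criterion


# 1. Collect reasoning trajectories (here just hard-coded strings).
trajectories = [
    "Question: 6 * 7? Reasoning: 6 * 7 = 42. Answer: 42",
    "Question: 6 * 7? Reasoning: 6 * 7 = 48. Answer: 48",
]

# 2. Judge each outcome and prepend the matching outcome token.
conditioned = [OUTCOME_TOKENS[judge_outcome(t)] + " " + t for t in trajectories]

# 3. Ordinary supervised next-token training on the conditioned batch.
batch = tokenizer(conditioned, return_tensors="pt", padding=True)
labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
loss = model(**batch, labels=labels).loss
loss.backward()  # one illustrative gradient step; a real loop would use an optimizer
print(f"conditioned batch loss: {loss.item():.3f}")
```

At inference time you would prepend the success token to the prompt, so the model generates the kind of trajectory that was associated with good outcomes during training.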
Hiker: @puniko Oh, thanks for asking. I'm doing well; #RL (real life) is demanding a lot of attention at the moment, which is why I'm a bit less present here 😉