First published Thu Sep 14, 2023

Inner speech is known as the “little voice in the head” or “thinking in words.” It attracts philosophical attention in part because it is a phenomenon where several topics of perennial interest intersect: language, consciousness, thought, imagery, communication, imagination, and self-knowledge all appear to connect in some way or other to the little voice in the head. Specific questions about inner speech that have exercised philosophers include its similarities to, and differences from, outer speech; its relationship to reasoning and conceptual thought; its broader cognitive roles—especially within metacognition and self-knowledge; and the role it can play in explanations of auditory verbal hallucinations and “thought insertion”.

A more formal characterization of inner speech (yet one that still aims at theoretical neutrality) is to say that inner speech is a mental phenomenon that is both keyed to a natural language and often available to introspection. To say that inner speech is “available to introspection” is to say that each person has an introspective way of knowing about their own inner speech episodes that others lack (Schwitzgebel 2010 [2019]); our access to our own inner speech is—at least often—comparable to our access to others of our conscious mental episodes. To say that inner speech is “keyed to” a natural language is to say that it either occurs in a natural language (like words spoken aloud) or represents words of a natural language (like an audio recording of a speech), or that it does both. In specifying that the language in question is a “natural” language, we mean to include any language one may acquire through learning—such as English, Japanese, or American Sign Language—and to exclude any innate mental languages that may exist (such as a Fodorian [1975] Mentalese, or other innate “language of thought”). This characterization leaves open several questions of controversy including: (1) whether all inner speech is available to introspection; (2) whether inner speech is literally a form of speech, a form of thought, or both, or neither; and (3) whether inner speech occurs in a natural language, or represents items of a natural language, or both.

Inner speech is a subject of study in many distinct disciplines, including neuroscience, speech pathology, developmental psychology, psychiatry, computer science, and linguistics, as well as philosophy. For this reason, there are a variety of distinct theoretical tools and concepts one might use to describe its nature and cognitive roles, with correspondingly distinct aims, methods, and literatures. We focus here on the accounts contemporary philosophers have given of its nature and on the explanatory purposes to which inner speech is most commonly put in philosophical work. Nevertheless, much of the contemporary philosophical work on inner speech is itself interdisciplinary in nature and aims to be consistent with, and informed by, results in allied disciplines, including, especially, experimental psychology, linguistics, and neuroscience. We discuss those sources where relevant to the issues that have exercised philosophers, while directing readers most interested in the empirical work to other reviews, such as Alderson-Day & Fernyhough (2015), Langland-Hassan (2021), and Perrone-Bertolotti et al. (2014). For another philosophically-oriented review, see Vicente & Martínez-Manrique (2011). Also, we focus on inner speech linked to the auditory modality, as opposed to inner speech that may occur in a gestural or visual modality (as may be the case with gestural sign languages), because nearly all of the existing research on inner speech concerns a phenomenon that is in some way linked to audition.

Finally, the phenomenon of speaking to oneself audibly is usually referred to as “private speech”. Some parts of the discussion in this entry are easily transferable to private speech, e.g., the matter of whether we can perform speech acts in inner speech. Others are not. For example, the question of whether inner speech, as a mental phenomenon, is actually a kind of speech has no counterpart in the context of private speech, as private speech is uncontroversially a kind of speech. It is typically easy to determine whether a question about inner speech also applies to private speech, so we will not comment on this further.

1. Inner Speech as Actual Speech

The auditory-sensory character of inner speech is usually thought to be due to its involvement of auditory-verbal imagery (for an exception, see O’Brien 2013). Mental images (in any modality) are generally viewed as representations of particular things (or kinds of thing), not instances of those things. A visual image of a duck, for example, is a representation of a duck, not an actual duck. Likewise, it may seem that inner speech, insofar as it involves auditory imagery, is a representation of speech (and of its sounds, in particular), not actual speech. “Grass is green”, produced in inner speech, would then represent an utterance of the sentence, “Grass is green”, but it would not actually be an utterance of that sentence.

Notwithstanding this, many philosophers working on inner speech hold that inner speech really is a kind of speech. When we produce inner speech, we are literally speaking, albeit silently. We will call this view the “actual speech view”. Proponents include Carruthers (1996), Martínez-Manrique & Vicente (2010, 2015), Gauker (2011, 2018), O’Brien (2013), Jorba & Vicente (2014), Gerrans (2015), Gregory (2016, 2018) (though Gregory has indicated in more recent work (e.g., Gregory forthcoming) that he no longer holds the view), Machery (2018), Wilkinson & Fernyhough (2018), Wilkinson (2020), and Frankfort (2022). Martínez-Manrique & Vicente argued for the view in their 2010 paper; their 2015 paper, discussed in Section 3.3.2, sets out an updated version of their theory which incorporates some further commitments. Historically, the view can be traced at least to the Soviet psychologist, Lev Vygotsky (1934 [1986]), and it was also held by Ryle (1949 [2009]). However, Gauker develops a somewhat different version of the view—one that sharply distinguishes inner speech from the auditory-verbal imagery typically associated with it (see Section 3.2 for discussion)—which has its origins in Sellars (1956).

If inner speech is a kind of speech, instances of inner speech could aptly be called “inner speech utterances”, as producing inner speech would really amount to saying something. However, in order to be neutral on the issue, we will use the term, “inner speech episodes”, in this section and throughout the entry. An important issue for a proponent of the actual speech view is to explain how inner speech can consist of genuine linguistic tokens, given that it seems to be an imagistic phenomenon—where, as noted, the images may appear to be representations of speech sounds. For, even if inner speech consists of images of speech sounds, this only suggests that it consists of representations of linguistic items, not linguistic items themselves.

One style of answer has been offered by Sam Wilkinson (2020), who draws a distinction between imagery and imagination. He holds that sensory imagining is a “personal-level phenomenon”, which has components (2020: 16). One of the components of sensory imagining (as opposed to propositional or “attitudinal” imagining, which is typically assumed to be non-imagistic in nature) is mental imagery. For example, if one sensorily imagines a duck, then one component of this personal-level mental state may be a mental image resembling the appearance of a duck. There might also be other components, such as a stipulation that the image is an image of a duck and not another bird of similar appearance. But, Wilkinson emphasizes, mental imagery can be involved in many personal-level mental attitudes apart from the attitude of imagining, such as remembering, judging, reasoning, and others. “In a similar way”, he claims,

imagery … may be involved in an inner assertion. That does not, however, make the inner assertion simply nothing more than the imagery involved in its production, still less an act of imagination. (2020: 16)

Imagery can play many roles, Wilkinson is saying, and there is no reason that one of those roles should not be as a medium for linguistic tokens. The inner assertion is “a genuine assertion”—an instance of language consisting of imagery.

It might be replied that, although imagery can play a role in many personal-level mental states apart from imagining, it plays a very similar role in all of them, viz., representing how a concrete object (whether actual or possible) appears or sounds. Mental imagery does not tend to play a role similar to that of a linguistic token. So, even if mental imagery is involved in a range of personal-level mental states, it is not obviously well-suited, in the case of inner speech, to play the specific role of actual linguistic tokens.

This challenge might be met by connecting the actual speech view with work on the metaphysics of word tokens, as proposed by Wade Munroe (2022a, 2023). Munroe holds that

what makes something, φ, a token of a word type, w, is that the process of generating φ is explained and guided by one’s (tacit) knowledge of w (or the morphological structure of w), e.g., one’s semantic, syntactic, morphophonological/orthographic, knowledge of w stored in one’s mental lexicon. (2022a: 4)

This allows him to hold that inner speech episodes can involve word tokens, insofar as their generation is guided by the relevant kind of tacit knowledge. (Though Munroe himself does not hold the actual speech view; see Section 3.3.1 for discussion of his view.) Relatedly, J. T. M. Miller (2021) explicitly denies that word tokens are necessarily substances and holds, instead,

that particular or token words are objects, which are bundles of various sorts (most notably semantic, phonetic, orthographic, and grammatical) properties. [sic] (2021: 5737)

One might hold that inner speech episodes in fact consist in such bundles.

Although the matter of how inner speech episodes can involve genuine linguistic tokens is of great importance for the actual speech view, it is only beginning to receive attention. However, several arguments have been given in support of the theory generally. These include the following:

  1. Inner speech may be a developmental descendant of a kind of external speech. Piaget (1923 [1926/1959]) observed that young children have a practice of speaking to themselves aloud. He described this kind of speech as “egocentric speech” (ibid., passim); egocentric speech can be seen as one kind of private speech (see Introduction). Vygotsky (1934 [1986]) presented empirical evidence that inner speech develops in children as they internalize the practice of producing egocentric speech (though see Gregory (forthcoming), who questions this evidence). Vygotsky held that egocentric speech becomes silent, inner speech, but that it does not change in its fundamental nature, so it remains a kind of actual speech (see also Wilkinson & Fernyhough 2018; Wilkinson 2020).
  2. Introspectively, it seems like we can perform speech acts—e.g., make assertions and ask questions—in inner speech. But it would only be possible to perform speech acts in inner speech if inner speech is a kind of speech (Wilkinson 2020; Wilkinson & Fernyhough 2018). (This issue is addressed further in Section 4.1.)
  3. On the face of it, we produce inner speech for purposes such as focusing our attention, motivating ourselves, and evaluating our actions. These correspond to purposes which instances of external speech also often serve: focusing the attention of others, motivating them, and commenting on their actions. There are also parallels in terms of how inner speech episodes and instances of external speech are constructed. Both often take the form of short, sub-sentential items when this is sufficient (e.g., “Here!”, upon finding something which was lost) and more fully elaborated sentences when this is necessary (e.g., when carefully listing the considerations relevant to a difficult decision which needs to be made, whether by oneself or by a group). Marta Jorba, Agustín Vicente, and Fernando Martínez-Manrique have taken these systematic parallels as evidence that inner speech and external speech are simply different types of one phenomenon, namely, speech (Jorba & Vicente 2014; Martínez-Manrique & Vicente 2015).
  4. There seems to be a contrast between imagining speaking and engaging in inner speech, as it is ordinarily understood. This contrast, Gregory (2016) suggests, parallels the contrast between two kinds of external actions which we can perform. When an actor says the lines in their script, what they are producing is a representation of speech that someone else might produce. The actor is, of course, speaking, but they are doing so in the context of a pretense. What the actor is doing contrasts with the speech which they produce in, e.g., an ordinary conversation with someone. The contrast between imagining speaking and producing inner speech seems to map neatly onto the contrast between what the actor does on the stage and what they do in an ordinary conversation. If this is so, then a natural analysis is that the contrast between imagined speech and inner speech is a contrast between a representation of speech and actual speech—which implies that inner speech is a kind of actual speech.

A couple of philosophers who hold the actual speech view but express it in different terms, or who hold very similar positions, should be mentioned. First, Philip Gerrans (2015) describes inner speech as involving “imaginary action” (2015: 296), but he is explicit that, by this, he means only to say that producing inner speech is an action performed covertly. He takes inner speech to involve speaking, but doing so silently.

Second, Johannes Roessler (2016) holds that there are different kinds of inner speech, one of which involves imagining speaking (rather than actually speaking), but in a particular way. He points out that we can imagine things, or imagine doing things, for different purposes. An act of imagining will then be successful to the extent that it achieves the purpose for which it is performed. So, one might, for example, imagine making an assertion, but do so with the intention of imagining making an assertion which is true and relevant to context. Then the act of imagining making the assertion “incurs the same liabilities” (2016: 548) that the act of actually making the assertion would incur. If you are puzzling over some question, and you imagine asserting a possible answer, then the act of imagining will be successful only if you have imagined asserting the correct answer. Although you have only imagined performing the speech act of making an assertion, your imagined assertion will be “in some ways tantamount to an assertion” (2016: 548).

It would be an open position, though not one Roessler takes, that all inner speech episodes could be analyzed in this way. On such a view, inner speech episodes would be something very similar to actual speech, yet without quite being speech acts, and thus without the commitment that producing inner speech involves producing actual linguistic items.

2. Inner Speech and Thought

A second question about inner speech is how it relates to thought. It seems that there must be some relationship, but it is an open question what that relationship is. In general, there are three views about the nature of the relationship: (1) inner speech episodes express thoughts; (2) inner speech episodes facilitate thoughts; and (3) inner speech episodes (at least sometimes) are thoughts of a certain kind.

The views are not mutually exclusive: one can certainly hold that inner speech is related to thought in multiple ways.

2.1 Inner Speech and Thought Expression

Langland-Hassan & Vicente (2018b: 10) observe that the view that inner speech (at least often) expresses thoughts that are distinct from the inner speech episodes themselves coheres with some larger theories about thought and language. If one is attracted to these theories, then one may well also be attracted to the view that inner speech merely expresses thought.

First, there is a natural connection between the language of thought hypothesis, most closely associated with Jerry Fodor (1975), and the view that inner speech expresses thought. On the language of thought hypothesis, our thoughts do take place in a language, but not in a natural language. Rather, our thoughts take place in a kind of mental language, often referred to as “Mentalese”. If the language of thought hypothesis is true, then, insofar as inner speech is keyed to a natural language, it seems that inner speech can at most serve to express the thoughts which occur in the mental language.

Second, on Willem Levelt’s influential theory about language production, speaking involves conveying a pre-existing “message” (1989: passim). The structure of this message is conceptual but not linguistic. Via several stages of processing, natural language sentences (or sub-sentential items) are formulated which, once articulated, express the conceptually structured message with which the process started. If one thinks that inner speech is actually a kind of speech, then one might incline to think that inner speech also expresses a pre-existing message.

Thus, Peter Carruthers (2009, 2018) approaches matters from a Fodorian and Leveltian angle when he proposes that

the first metacognitive access subjects have to the fact that they have a particular belief is via its verbal expression (whether overtly or in inner speech). (2009: 125)

For Carruthers, the inner speech episode is not a belief or judgment itself, but rather the expression thereof (see Section 5.2). In a similar way, Ray Jackendoff (1996, 2007, 2011, 2012) emphasizes the distinction between thought itself and the auditory imagery by which it may be expressed, identifying only the latter with inner speech (see Section 3.1). Likewise, José Luis Bermúdez (2003) and Jesse Prinz (2011) distinguish between conceptual thought itself and inner speech, while holding that we often come to know what we are thinking by attending to inner speech sentences that we might use to express such thoughts. They stop short of explicitly claiming that such sentences actually express thoughts, however, specifying instead that the inner episodes are sentences through which such thoughts “might be expressed” (Bermúdez 2003: 164), or that we “would use” to express them (Prinz 2011: 186) (see Section 5.1).

One can, however, hold that inner speech episodes express thoughts without committing to the view that a thought must be fully-formed prior to the production of the relevant inner speech episode. José Luis Bermúdez (2018), for example, holds that producing an inner speech episode can actually play a role in forming the thought which it expresses. For Bermúdez, a thought can be refined and precisified as an external utterance is being produced and, equally, a thought can be refined and precisified while an inner speech episode is being produced. Nonetheless, by the time an inner speech episode has been produced, it will express an existing thought.

Finally, it is worth noting the following point of contact between the actual speech view, discussed in Section 1, and the question of whether inner speech expresses thought. If it is an essential feature of speech that it serves to express thought, then defenders of the actual speech view are likewise committed to the view that inner speech expresses thought. If, on the other hand, one holds that there can be (inner) speech that does not express thought, then the question arises as to what the difference between (inner) speaking and thinking in a natural language might be—and whether there is indeed a difference.

2.2 Inner Speech and Thought Facilitation

There have been several suggestions as to how inner speech might play a substantive role in facilitating thought or thought processes—a role that goes beyond merely expressing thought processes.

First, inner speech is often thought to play an important role in working memory. According to Alan Baddeley’s influential theory of working memory (e.g., Baddeley 1992), we can retain a series of words or numbers in working memory by reciting them in inner speech. A short series of items will be retained long enough to recite them again. One can iterate this process via a “phonological loop” for as long as desired.

Following Vygotsky (1934 [1986]), Clowes (2007) and Jorba & Vicente (2014) hold that inner speech can serve as a tool for directing our own attention, just as external speech can serve as a tool to direct the attention of others. In making this case, both draw on the Vygotskyan developmental account of inner speech, on which inner speech is derived from the external phenomenon. See also Martínez-Manrique & Vicente (2015), who make the same point but are less directly influenced by Vygotsky’s original (1934 [1986]) developmental account.

There is evidence that inner speech facilitates various executive function tasks, such as planning, task-switching, and inhibiting impulsive and inappropriate responses, without being essential to them. The evidence that inner speech can play a role in these tasks is primarily empirical. For reviews of the relevant literature, see Alderson-Day & Fernyhough (2015) and Petrolini, Jorba, & Vicente (2020).

Munroe (2022b; forthcoming) argues that inner speech plays a role in reasoning which goes beyond merely aiding or improving it. He notes that reasoning processes often involve preserving representations in working memory. In doing complex mental arithmetic, for example, one might recite in inner speech the word for a number which they have determined will be needed later in the process, e.g., when regrouping values (i.e., “carrying” and “borrowing”). The number word will be stored in working memory via the phonological loop described above. But, on Baddeley’s model of working memory, which Munroe is working with, only sensory representations can be stored in working memory. In the present context, this means that only auditory representations of the relevant word sounds can be stored, not the conceptual content which the word would have if spoken aloud (or, possibly, if it were produced in inner speech in a different context, depending on one’s view on the contents of inner speech; see Section 3). When one needs to use the number at a later stage in the process, they will need to interpret the sensory representation which they are producing. For example, if they are reciting a sound corresponding to the word, “six”, in inner speech, they will need to interpret that as the word referring to the number, six, so that six becomes the number that they now use to continue their calculations. If this is so, then interpreting the inner speech that one was producing, and thus the inner speech itself, was essential to the reasoning process, not merely a dispensable aid. Munroe holds that the same will apply in many reasoning processes that require making use of an intermediate conclusion.

A number of theorists—especially those working in neo-empiricist (Barsalou 1999; Prinz 2011, 2012) and embodied cognition traditions (Borghi et al. 2017; Dove 2014)—have also proposed that inner speech plays an important role in facilitating abstract thought, i.e., thought about objects or properties that are not easily perceived. Here the idea is that language perception and production abilities—and their internalization, via inner speech—provide means for explaining the acquisition and use of abstract concepts in broadly sensorimotor terms. In particular, Guy Dove (2014, 2018, 2020, 2022) develops a view where language—often in the form of inner speech—is used as a “scaffold” or “tool” for enabling thought about abstract entities, and where the capacity for abstract concept use is closely tied to the capacity for language.

Finally, if subsystems and modules in the mind function in isolation from one another to any significant extent, then inner speech may play an important role in integrating their output. Carruthers (2002, 2006) suggests that the process of language production generally, including the production of inner speech, is especially well suited to integrate the output of multiple modules, because of the combinatorial nature of language. In producing an episode of inner speech, one can thus express complex content, which is then distributed to mental modules and subsystems for further processing. Other sources relevant to inner speech and the integration of information produced by different parts of the mind include Baars (1988) and Dennett (1991).

2.3 Inner Speech as Thought

A number of philosophers have argued that at least some inner speech episodes actually are thoughts or, at least, parts of thought processes. Gauker (2011, 2018) holds that all conceptual thought occurs in inner speech, where, as elaborated in Section 3.2, he takes inner speech to involve the tokening of items of a natural language in neural states that are distinct from the auditory-verbal representations that many identify with inner speech. In his 2011 book, he responds to arguments that conceptual thought cannot occur in natural language.

With respect to inner speech understood as a partly sensory phenomenon, Keith Frankish (2018) describes how inner speech can be used to break a complicated problem into smaller problems, which can then be addressed by lower-level, automatic thought processes. In deciding whether to accept an invitation from colleagues to attend a party, for example, one might produce the inner speech episode, “What will it be like?”. This more circumscribed question can be addressed by autonomous processes, such as recalling previous parties with colleagues. Along with other autonomous processes, this might generate the prediction that an annoying colleague, Henry, will likely be at the party. If this is significant, it could result in the inner speech episode, “Henry will probably be there”, in turn prompting a largely autonomous evaluation of the effort involved in enduring Henry’s company. The process could result, depending on the outcome of this evaluation, in producing the inner speech episode, “I can’t face that; I won’t go”. (Quotes from Frankish [2018: 234], though the example is slightly modified.) The inner speech episodes, Frankish believes, are critical to making the decision, and are thus rightly considered parts of the process of thinking itself. See Kompa (forthcoming) for a similar argument. Compare Munroe (2022b), discussed above, who also holds that an inner speech episode can be essential to a thought process but does not infer from this that an inner speech episode can actually be a part of the process (though see the discussion of Munroe (2023) below).

Frankish (2018) also holds that inner speech episodes can be thoughts in the form of conscious commitments, where these are “a distinct kind of mental attitude” (2018: 237), which cannot be analyzed in terms of other conscious mental states, such as conscious decisions, beliefs, or desires, or expressions of other mental states. They are simply commitments made to oneself to “regulat[e] our future activities, including our intentional reasoning, in line with the choice or view expressed” (2018: 237). For example, the inner speech episode, “I will go to the gym today”, is a commitment to go to the gym today, not just the expression of a decision to do so, because it also generates a kind of obligation to oneself, as it were, to do so. For Frankish, this follows from treating inner speech as an internalized version of interpersonal speech, in which commitments also generate obligations.

On Frankish’s account, an inner speech episode can be like a judgment, insofar as it may involve committing oneself to act and reason in a way which is consistent with the truth of the proposition expressed by the inner speech episode. Munroe (2023), by contrast, holds that an inner speech episode can actually function as a judgment. If an inner speech episode is accompanied by what has been called a “Feeling of Rightness” (Munroe cites Thompson et al. 2013 and Unkelbach & Greifeneder 2013), then it will play roles typically attributed to judgments such as “terminating inquiry and causing overt actions” (Munroe 2023: 309). Munroe connects his claim to a model proposed by Ackerman & Thompson (2015, 2017a, 2017b) on which the roles that mental states play are determined partly by metacognitive monitoring. The “Feeling of Rightness” is a cue to a metacognitive monitoring system that a particular mental state can appropriately play the roles of a judgment. Munroe’s claim is that inner speech episodes can function as judgments if this is deemed appropriate by the metacognitive monitoring system, on account of being accompanied by the appropriate “Feeling of Rightness” (or at least by a feeling of sufficient certainty).

Nikola Kompa (forthcoming) adds a quite different argument for the identity of (some) thoughts and (some) inner speech episodes. She operates with a broad notion of inner speech, on which any “inner episode that substantially engages the speech production system” is an instance of inner speech (forthcoming: 4, emphasis removed). On this understanding of inner speech, any thought with semantic content and syntactic structure will be an instance of inner speech, even if it does not become conscious. Kompa rejects the language of thought hypothesis, on which thoughts can have linguistic properties because they occur in a non-natural language. Accordingly, for Kompa, the only way that a thought can have semantic content and syntactic structure is if its formation substantially involves the speech production system (which she understands in Leveltian terms, citing Levelt 1989; Levelt et al. 1999; and Indefrey & Levelt 2004). Insofar as we have any thoughts that have semantic content and syntactic structure, then, these are, on her definition, instances of inner speech. If the production of such thoughts does not proceed further through the speech production process, such that they are morpho-phonologically encoded in addition to having semantic content and syntactic structure, they will occur as unconscious inner speech episodes.

Finally, it has been suggested that there is a close connection between inner speech and a phenomenon known as “unsymbolized thought”. Using the Descriptive Experience Sampling paradigm, Russell Hurlburt and Christopher Heavey (e.g., Hurlburt & Heavey 2002; Heavey & Hurlburt 2008) have gathered introspective data that they interpret as providing evidence that people sometimes have the experience of

thinking a particular, definite thought without the awareness of that thought’s being conveyed in words, images, or any other symbols. (Heavey & Hurlburt 2008: 802)

Martínez-Manrique & Vicente (2015), Vicente & Martínez-Manrique (2016), and Vicente & Jorba (2019) suggest that these “unsymbolized thoughts” occur when the production of an inner speech episode is aborted at the earliest stage of production, when only the content or message to be expressed has been formulated. Appealing to accounts on which we experience conscious representations of actions which we begin to perform but abort, they suggest that an unsymbolized thought is a representation of the message which one commenced expressing in inner speech, which becomes conscious because the process was aborted. Insofar as the process was aborted prior to the message being organized in phonetic form, the representation is entirely amodal. See also Kompa (forthcoming).

3. Content-Based Theories of Inner Speech

We have seen that there are a variety of views taken on whether inner speech is indeed a kind of speech, or a kind of thought, or both. A popular way to gain added leverage on those questions is to advance an account of the contents of inner speech. Focusing on questions concerning the contents of inner speech also helps to clarify the depth of some of the puzzles and controversies already introduced.

Most generally, the content of a representation is what the representation is of or about—it is what the representation represents. The content of the word “cat” is a certain type of animal (namely, a cat). And, the content of the sentence “cats are animals” is the proposition that cats are animals. Two distinct representations can have the same content. For instance, the French word “chat” has the same content as the English word “cat”; and the French sentence “les chats sont des animaux” has the same content as the English sentence “cats are animals”. Thus—to borrow analogies from Siegel (2005 [2021])—the contents of a mental state, in the present sense, are akin to the contents of a newspaper article and not akin to the contents of a bucket. Mental contents are not things that are contained within mental states themselves (just as cats are not contained within the word “cat”) but are, instead, what the mental states are of or about.

We will distinguish three broad classes of views about the contents of inner speech and several sub-views within them, noting their main motivations and relationships to questions concerning inner speech’s proposed cognitive roles. According to what we will call the “phonological content view”, inner speech episodes always and only have phonological contents. The competing content-based theories to be discussed hold either that inner speech only has semantic contents (the “semantic content view”, as we will call it) or that inner speech has phonological contents and semantic and/or other kinds of contents (the “mixed contents view”).

As we will see, the phonological content view is a natural fit with the view, discussed at the beginning of Section 1, that inner speech is merely a representation of speech and not actually a kind of speech. This is because the phonological content view sees inner speech as consisting in imagistic representations of speech and as lacking the kinds of contents (or meanings) associated with word tokens themselves. Likewise, those who hold that inner speech is actually speech will typically hold either a mixed contents view or a semantic content view, as these views allow inner speech episodes to have the kinds of semantic contents that are typically viewed as essential to being a linguistic token.

3.1 The Phonological Content View

To say that inner speech has phonological contents is to say that inner speech episodes represent phonemes (or phones), where phonemes are the most basic meaningless building-blocks from which any word of a language can be built. There are 44 phonemes in English, different combinations of which account for the distinct sound each word has in relation to all other words from which it can be aurally distinguished. The notion of a phoneme is somewhat of an abstraction, however, as slightly different sounds (in terms of pitch, timbre, and frequency) can fall within the sonic range that constitutes a single phoneme type. These more specific, concrete sounds that can qualify as instantiations of a phoneme are known as phones. Whether inner speech episodes represent phonemes or, instead, the finer-grained property of being a phone is a matter of dispute among those who hold that inner speech episodes have phonological contents (Patel 2021; Langland-Hassan 2018; Hill 2022).

Note also that, while the phonemes of most natural languages are auditory in nature—and are thus perceived through the sense of hearing—the notion of a phoneme has also been applied to gestural languages, such as American Sign Language (Sandler 2012; Stokoe 2005). So, the concept of a phoneme is not specific to any modality. It refers to the smallest meaningless units of a language that can be arranged and recombined to form the smallest meaningful units of that language, no matter which modality the language occurs in. In spoken languages, however, the auditory modality takes precedence over the visual/written modality, insofar as the phonemes are typically held to be sounds, while the graphemes are held to be letters or groups of letters that represent phonemes. While most will not consider the visualization of graphemes and written words to be cases of inner speech, it bears noting that such visualizations satisfy the neutral characterization of inner speech provided at the outset.

There are several reasons one might hold that inner speech episodes have phonological contents. The first is phenomenological in nature. What it is like to have an inner speech episode is similar to what it is like to hear oneself saying the corresponding words aloud. One might explain this phenomenological similarity by appeal to the fact that inner speech episodes and the corresponding cases of hearing represent similar properties—either phonemes or phones of a certain sort—and, accordingly, have similar contents. A second reason appeals to the fact that we can use inner speech episodes to judge whether two visually dissimilar written words—such as “blood” and “mud”—rhyme. As rhyming is a relationship between the sounds of words, the usefulness of inner speech episodes in judging rhymes would be explained if inner speech episodes represented word sounds and thereby allowed us to compare those sounds (Langland-Hassan 2014). A third reason that has been proposed for thinking that inner speech has phonological contents is that it is the representation of those features that allows us to discern which language we are exploiting when engaged in inner speech (Langland-Hassan 2018). (See Patel 2021 for a rebuttal.)

Jackendoff (1996, 2007, 2011) proposes that auditory contents exhaust the contents of inner speech. Jackendoff’s view is motivated in part by a prior commitment to the thesis that we do not think in a natural language. Like many in cognitive science, he sees natural language primarily as a means for communicating thoughts that themselves occur unconsciously in some other medium (such as a Fodorian “Mentalese”). According to Jackendoff, thought itself is never conscious, nor is the use of concepts. By contrast, inner speech—what he calls the “talking voice in the head” (1996: 10)—occurs consciously and does not involve the use of concepts. In having inner speech, he explains, “[w]e experience organized sounds”, whereas,

the content of our experience, our understanding of the sounds, is a different organization … called conceptual structure. (emphasis original, 1996: 12–13)

“The organization of this content”, he holds, “is completely unconscious” (1996: 13). Jackendoff identifies the inner voice with a representation of “phonological structure”, a representation having phonological content, yet no conceptual or semantic content. By contrast, the mental states constituting our understanding of what the voice is saying are, he notes, distinct conceptual states that occur unconsciously:

What we experience as our inner monologue is actually the phonological structure linked to the thought,

he explains.

We are aware of our thinking because we hear the associated sounds in our head. (Jackendoff 2011: 613)

(See also Jackendoff [2007: 80–85] where he remarks on the counterintuitive nature of his view: “How can the contents of consciousness consist of just a string of sounds?” [2007: 85].)

It should be noted that Jackendoff also suggests that inner speech episodes “express” thoughts, which would seem to support the view that such episodes have the semantic contents of our thoughts (e.g., “the linguistic modality can make reasons as such available in consciousness” [1996: 19] and “only through language can such concepts form part of experience rather than just being the source of intuitive urges” [1996: 23]). On the other hand, he equally emphasizes the overlooked fact that “linguistic structure has three major departments: phonological, syntactic, and semantic/conceptual structure”, and that “the forms in awareness—the qualia—most closely mirror phonological structure” (2007: 81). Most recently, he has proposed a view where what we intuitively mark as “conscious thought” has three components: a “pronunciation” of the thought, a feeling of meaningfulness, and the meaning attached to the pronunciation. There he holds that only the first two are conscious and appears to identify inner speech with the “pronunciation” component. This is in keeping with the phonological content view, as the (semantic) meaning of the pronunciation is something separate from the pronunciation and is only represented “backstage” (i.e., unconsciously) (Jackendoff 2012: 84–5).

Langland-Hassan (2014) provides a qualified defense of a phonological content view, motivated by worries about how a single mental state—in particular an episode of inner speech—can be said to represent both word sounds and word meanings simultaneously. He notes that a word’s meaning and its sound are entirely distinct properties, related only by convention. If mental states are individuated by their contents, then it seems that distinct neural or functional states will be needed to represent these distinct properties. This has become known as the “binding problem” for inner speech (see Munroe 2023; Patel 2021; Bermúdez 2018 for different approaches to resolving it; see also Prinz 2011 for related remarks). In light of this problem, Langland-Hassan proposes that ordinary episodes of inner speech likely consist in two or more mental states triggered at roughly the same time (this would be a multiple-state version of the “mixed contents” view, discussed below). Yet he adds that, when inner speech has been divided into distinctly occurring states in this way, there are good reasons to identify inner speech solely with the component that represents word sounds. Doing so results in a phonological content view.

3.2 The Semantic Content View

In contrast to the phonological content view, the semantic content view holds that inner speech episodes always and only have semantic contents. By “semantic contents”, we mean the kinds of contents had by ordinary words, phrases, and sentences of a natural language. Such contents are typically equated with the meaning of a word, phrase, or sentence.

One version of a semantic content view, defended by Christopher Gauker (2011, 2018), holds that inner speech episodes exclusively have semantic contents and entirely lack both auditory contents and auditory phenomenology. Gauker allows that episodes of auditory verbal imagery often accompany inner speech. However, on his view, this auditory imagery is not to be identified with inner speech itself. Rather, according to Gauker, inner speech is a non-sensory linguistic phenomenon occurring in the brain that is (often) represented by episodes of auditory verbal imagery. Just as we may use auditory representations to represent someone else’s speech that we are actually hearing, so too, for Gauker, our inner speech is often represented by verbal imagery—imagery that is in fact distinct from the (inner) speech itself. (Here Gauker develops related remarks of Wilfrid Sellars [1956].) Notably, Gauker (2018) grants that, in the case of inner speech, this auditory-verbal imagery misrepresents our inner speech as having sonic features (i.e., as instantiating phones or phonemes), given that the neural events that constitute inner speech episodes are themselves silent.

Gauker’s style of pure semantic content view is not widely endorsed. This may be because it clashes with the widespread view that inner speech has a sensory character similar to that of hearing speech. On the other hand, Gauker’s view can be said to have an advantage in providing a literal sense in which, when we engage in inner speech, we are thinking in words of a natural language and not merely about them. On Gauker’s (2011) view, the neural events that carry semantic content are themselves tokens of words and phrases of a natural language, and the question of how auditory-verbal images can also be linguistic tokens does not arise. His view is also motivated by an opposition to what he calls the “Lockean” view that sees conceptual thought as something prior to and separate from the speech that expresses it. One can see Gauker (2011) as trying to preserve the idea that abstract (conceptual) thought occurs in a language (and is often non-conscious), while divorcing it from the thesis that there exists an innate, Fodorian “language of thought” (and one that must be exploited in order to learn a natural language).

Bermúdez (2018) offers a different style of semantic content view that allows for inner speech to retain a characteristic auditory phenomenology. According to Bermúdez, the auditory sensory character of inner speech is a result of inner speech episodes having non-representational auditory properties. For Bermúdez, the only representational contents had by inner speech episodes are those pertaining to the meanings of words. In response to those who argue that inner speech episodes must also have phonological contents (e.g., to explain why we can use inner speech to judge whether two words rhyme), he argues that there is no entailment from the fact that inner speech episodes can be useful in judging rhyme relations to the conclusion that they represent phonemes (2018: 216–7).

A third type of theory on which inner speech exclusively has semantic content proceeds by arguing that inner speech is a genuine form of speech. This argument is typically made on either phenomenological or functional grounds. From there it is inferred that inner speech must have the same kind of contents as external speech. If episodes of external speech—i.e., the words we hear when someone speaks—have semantic content but no phonological content (because they do not represent phonemes), so too must episodes of inner speech. This approach to theorizing about inner speech is discussed in more detail in Section 1. Assuming that (unlike Gauker) proponents of such a view wish to maintain that inner speech episodes constitutively have auditory sensory character, they may concur with Bermúdez in his claim that the auditory phenomenology of inner speech does not entail the representation of auditory properties; or, alternatively, they may provide some other account of why, in many instances, inner speech seems to represent phonemes even if it does not really do so.

3.3 Mixed Contents Views

Mixed contents views hold that inner speech episodes typically have at least two kinds of content—phonological and semantic—simultaneously. On a mixed contents view, the inner speech episode “Dogs are mammals” represents both the sound of the sentence “Dogs are mammals”, as uttered aloud, and the proposition that dogs are mammals. We can distinguish two species of mixed contents view: single-state and multiple-state. Single-state views hold that what we intuitively mark as a single inner speech episode consists in a single mental state that has both auditory and semantic contents. Multiple-state views hold that the apparent unity of a single inner speech episode is in some sense illusory, as such episodes typically consist in the contemporaneous occurrence of two or more mental states, where one of the states represents phones or phonemes and another has semantic contents. (Some multiple-state views hold that inner speech episodes involve additional distinct states with articulatory and syntactic contents as well.) As earlier noted, some phonological content views hold that mental states with corresponding semantic contents occur contemporaneously with the representations of phonemes that are identified with inner speech. These phonological content views differ from multiple-state mixed contents views in that the former identify inner speech solely with the state that has phonological content, perhaps on the grounds that it is the only sort of state of which one is consciously aware (this appears to be Jackendoff’s motivation).

3.3.1 Single-State Mixed Contents Views

Carruthers (2011, 2018) defends a single-state mixed contents view, proposing that inner speech involves the generation of a representation of word sounds (i.e., phonemes) which—in a process akin to what occurs in outer speech perception—is then interpreted by one’s speech comprehension mechanisms so that a semantic content can then be assigned to the represented utterance. (He notes that a representation of the semantic content of the represented phrase—referred to as the “message” on Levelt’s [1989] speech-production framework—sometimes precedes the representation of the word sounds, albeit non-consciously.) Once the represented word sounds are interpreted, Carruthers suggests, the information that the represented utterance has a certain semantic content is “bound into” a single “event-file” that contains information both about the sound and the meaning of the represented utterance (2018: 41–42). (See Frankish [2004: 57; 2018] for a similar view.) Carruthers analogizes such binding to the way in which the color, shape, and category properties of a visually perceived object are said to be “bound into” a single object-file that accumulates multiple forms of information about a single object, despite those properties being represented in temporally distinct stages and in distinct neural regions. These event-files, when activated and globally-broadcast, are said to constitute a single conscious inner speech episode that has both auditory and semantic contents.

Munroe (2023) develops a similar style of single-state mixed contents view, arguing that, in addition to representing phonemic and semantic features, inner speech episodes also represent the likelihood that the content of the represented utterance is true. The latter is necessary, he holds, for inner speech episodes to qualify as judgments (see Section 2.3). These three distinct features are, for Munroe, bound into a single mental state in the sense that a single mental state predicates these three distinct properties of a single represented utterance (Munroe 2023: 304).

3.3.2 Multiple-State Mixed Contents Views

Other mixed contents views of inner speech—inspired by Levelt’s (1989) multi-stage model of speech production—attribute the different representational contents entertained during an inner speech episode to multiple distinct states that tend to co-occur. Martínez-Manrique & Vicente (2015) defend a multiple-state view under the moniker of the “activity view” of inner speech, highlighting the multi-component processes of both inner and outer speech. “It is quite natural”, they explain,

to try to understand inner speech in terms of all the representations that are mobilized in speech, i.e., semantic, syntactic, maybe articulatory …. The representations involved—from conceptual to phonological—form an integrated system. (2015: 8)

The view which Martínez-Manrique & Vicente set out in their 2015 paper bears clear similarities to the actual speech view, insofar as they hold that inner speech is functionally similar to external speech. What separates it from the actual speech view, however, is that they do not hold that inner speech consists of actual words and sentences which express semantic content, but of distinct representations of phonological and semantic (and other) content. (For complementary multiple-state mixed contents views in cognitive neuroscience, see Grandchamp et al. 2019 and Lœvenbruck et al. 2018.) While these representations are unified in the sense of occurring within a single system for language production, they remain distinct mental states—distinguished, in part, by their distinct contents, and their ability to occur in isolation from each other. (Note, however, that this way of categorizing the view assumes that each mental state is composed of exactly one mental representation. It may be possible to articulate a view where one mental state is composed of multiple mental representations. The question then becomes: in virtue of what do the multiple representations qualify as a single mental state, as opposed to components or stages of a single cognitive system?)

Christopher Hill (2022: 136–139) develops a similar multiple-state mixed contents view, emphasizing that the representations of semantic content lack any associated phenomenology. The phenomenology of inner speech is, for Hill, entirely a function of its auditory-phonological contents. He further specifies that these phonological contents are (the more abstract) phonemes, and not phones, to account for the relatively impoverished sensory character of inner speech in comparison with speech perception. Patel (2021) also defends a multiple-state mixed contents view, on which, in addition to having some combination of semantic, syntactic, auditory, and articulatory contents, inner speech episodes have vocal contents. To have vocal contents is to represent some particular person’s voice as communicating some combination of semantic, syntactic, auditory, or articulatory information. According to Patel, whether we are representing the semantic, auditory, or articulatory contents, these mental events involve one’s representing a certain person’s voice as attempting to convey such information. This common representation of a voice, he argues, provides a kind of unity to the class of mental events that can be considered inner speech.

Because multiple-state views allow that the distinct components of inner speech can potentially occur in isolation, they face a question of which components need to occur for the episode to be properly counted as an instance of inner speech. Vicente & Jorba (2019), Martínez-Manrique & Vicente (2015), and Vicente & Martínez-Manrique (2016) see this as an advantage, insofar as it allows them to place different phenomena related to inner speech on a single continuum (see also Kompa & Mueller forthcoming and McCarthy-Jones & Fernyhough 2011). For instance, when the semantic and syntactic contents of ordinary inner speech are represented in the absence of any auditory-phonological contents, they propose, this can be understood as a case of so-called “unsymbolized thought” (Heavey & Hurlburt 2008; Heavey, Moynihan, et al. 2019). See Section 2.3 for further detail.

A notable feature of the surveyed mixed contents views (as well as the phonological content view) is that they need not (and often do not) hold that inner speech episodes occur in a natural language. Rather, on these views, inner speech episodes represent natural language utterances (in virtue of their phonological contents), without necessarily being instances of such utterances themselves. This is because, on mixed contents views, the semantic content of an inner speech episode may not be represented by tokens of a natural language. For instance, for Carruthers, the semantic contents of an inner speech episode are represented via symbols of an amodal language of thought (e.g., a Fodorian [1975] Mentalese), which are coupled with sensory representations of the sound of the corresponding sentence as spoken aloud. One language (Mentalese) is used to represent the meaning of an expression in another (e.g., English). In this way, Carruthers (2010, 2018) deviates from Carruthers (1996), with the latter defending the idea that inner speech episodes literally occur in—and are expressions of—a natural language. Carruthers now emphasizes the point, raised also by Machery (2005), that introspection does not provide grounds for claims about the representational format of our inner speech episodes.

4. Inner Speech and Pragmatics

In general, the philosophy of language has focused primarily on language used interpersonally. It is natural to wonder to what extent this material is applicable to inner speech. This question can be asked whether or not one thinks that inner speech is actually a kind of speech, as no one denies that there is some interesting relationship between inner speech and interpersonal speech.

4.1 Inner Speech and Speech Acts

As mentioned in Section 1, the intuition that we can perform speech acts in inner speech is the basis of an argument that inner speech is a kind of speech. There are different ways, however, that we might understand the claim, depending on how one thinks of speech acts.

On the traditional analysis of Austin (1962) and Searle (1969), performing a speech act is inherently something one does in accordance with conventions tacitly understood by both speaker and listener. For example, for Searle, asserting p involves (approximately) undertaking to someone that p is true, where the speaker does not know that the listener already knows that p is true. The reason that an assertion can be effective is precisely that both speaker and hearer understand that this is the nature of the transaction. It is hard to see how this kind of analysis could apply to inner speech. One would need to explain how one individual can have two distinct roles, as speaker and listener, such that the conventions that make interpersonal language-use possible can have any relevance (see Gregory 2017, 2020a for related discussion).

Not every version of speech act theory, however, emphasizes conventions. Drawing on some ideas from Strawson (1964) and Bach & Harnish (1970), though not adopting their theories in whole, Wilkinson (2020) holds that what is essential to speech acts is that they express particular mental states. An assertion, for example, is simply an utterance which expresses a belief; a question is an utterance which expresses a desire to acquire certain information; etc. On this view, understanding someone else’s utterance is simply a matter of grasping its content and knowing what kind of mental state the relevant type of utterance expresses. Setting aside the question of whether one needs to interpret their own inner speech, it may be that inner speech episodes can be speech acts if one thinks of speech acts merely as expressions of particular mental states, rather than as actions which depend on conventions in the way that Austin and Searle suggest. For another analysis of inner speech in terms of speech act theory, see Geurts (2018), who emphasizes that inner speech episodes can operate to generate commitments in a way characteristic of speech acts; see also Frankish (2018) and Fernández Castro (2019).

An issue which sits just behind the question of whether inner speech episodes are speech acts is whether they are actions at all. Gregory (2020b) argues that, in the vast majority of cases, inner speech episodes are not actions, because we cannot give reasons for them (which is the criterion for actionhood on Davidson’s [1963] causal theory); they are not subject to our control (the criterion on Harry Frankfurt’s [1978] guidance theory); and we do not try to produce them (the criterion on O’Shaughnessy’s [1973] theory and Hornsby’s [1980] theory). If inner speech episodes are not actions, then they cannot be speech acts.

Tom Frankfort (2022) takes the opposite view. He observes that a great deal of inner speech is involved in deliberation, where this is an expansive category including “reflecting, reasoning, considering, evaluating” (2022: 52). He then applies Mele’s (2009) distinction between actions which involve “trying to bring it about that one x-s” (Mele 2009: 18) and actions which are done in order to bring it about that one x-s. Frankfort suggests that deliberating is an action in the first sense, insofar as it involves (for example) trying to make a decision, and that inner speech episodes are actions in the second sense, insofar as they are produced in order to bring it about that one deliberates successfully and (for example) comes to a decision.

Jorba (forthcoming) also holds that inner speech episodes are typically actions, applying affordance theory. Affordances are opportunities for actions suggested by things in one’s environment. For example, an apple has the affordance of being edible; a cup has the affordance of being graspable. Some hold that affordances can also be things which suggest mental actions (Jorba cites McClelland 2020 and Jorba 2020). Jorba’s suggestion is that some mental states afford the production of inner speech episodes. For example, an inchoate thought affords being articulated clearly in inner speech, and an emotion can afford being labeled. Insofar as inner speech episodes are produced in response to affordances, they are actions and, specifically, speech acts. See Bar-On & Ochs (2018) for another account on which inner speech episodes can be “acts of innerly speaking our mind” (2018: 19, emphasis removed).

4.2 Inner Speech and Conversation

Closely related to the question of whether there can be speech acts in inner speech is the question of whether inner speech can involve a kind of dialogue or conversation. A theory which characterizes inner speech this way has been developed at length in psychology, primarily by Charles Fernyhough (e.g., 1996, 2008, 2009). However, the suggestion has been made in a variety of ways by philosophers as well, including by Machery (2018), Frankish (2018), Gauker (2018), and Wilkinson, in collaboration with Fernyhough (Wilkinson & Fernyhough 2018).

The idea that inner speech involves an internal dialogue or conversation clearly has intuitive appeal for some. One often finds inner speech described outside the philosophical context as the “inner dialogue”. But, if inner speech involves a kind of internal dialogue or conversation on more than a metaphorical level, then it is natural to wonder who the interlocutors are (Gregory 2020a). Machery (2018) and Frankish (2018) suggest that different parts of the brain communicate with one another via inner speech. Gauker (2018) suggests that inner speech involves conversing with oneself (see also the discussion of inner speech as a means for interaction between subsystems or modules in the mind in Section 2.2). One difficulty with both of these suggestions, however, is that philosophers of language generally (though not universally) think of conversation as fundamentally involving distinct human agents.

Gregory (2017) appeals to Grice’s (1975 [2013]) account of conversation to make this point. Grice argued that conversations are “characteristically … cooperative efforts” (1975 [2013]: 314). But cooperation requires multiple agents and there is only one agent in inner speech. That said, Gauker (2011, 2018) is working with an explicitly non-Gricean picture of conversation, motivated by an opposition to the doctrine that speech acts serve to express thoughts that are distinct from and precede the expressive utterance. He holds that speaking is,

in the first instance, something we do whenever there is no reason not to, because of the good it tends to do. (2018: 71)

In certain circumstances, where multiple individuals are present,

[a] conversation can be the occasion for each interlocutor to reflect on what he or she has experienced, … and on that basis to elicit a statement that is useful from the other. (2018: 72)

Insofar as we can generate inner speech episodes which cause us to reflect on some matter and then produce further inner speech episodes which are useful for us in the context, inner speech will be conversational. Gauker’s analysis here obviously reflects the expression-oriented approach to the question of whether there can be speech acts in inner speech.

In contrast to Gauker, Deamer (2021) argues that inner speech can be seen as being communicative in a Gricean sense. She holds that, to at least some extent, humans are “self-blind”: mental states such as our intentions are not always transparent to us. When we produce inner speech, we reveal our communicative intentions to ourselves, just as we reveal our communicative intentions to others when we converse with them.

While there is disagreement as to whether a series of inner speech episodes can be a dialogue in a literal sense, most agree that inner speech often closely resembles dialogue. As Gauker notes, one episode of inner speech will often prompt another, as happens in interpersonal dialogue. We can produce episodes of inner speech corresponding to different points of view, e.g., when thinking about the considerations for and against some course of action, in a way similar to two people with different opinions. Some participants in studies report that some of their inner speech episodes take place in the voices of others (McCarthy-Jones & Fernyhough 2011; Alderson-Day, Mitrenga, et al. 2018). This last consideration raises an important issue. We can certainly imagine conversing with others and we can certainly imagine others conversing. Such cases are usually taken to be distinct from inner speech (see Section 1). However, if inner speech can involve the voices of others, possibly expressing viewpoints other than our own, it becomes difficult to say how instances of inner speech with these characteristics differ from cases of imagining others speaking. How to delineate the extension of “inner speech” in a way that distinguishes inner speech acts from cases of (merely) imagining speech remains an underexplored issue.

5. Self-Knowledge and Metacognition

Inner speech plays an important role in a number of philosophical accounts of self-knowledge and metacognition. By “self-knowledge” we will mean knowledge of one’s own mental states, including both dispositional states—like beliefs, desires, and intentions—and occurrent states, such as thoughts, imaginings, decisions, and judgments. The notion of metacognition is somewhat broader, also encompassing judgments and non-cognitive assessments (e.g., “feelings of knowing”) concerning the validity of one’s own reasoning, the quality of one’s evidence, one’s degree of certainty, and so on (Proust 2013). While some theorists implicate inner speech in their accounts of both self-knowledge and metacognition (Jackendoff 1996; Clark 1998; Bermúdez 2003, 2018), others focus more narrowly on the question of how inner speech might facilitate self-knowledge (Byrne 2018; Carruthers 2011; Roessler 2016). A common thread among theorists who invoke inner speech in their accounts of metacognition or self-knowledge is the idea that certain others of our mental states—namely, those that our inner speech helps us to know about—are either less readily available to introspection or less well suited to serve a metacognitive role. Thus, these views all appear against a backdrop of broader commitments about the nature of mental states and our introspective access to them.

5.1 Metacognitive Approaches

One approach sees inner speech as especially well suited to aid in metacognition due to its linguistic structure, or its link to public language more generally. According to Andy Clark (1998), the fact that inner speech occurs in a language—where such language is seen as abstracting away from the particularities of perception—allows it to play a special role in “second-order cognitive dynamics” (see also Prinz 2011, 2012). This, he holds, is because the natural language sentences featured in inner speech are “context resistant” and “modality transcending” in ways that facilitate a more objective and reliable assessment of the soundness of one’s own thought processes (Clark 1998: 178). Bermúdez (2003, 2018) builds on Clark’s proposal, specifying that awareness of inner speech is essential for enabling humans to become conscious of their own propositional thought processes, which are otherwise amodal and inaccessible to introspection. According to Bermúdez,

all the propositional thoughts that we consciously introspect … take the form of sentences in a public language. (emphasis original, 2003: 159–160)

While he does not identify these public language sentences with our core thought processes themselves—these, he holds, occur in a subconscious language of thought—Bermúdez argues that the linguistic structure of inner speech is needed to adequately represent the relationships of entailment and rational support that may (or may not) exist among the subconscious thoughts the inner speech episodes serve to express. As he puts it,

we think about thoughts through thinking about the sentences through which those thoughts might be expressed. (2003: 164)

Jackendoff (1996, 2007, 2011, 2012) and Prinz (2011, 2012) likewise hold that there is a level of conceptual thought that is not directly available to introspection and that inner speech is well suited for making us aware of such thoughts. Yet, for Jackendoff and Prinz, inner speech is able to play this role primarily because, like other imagistic mental states, inner speech occurs at an “intermediate” level of representation, which, on their theories, is the only level of representation at which mental states are consciously available to the subject. Thus Jackendoff’s comment that “we are aware of our thinking because we hear the associated sounds in our heads” (2011: 613). Echoing Bermúdez and Clark, Prinz finds it

likely that we often come to know what we are thinking by hearing inner statements of the sentences that we would use to express our thoughts (2011: 186)

and judges inner speech to be “a way of registering complex thoughts in consciousness” (2011: 186). (See also Machery 2005, 2018.)

5.2 Inferentialist Approaches

Several theorists, whom we will term “inferentialists”, follow Ryle (1949 [2009]) in his claim that we often come to know what we are thinking by “overhear[ing]”, or “eavesdrop[ping] on … our own silent monologues” (1949 [2009: 165]). On these views, we come to know what we are thinking, or what we believe or desire, by drawing a kind of inference (the nature of which differs, depending on the theorist) from the fact that we “hear” ourselves say something in inner speech. The views of Clark, Bermúdez, Jackendoff, and Prinz, already reviewed, are inferentialist in nature. Yet there are other approaches that incorporate inner speech into a process that is even more explicitly inferential.

Carruthers (2009, 2010, 2011, 2018) is an inferentialist of this latter sort. While, in earlier work, he argued that thought itself occurs in inner speech (Carruthers 2002), Carruthers later abandoned that idea to hold that thoughts (including one’s beliefs) are always unconscious. On this view, inner speech episodes remain more or less directly available to introspection, yet only provide a kind of indirect evidence for what we are in fact judging or deciding (or believing, desiring, or intending) unconsciously. He emphasizes the fallible nature of such inferences, arguing on the basis of various empirical studies that many of the inferences people arrive at about their own beliefs and desires are in fact incorrect. Similarly to Jackendoff and Prinz, Carruthers holds that only sensory states are able to serve as inputs to the mental mechanism responsible for self- (and other-) directed mindreading. These inputs include visual and other forms of sensory imagery in addition to inner speech. However, in cases where we are having thoughts about abstract matters that are difficult to unambiguously represent with other forms of imagery—such as the thought that philosophy is a challenging subject—episodes of inner speech are held to provide an especially important source of information that one is having such thoughts. Carruthers emphasizes that the process becomes especially inferential in nature where other contextual information—such as that one sees oneself lingering over a choice of cereal box—combines with one’s inner speech to generate an all-things-considered appraisal of what one is currently judging or deciding. Cassam (2011, 2014) likewise implicates inner speech in a multi-faceted inferentialist account of self-knowledge, though not pitched in terms of “mindreading” mechanisms or other constructs from cognitive science.

Alex Byrne (2011, 2018) puts inner speech to somewhat different ends in his inferentialist account of how we know what we are thinking. For Byrne, there is no such thing as inner speech, strictly speaking, because there are no sounds (or voices) in the head. However, there are such things as auditory-phonological representations of voices. These give rise to an apparent perception of what we come to think of as the “inner voice”. By trying to attend to what the inner voice says, Byrne proposes, we can reliably form judgments about what we are thinking. The epistemic rule he proposes for doing so is:

THINK: If the inner voice speaks about x, believe that you are thinking about x.

As with Carruthers, a key motivation for Byrne’s account of how we know what we are thinking is a background view—motivated by the work of Shoemaker (1994), Dretske (2003), and others—that we have no other, more direct introspective method for knowing our own thoughts (i.e., we lack something like an “inner sense”). Note that Byrne’s approach is inferentialist in that he takes inner speech to be implicated in inferences that lead to knowledge of one’s own occurrent thoughts. Yet the sort of inference involved is quite different from that envisioned by Carruthers and Cassam, who both hold that inner speech episodes are just one kind of information among many that may be brought to bear in inferences about one’s standing and occurrent mental states. Importantly, the form of inference envisioned by Carruthers and Cassam is essentially the same in its first- and third-person applications, whereas Byrne’s THINK rule describes an inferential procedure that can only be used to reliably generate true beliefs about one’s own mental states. In Byrne’s view, this helps to explain the “peculiar” nature of introspection, where this peculiarity lies in the fact that our methods for knowing our own mental states are (intuitively) different from those we use to know others’ (Byrne 2011, 2018). Further, on Byrne’s version of inferentialism, the inferences we form by trying to follow THINK are extremely likely to amount to knowledge—thereby cohering with the intuition that knowledge of one’s own current thoughts is epistemically privileged. By contrast, the kinds of metacognitive inferences that Carruthers and Cassam take to rely on inner speech are (by their own telling) epistemically on a par with our inferences about the mental states of others and far more susceptible to error.

5.3 Inferentialism’s Critics

Several philosophers object that inferentialist proposals leave us at too great an epistemic distance from our own thoughts (Bar-On & Ochs 2018; Roessler 2016) or have other unworkable features (Langland-Hassan 2014; Martínez-Manrique & Vicente 2010; Roessler 2016). Roessler (2016) pursues a non-observational account of the role of inner speech in generating self-knowledge. Rejecting the idea that we need to “eavesdrop on ourselves” by attending to our inner speech, Roessler suggests we follow remarks of Ryle (1949 [2009]) and Anscombe (1957) in understanding the knowledge gained through inner speech as a kind of “practical knowledge” (or, for Ryle, “serial knowledge”), where knowing what one is thinking is understood as a special case of knowing what one is doing.

Bar-On & Ochs (2018) likewise take aim at what they term “Neo-Rylean” invocations of inner speech, arguing that Byrne’s THINK rule fails to identify a special role for inner speech in facilitating self-knowledge. Drawing on Bar-On’s (2004) broader expressivist approach to self-knowledge, Bar-On and Ochs hold that a proper account of inner speech’s role in self-knowledge should show how such knowledge is “distinctive and uniquely first-personal” in that it is

knowledge that one can be said to have in virtue of being in a privileged position to give direct voice to one’s thoughts. (2018: 20)

They do not, however, develop a positive account in detail.

Vicente & Martínez-Manrique (2005, 2008; Martínez-Manrique & Vicente 2010) have criticized Bermúdez’s and related inferentialist views on the grounds that the semantics of natural language sentences—and inner speech episodes, in particular—are underdetermined in ways incompatible with providing knowledge of one’s thoughts. For instance, the sentence “Jane’s cup is full” is ambiguous in several ways, including the sense in which it is Jane’s cup (does she own it? is she just using it? is it the one she merely wants?) and the sense in which it is full (is it full of air? of liquid? of coins?). If the explicit meaning of a sentence is only extracted (and disambiguated) at the level of thought itself, they argue, it is unclear how awareness of semantically indeterminate inner speech utterances could suffice for awareness of one’s own—presumably explicit and unambiguous—propositional thoughts. Bermúdez replies in his 2018 paper.

Jorba & Vicente (2014) and Martínez-Manrique & Vicente (2015) criticize what they call the “format view” of inner speech (which they attribute to Jackendoff and others), according to which we are conscious of our inner speech episodes only because of their sensory format (see also Fernández Castro 2016). If these criticisms succeed, they cast doubt on views, such as those of Carruthers (2010), Jackendoff (1996), and Prinz (2012), which link the metacognitive or introspective value of inner speech to its occurrence in a sensory format.

Langland-Hassan (2014) raises a different sort of challenge for inferentialist views. Recall that it is a common assumption of those views that propositional thought itself is amodal (i.e., non-sensory) and non-conscious. For theorists such as Carruthers, Prinz, Jackendoff, and Bermúdez, inner speech is a conscious mental process just because it has sensory features that render it the sort of state that is apt to be conscious. Langland-Hassan argues that there is a conflict in holding that an episode of inner speech is a single mental state with both sensory features (relating to the representation of phonemes) and semantic features (relating to the meanings of the corresponding words). If this criticism is correct, it creates problems for the proposal that inner speech is especially well suited (due to its sensory character) to serve as input to inferences about one’s non-conscious mental states. Bermúdez (2018), Carruthers (2018), and Munroe (2023) have articulated different ways of responding to this challenge (see also Prinz 2011 for relevant remarks).

6. Auditory Verbal Hallucinations and Inserted Thoughts

Inner speech features prominently in philosophical and cognitive scientific discussions of auditory verbal hallucinations (AVHs) and thought insertion. Both are common symptoms of schizophrenia but can occur in other contexts (e.g., brain injury, drug use) as well. AVHs are hallucinatory experiences of another’s speech, while thought insertion is understood either as a non-veridical experience of having someone else’s thoughts in one’s mind (Wing et al. 1990), or simply as the delusional belief that someone else’s thoughts are in one’s mind (Andreasen 1984). Two central questions explored by theorists are, first, whether (abnormal) inner speech is indeed the basis of AVHs or thought insertion, and, second, what might lead an episode of inner speech to be experienced as an AVH or inserted thought.

On the first question, an initially plausible approach to AVHs is to hold that they are more a matter of hallucinatory speech perception than of (unwitting) speech production, and thus not well conceived as episodes of inner speech. Wu (2012) and Cho & Wu (2013, 2014) advance a theory of this kind, holding that AVHs result from the spontaneous activation of speech perception areas in the brain. On their account, inner speech—and, in particular, the neural regions implicated in speech production—are not implicated in AVHs. Despite the attractive simplicity of this account, most researchers have pursued options that explicitly involve inner speech, for several reasons. First, in formal surveys, patients often report that the phenomenological characteristics of their AVHs are different from those of hearing speech, insofar as their AVHs are not as subjectively “loud” as cases of hearing speech, are not equally rich in sensory features, and do not always seem to emanate from outside the head (Stephens & Graham 2000; Hoffman et al. 2008; Laroi et al. 2012; Nayani & David 1996; Stephane 2019). It appears that an explanation of the seemingly “alien” nature of these episodes, as well as of thought insertion, will require some other apparatus than an appeal to perception-like phenomenology. Given the need for such an alternative, one may hope to extend it also to cases of AVHs that are reported as having rich, perception-like phenomenological features (Langland-Hassan 2008; Moseley & Wilkinson 2014).

Second, neuroimaging has shown activation in both language perception and language production areas when patients are experiencing AVHs (Allen, Aleman, & McGuire 2007; Allen, Modinos, et al. 2012; Bohlken, Hugdahl, & Sommer 2017). Here, as in other areas of the study of inner speech, it is important to recognize that the neural regions underlying speech production (such as Broca’s area, in the left inferior frontal gyrus) are distinct from those governing speech perception (such as Wernicke’s area, in the superior temporal gyrus). This is why damage to one area but not the other (as in some cases of stroke) can result in markedly different language impairments. The fact that the mechanisms governing speech production and perception are dissociable in these ways provides an important means for assessing whether AVHs are best viewed as productive or perceptual (or both) in nature.

Nevertheless, those who see abnormal inner speech episodes as the basis for AVHs or thought insertion face a difficult task in explaining what would lead a person to not identify their own inner speech as their own, or to not feel in control of their own inner speech. Some have offered content-based explanations, where it is some feature of the semantic content of the inner speech that leads a person to disown it. For instance, Stephens and Graham (2000) argue that a patient may disown inner speech episodes with contents that are “intentionally inexplicable”, in the sense that they are not easily accommodated within a coherent self-narrative (see also Roessler (2013), Sollberger (2014), Bortolotti & Broome (2009), and Fernández (2010) on the idea that AVHs or inserted thoughts are episodes with contents the patient is unwilling to endorse). A challenge for this approach comes from patient reports of voices that are helpful or encouraging. As the Swiss psychiatrist Eugen Bleuler notes in early work on people with schizophrenia, “besides their persecutors, the patients often hear the voice of some protector”, and, occasionally, the hallucinatory voices “represent sound criticism of [the patient’s] delusional thoughts and pathological drives” (1911 [1950: 98]).

A popular alternative approach—sometimes known as the “comparator” or “sensory feedback” approach—builds on work in cognitive neuroscience concerning the mechanisms by which bodily movements are determined to be one’s own (Feinberg 1978; Frith 1992; Miall et al. 1993; Wolpert, Miall & Kawato 1998). The basic idea behind these approaches is that, below the level of consciousness, the brain is continually generating predictions about the likely sensory consequences of planned actions, which are then compared with actual sensory feedback. When there is a mismatch between the prediction and sensory feedback, one may have the phenomenological sense of not being in control of one’s actions (Frith 2012). A number of authors have proposed that the generation of both inner and outer speech may be attended by the same kind of prediction and comparison mechanisms, and that the malfunctioning of these mechanisms could lead to one’s own inner speech seeming not to be in one’s control (Blakemore, Smith, et al. 2000; Campbell 1999; Langland-Hassan 2016; Proust 2006). These proposals derive some support from the fact that people with schizophrenia have been shown to have broader deficits in automatically anticipating and adjusting for the sensory consequences of their own actions (Blakemore, Smith, et al. 2000; Blakemore, Wolpert, & Frith 1998).
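The core comparison step described above can be illustrated with a toy sketch. This is only a schematic illustration under simplifying assumptions, not a model drawn from the cited literature: the names `ComparatorResult` and `compare`, the representation of sensory signals as lists of numbers, the mean-absolute-error measure, and the `threshold` value are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ComparatorResult:
    error: float           # mean absolute mismatch between prediction and feedback
    self_attributed: bool  # True when the mismatch is small enough to feel self-generated


def compare(predicted: Sequence[float],
            actual: Sequence[float],
            threshold: float = 0.5) -> ComparatorResult:
    """Toy comparator: match predicted sensory consequences against feedback.

    The numeric encoding and the threshold are illustrative assumptions only.
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    error = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
    # A small mismatch is treated as a signal that the episode was self-generated.
    return ComparatorResult(error=error, self_attributed=error <= threshold)


# Accurate prediction: the episode is tagged as one's own.
matched = compare([1.0, 2.0], [1.0, 2.0])
# Faulty prediction: a large mismatch, so the episode is not tagged as one's own.
mismatched = compare([1.0, 2.0], [3.0, 4.0])
```

On this toy picture, a malfunction in generating the prediction (rather than in the speech itself) would be enough to make a self-produced episode register as not under one's control, which is the kind of possibility comparator theorists appeal to.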

Nevertheless, the comparator approach to AVHs and thought insertion has come in for criticism on several grounds (Synofzik, Vosgerau & Newen 2008; Vicente 2014; Vosgerau & Newen 2007). One complaint has been that the lack of sensory features associated with inserted thoughts, in particular, makes sensory-feedback approaches ill-suited to their explanation (Vosgerau & Newen 2007). In response, some defenders have shifted to pitching the thesis in terms of predictive processing models of perception and action (Gerrans 2015; Swiney 2018; Swiney & Sousa 2014; Wilkinson 2014; Wilkinson & Fernyhough 2017), while others have developed other alternatives (Langland-Hassan forthcoming). How best to characterize the phenomenology and underlying etiology of AVHs and thought insertion—and the relation of each to inner speech—and how precisely predictive processing models relate to the comparator approach remain active areas of research. See Wilkinson & Alderson-Day (2016) for an introduction to an edited special issue on the topic aimed at philosophers; see López-Silva & McClelland (forthcoming) for a philosophically oriented anthology on thought insertion. (Note: Parts of this section draw on a more in-depth overview in Langland-Hassan 2021.)


