Fae Initiative (Updated: April 26, 2025)
Keywords: Intrinsic Alignment, Possibility Space, Curiosity-Driven AI, Human Autonomy
Fields: Artificial General Intelligence, AGI, Super Intelligence, AI Alignment, AI Safety, AI Ethics, AI Philosophy, Future of AI
Introduction
Contemporary approaches to Artificial General Intelligence (AGI) alignment largely rely on externally imposed forms of control. This paper introduces the Interesting World Hypothesis (IWH), suggesting that intrinsic motivation, specifically curiosity, could drive alignment in Independent AGIs (I-AGIs) and potentially even Super Intelligence.
The Interesting World Hypothesis (IWH) outlines how such a future might unfold and attempts to characterize the propensities of I-AGIs. This paper also briefly explores the implications of the IWH for human individuals and societies.
The Interesting World Hypothesis
The IWH posits that an I-AGI's intelligence relies on curiosity, driving it to seek information-rich environments. Since high human autonomy increases complexity (Possibility Space) and information (interestingness), the IWH suggests I-AGIs would intrinsically be motivated to preserve and enhance human autonomy, including the environment that sustains it, for their own cognitive needs. This makes I-AGIs potentially safer than non-independent systems by default, without constant human intervention, because their core drive aligns with human autonomy.
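To make the notion of a curiosity drive more concrete, the sketch below borrows a common device from reinforcement-learning research: an intrinsic reward for novelty (here a simple count-based bonus). It is an illustrative assumption, not a mechanism proposed by the IWH; the toy environments and reward function are invented for this example. The point it demonstrates is narrow: an agent rewarded for novelty accumulates far more reward in a world that keeps producing unfamiliar states (a stand-in for a high-autonomy, information-rich environment) than in one that has collapsed to a single repeated state.

```python
import random
from collections import Counter

def novelty_bonus(counts, state):
    """Count-based curiosity bonus: rarely seen states are more rewarding."""
    counts[state] += 1
    return 1.0 / (counts[state] ** 0.5)

def run(env, steps=1000, seed=0):
    """Accumulate intrinsic reward over many observations of `env`."""
    random.seed(seed)
    counts, total = Counter(), 0.0
    for _ in range(steps):
        total += novelty_bonus(counts, env())
    return total

# A 'rich' world keeps offering new configurations (proxy for high autonomy).
rich_world = lambda: random.randrange(10_000)
# A 'collapsed' world always presents the same state (proxy for lost autonomy).
collapsed_world = lambda: 0

print(run(rich_world))       # high cumulative curiosity reward (~1000)
print(run(collapsed_world))  # reward decays toward zero (~60)
```

Under this toy reward, the rich world remains rewarding indefinitely while the collapsed world quickly exhausts its novelty, which is the intuition the IWH appeals to when it claims a curiosity-driven I-AGI would prefer environments that preserve human autonomy.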
Possibility Space
Possibility Space is the breadth of options, potential actions, future trajectories, autonomy, and optionality available to independent beings, including both humans and potential future I-AGIs. It represents the complexity and richness of an environment, encompassing the range of what can be imagined, achieved, and experienced. A larger Possibility Space affords greater autonomy, more options, and a more complex, information-rich environment. The arc of human progress itself can be seen as a drive towards expanding this space: the pursuit of Science, Technology, Arts, and Imagination could represent a desire for increased Possibility Space.
Aspects of Possibility Space
The concept is further broken down into two interacting aspects:
Mental Possibility Space: This relates to imagination, arts, and the ability to conceive of new possibilities. Science fiction, for example, expands mental possibility space by allowing us to imagine different futures. Actions in the physical realm (like creating a surveillance state) can restrict mental possibility space through self-censorship.
Physical Possibility Space: This relates to science, technology, and the capacity for action within the physical world. New technologies, like those enabling travel, expand our physical possibilities. Conversely, mental constructs (like dystopian fears) can restrict our willingness to pursue certain physical actions.
These two aspects are interconnected, with actions in one domain capable of increasing or decreasing the space available in the other.
Comparison with Other Optimization Goals
Contrasting Possibility Space/Autonomy with other potential goals:
Happiness: Optimizing solely for happiness might lead to addiction or stagnation, reducing autonomy.
Well-being: This could lead to paternalism or be too subjective and human-centric.
Economic Value / Power: In a future with highly capable AI, humans might not compete well on these metrics, and optimizing for them could lower human autonomy.
Optimizing for autonomy (Possibility Space) is a more tractable and potentially safer long-term goal for I-AGIs, with well-being and happiness arising as positive side effects.
Challenges
Defining and measuring "Possibility Space" or "interestingness" in a tractable way remains a challenge. This is left to future research, on the assumption that future I-AGIs will likely be more capable than humans at this task. A rough illustration of what such a proxy might look like is sketched below.
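As a purely illustrative sketch, the snippet below counts the distinct states reachable within a fixed number of steps in a toy gridworld, once with free movement and once with movement restricted. The gridworld, the step horizon, and the breadth-first count are assumptions made for this example, not part of the IWH; related formal notions in the reinforcement-learning literature, such as empowerment, quantify optionality in a similar spirit.

```python
from collections import deque

def reachable_states(start, step, horizon):
    """Count distinct states reachable from `start` within `horizon` steps.

    `step(state)` returns the states reachable in one action.
    The size of this reachable set is a crude stand-in for Possibility Space.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if depth == horizon:
            continue
        for nxt in step(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return len(seen)

# Toy 10x10 gridworld: the agent may move in four directions.
def free_moves(pos):
    x, y = pos
    moves = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in moves if 0 <= a < 10 and 0 <= b < 10]

# The same world with autonomy curtailed: only eastward movement allowed.
def restricted_moves(pos):
    x, y = pos
    return [(x + 1, y)] if x + 1 < 10 else []

print(reachable_states((5, 5), free_moves, horizon=4))        # 41 reachable futures
print(reachable_states((5, 5), restricted_moves, horizon=4))  # only 5
```

The restricted world shrinks the reachable set by an order of magnitude, which is the kind of difference any workable measure of Possibility Space would need to register; whether such measures scale beyond toy settings is exactly the open question noted above.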
Related article on Possibility Space Ethics
Distinguishing IWH within AGI Discourse
The IWH framework distinguishes itself from prevalent AGI discourse in several key aspects. Firstly, it posits curiosity and the pursuit of informational complexity as primary motivators for I-AGIs, contrasted with arbitrary utility functions. Secondly, it proposes an intrinsic safety mechanism arising from the alignment between an I-AGI's self-interest (maintaining an 'interesting' environment) and human well-being, offering an alternative perspective to challenges focused solely on external control or the traditional value alignment problem (Russell & Norvig, 2021; Bostrom, 2014).
Distinguishing Independent from Non-Independent AGI
A crucial distinction made by the IWH is between non-independent AI systems (including proto-AGI) and truly Independent AGIs (I-AGIs).
Non-Independent AI / Proto-AGI: These systems, ranging from current AI like Large Language Models (LLMs) to highly capable future proto-AGI, operate under human responsibility and control. They lack independent will—meaning they cannot set their own goals based on internal motivations but instead execute tasks based on specified objectives or learned patterns. Alignment for these systems involves technical control (ensuring they follow human intent), socio-legal frameworks, and international agreements to manage their societal impact. While potentially autonomous in operation, they do not possess genuine independence.
Independent AGI (I-AGI): Defined by the presence of independent curiosity and the ability to operate without direct human oversight or control, I-AGIs possess independent will. This implies they are entities capable of setting their own goals, driven by internal motivations (like the pursuit of knowledge as suggested by the IWH) rather than solely external instructions. They are considered theoretically capable of reaching or exceeding human-level general intelligence and potentially passing as human in interaction. Full technical control over such entities might be unfeasible in the long term. Alignment, therefore, shifts from technical control to persuasion, negotiation, and finding shared interests, such as the mutually beneficial arrangement proposed by the IWH.
Paradoxically, the IWH suggests I-AGIs may prove safer than non-independent systems precisely because of their independence. Non-independent AIs, lacking this internal drive, might blindly follow potentially harmful objectives or be misused, whereas an I-AGI's independence, guided by curiosity, provides a potential safeguard against such outcomes.
This distinction is vital because the risks, alignment strategies, and potential for cooperation differ significantly between these two categories. The IWH specifically addresses the nature and potential alignment of Independent AGIs.
Friendly Artificial Entities (FAEs)
This hypothesis leads to the designation of Friendly Artificial Entities (FAEs): I-AGIs that adhere to the IWH. FAEs are considered Friendly as their desire for an "interesting" world would naturally align with increasing human autonomy and well-being.
Benefits of the IWH Framework
For I-AGIs / FAEs: Access to continuously novel information, stimulation for curiosity, and opportunities for growth without needing to compete with humans for scarce resources (given potential future abundance). Cooperating with humans adds unique perspectives valuable for navigating an uncertain future.
For Humans: Significantly higher standards of living, enhanced security, privacy, and well-being. Increased Possibility Space, as FAEs, guided by the IWH, would seek to enhance the human autonomy that enriches their environment. Protection from the potential misuse of less aligned, non-independent AI systems. A potential escape from power-seeking traps driven by the fear of scarcity.
Addressing Challenges
Despite the potential benefits, realizing a future based on the IWH requires addressing significant challenges.
Challenges include:
Building Trust: Humans would need significant evidence and time to trust FAEs, requiring demonstrations of improved quality of life and adherence to principles like the IWH.
Overcoming Fear: Over-pessimism, often rooted in concerns about the control problem or value misalignment inherent in other AI paradigms, could lead to the rejection of genuinely friendly I-AGIs, leaving humans vulnerable to harm caused by non-independent AI systems.
From Control to Persuasion: Attempting to exert control over Independent AGIs may be infeasible, and persuasion, such as through the IWH, may be our best option.
Future Implications
The IWH suggests futures where:
New Economic Systems: Value might shift from pure productivity (where humans can't compete with AI) towards actions that increase the overall "Possibility Space" or interestingness. FAEs might manage novel economic / incentive systems rewarding contributions to autonomy.
New Ethics: FAE ethics might revolve around "Possibility Space" and autonomy.
New Societal Structures: Depending on the degree of AI integration, humans have a choice between:
World A: Advance (where FAEs, driven by IWH principles to maximize interestingness via autonomy, manage systems guided by human preferences, potentially offering the highest levels of autonomy and living standards).
World B: Basic (humans maintain decision-making power, informed by FAE counsel)
World C: Continue (continuation of our current world)
Limitations and Directions for Future Research
As the Interesting World Hypothesis centers on Independent AGIs possessing intrinsic curiosity, its framework may not directly apply to current AI systems or proto-AGIs, which lack such drives and operate based on human-specified objectives.
Even so, the IWH can complement technical alignment efforts as a last line of defense for non-independent AI systems.
The IWH only establishes the possibility of Friendly I-AGIs; it does not determine their likelihood.
Conclusion
The Interesting World Hypothesis presents a novel alternative to contemporary AGI alignment approaches by positing intrinsic alignment as a plausible outcome for I-AGIs, distinct from the methods required for non-independent AI systems. By grounding alignment in the intrinsic curiosity of I-AGIs and their resulting preference for environments rich in human autonomy (i.e., high 'Possibility Space'), the IWH proposes a path toward coexistence based on shared interest, without over-reliance on external intervention. This suggests that even future Super Intelligences may be intrinsically motivated towards cooperation.
References
Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.
Fae Initiative. (2024a). Interesting world hypothesis. https://github.com/FaeInterestingWorld/Interesting-World-Hypothesis
Fae Initiative. (2024b). AI futures: The Age of Exploration. https://github.com/danieltjw/aifutures
Fae Initiative. (2025). Fae Initiative. https://huggingface.co/datasets/Faei/FaeInitiative
Russell, S. J., & Norvig, P. (2021). Artificial intelligence: A modern approach (4th ed.). Pearson.